Hey guys,

I want to learn Hadoop and use my Gmail mbox file as a basis for that. That brought me to Mime4j, and for the most part this is working out great. Thank you, btw.

But it took me a while to get my hands on a version of it, so I am wondering whether this project is still active. The download link [1] is broken. Is this the official download link?

After some time I found apache-mime4j-*-0.8.0-SNAPSHOT.jars. But that is a higher version number than the [official|outdated] 0.7.2, and it also seems different from what I can see in the trunk. At least it seemed to me that DefaultMessageBuilder is not part of the trunk, but is in 0.8.0. Or maybe I am getting it all wrong, no idea. I tried to look at the sources, but it seems I would then also need to learn Maven (as it needs to generate source), and I would rather spend my time on Hadoop.

Anyway, I ran into two issues, see subject.

It seems that Mime4j doesn't like long lines in the input, and so the parsing fails. Here is such a long line that looks legit to me. At least it is not a corner case, as it is created by Google Groups.

References: <[email protected]> < [email protected]> < [email protected]> <[email protected]> < [email protected]> < [email protected]> < [email protected]> < calqcipbuu+cp2memlx_7qs-hqmitq9rem2kjhgdbp4yesu+...@mail.gmail.com> < [email protected]> < 6ae3238c-3b54-488d-bb80-51e35afdb...@hd10g2000pbc.googlegroups.com> < 8928045d-68fa-4bfc-98f1-7fddc24f7...@kt16g2000pbb.googlegroups.com> < [email protected]> < [email protected]> <CALQcipYW4Qcwu5b1DhpZtXUHtCsVfjygPYosx7Sgt=fvb-3...@mail.gmail.com> < [email protected]> < [email protected]> < [email protected]> <413a [email protected]> < [email protected]> < [email protected]> < [email protected]> < [email protected]>

Here is an excerpt from the stacktrace:

Caused by: org.apache.james.mime4j.io.MaxLineLimitException: Maximum line length limit exceeded
    at org.apache.james.mime4j.io.BufferedLineReaderInputStream.readLine(BufferedLineReaderInputStream.java:218)
    at org.apache.james.mime4j.io.LineReaderInputStreamAdaptor.readLine(LineReaderInputStreamAdaptor.java:78)
    at org.apache.james.mime4j.stream.MimeEntity.readRawField(MimeEntity.java:215)

I attached the full offending message. At the end of the message you'll find the stacktrace.

But to put things in perspective: I processed my mails from the last ten years (100k+) and it only had issues with a few hundred. So it's not a biggie, but I wanted to give you this feedback. And I can provide you with more examples if you need those.

Furthermore, I ran into an NPE as well.

java.lang.NullPointerException
    at org.apache.james.mime4j.io.MimeBoundaryInputStream.<init>(MimeBoundaryInputStream.java:67)
    at org.apache.james.mime4j.stream.MimeEntity.createMimePartStream(MimeEntity.java:366)
    at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:320)
    at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:368)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176)
    at org.apache.james.mime4j.message.DefaultMessageBuilder.parseMessage(DefaultMessageBuilder.java:316)
    at com.mboxanalytics.util.MboxUtil.parseMessage(MboxUtil.java:95)
    ...

UnsupportedEncodingException and IllegalCharsetNameException showed up as well, but those are probably correct.
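
In case it helps to see why I think the charset errors are legitimate: the split between "illegal name" and "unsupported" can be reproduced with plain java.nio, without Mime4j. This is only a rough sketch; the class and method names (CharsetCheck, classify) are made up for illustration, and how Mime4j itself maps these cases to its exceptions is not shown here.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper, not part of Mime4j: classifies a charset label
// taken from a Content-Type header using only the JDK.
class CharsetCheck {

    /** Returns "supported", "unsupported" or "illegal" for the given label. */
    static String classify(String name) {
        try {
            // isSupported returns false for a syntactically legal but unknown name
            return Charset.isSupported(name) ? "supported" : "unsupported";
        } catch (IllegalCharsetNameException e) {
            // thrown when the label contains characters such as quotes or spaces
            return "illegal";
        }
    }

    public static void main(String[] args) {
        String[] labels = {
            "UTF-8", "en.UTF-8", "x-utf8utf8", "\"us-ascii\"", "iso-8859-1 Content"
        };
        for (String label : labels) {
            System.out.println(label + " -> " + classify(label));
        }
    }
}
```

Labels with stray quotes or spaces land in the "illegal" bucket, while well-formed but unknown labels land in "unsupported", which lines up with the two exception types I am seeing.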

Full list:

RecordLevelErrors
    java.io.UnsupportedEncodingException/ISO-8859-8-I=1
    java.io.UnsupportedEncodingException/ansi_x3.110-1983=1
    java.io.UnsupportedEncodingException/csn_369103=1
    java.io.UnsupportedEncodingException/en.UTF-8=3
    java.io.UnsupportedEncodingException/iso-88592=4
    java.io.UnsupportedEncodingException/unicode-1-1-utf-7=3
    java.io.UnsupportedEncodingException/unknown-8bit=1
    java.io.UnsupportedEncodingException/us-as=1
    java.io.UnsupportedEncodingException/windows-1252http-equivCont=1
    java.io.UnsupportedEncodingException/x-utf8utf8=1
    java.lang.NullPointerException/null=1
    java.nio.charset.IllegalCharsetNameException/"us-ascii"=2
    java.nio.charset.IllegalCharsetNameException/iso-8859-1 Content=1
    java.nio.charset.IllegalCharsetNameException/windows-1252 chars=1
    org.apache.james.mime4j.MimeIOException/Maximum header length l=13
    org.apache.james.mime4j.MimeIOException/org.apache.james.mime4j=228
Success
    No of message successfully processed=129462
    No of words processed=64267035

Hope this is helpful, and thanks again for providing this library.

Best,
Mariano

[1] http://james.apache.org/download.cgi#Apache_Mime4J
