Hey guys.

I want to learn Hadoop and use my gmail mbox file as a basis for that.
That brought me to mime4j and for the most part this is working out great.
Thank you, btw.

But it took me a while to get my hands on a version of it, so I am
wondering whether this project is still active. The download link [1] is
broken. Is this the official download link?

After some time I found apache-mime4j-*-0.8.0-SNAPSHOT.jars. But that is a
higher version number than the [official|outdated] 0.7.2, and it also seems
different from what I can see in the trunk. At least it seemed to me that
DefaultMessageBuilder is not part of the trunk, but is in 0.8.0.

Or maybe I am getting it all wrong, no idea. I tried to look at the sources,
but it seems I would then also need to learn Maven, as it needs to generate
source, and I would rather spend my time on Hadoop.

Anyway, I ran into two issues, see subject.

It seems that Mime4j doesn't like long lines in the input and so the
parsing fails.

Here is such a long line that looks legit to me. At least it is not a
corner case as it is created by Google Groups.

References: <[email protected]> <
[email protected]> <
[email protected]>
<[email protected]> <
[email protected]> <
[email protected]> <
[email protected]> <
calqcipbuu+cp2memlx_7qs-hqmitq9rem2kjhgdbp4yesu+...@mail.gmail.com> <
[email protected]> <
6ae3238c-3b54-488d-bb80-51e35afdb...@hd10g2000pbc.googlegroups.com> <
8928045d-68fa-4bfc-98f1-7fddc24f7...@kt16g2000pbb.googlegroups.com> <
[email protected]> <
[email protected]>
<CALQcipYW4Qcwu5b1DhpZtXUHtCsVfjygPYosx7Sgt=fvb-3...@mail.gmail.com> <
[email protected]> <
[email protected]> <
[email protected]> <413a
 [email protected]> <
[email protected]> <
[email protected]> <
[email protected]> <
[email protected]>

Here is an excerpt from the stacktrace:
Caused by: org.apache.james.mime4j.io.MaxLineLimitException: Maximum line length limit exceeded
    at org.apache.james.mime4j.io.BufferedLineReaderInputStream.readLine(BufferedLineReaderInputStream.java:218)
    at org.apache.james.mime4j.io.LineReaderInputStreamAdaptor.readLine(LineReaderInputStreamAdaptor.java:78)
    at org.apache.james.mime4j.stream.MimeEntity.readRawField(MimeEntity.java:215)


I attached the full offending message. At the end of the message you'll
find the stacktrace.
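
In case it helps with reproducing this, a pre-check along the following lines could flag messages with very long physical lines before they reach the parser. This is only a minimal sketch using the plain JDK, not Mime4j code: the class name LongLineScanner, the method findLongLines, and the 1000-character threshold are assumptions made for illustration. It also counts characters rather than bytes, which should only matter for non-ASCII input.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: reports which lines of a message exceed a length limit. */
public final class LongLineScanner {

    // Assumed threshold for illustration only; not taken from the Mime4j sources.
    static final int ASSUMED_LIMIT = 1000;

    /** Returns the 1-based numbers of all lines longer than {@code limit} characters. */
    static List<Integer> findLongLines(Reader in, int limit) {
        List<Integer> hits = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(in)) {
            String line;
            int lineNo = 0;
            // readLine() strips the line terminator, so only the content is measured.
            while ((line = reader.readLine()) != null) {
                lineNo++;
                if (line.length() > limit) {
                    hits.add(lineNo);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hits;
    }

    public static void main(String[] args) {
        // Build a made-up message whose second header line is far too long.
        StringBuilder longValue = new StringBuilder();
        for (int i = 0; i < 1200; i++) {
            longValue.append('x');
        }
        String sample = "Subject: ok\r\nReferences: " + longValue + "\r\nFrom: someone\r\n";
        System.out.println(findLongLines(new StringReader(sample), ASSUMED_LIMIT));
    }
}
```

Running something like this over the few hundred failing messages would show whether they all fail on a single overlong header line or on something else.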

But to put things in perspective: I processed my mails from the last ten
years (100k+) and it only had issues with a few hundred. So it's not a
biggie, but I wanted to give you this feedback. I can provide more
examples if you need them.

Furthermore, I ran into an NPE as well.

java.lang.NullPointerException
    at org.apache.james.mime4j.io.MimeBoundaryInputStream.<init>(MimeBoundaryInputStream.java:67)
    at org.apache.james.mime4j.stream.MimeEntity.createMimePartStream(MimeEntity.java:366)
    at org.apache.james.mime4j.stream.MimeEntity.advance(MimeEntity.java:320)
    at org.apache.james.mime4j.stream.MimeTokenStream.next(MimeTokenStream.java:368)
    at org.apache.james.mime4j.parser.MimeStreamParser.parse(MimeStreamParser.java:176)
    at org.apache.james.mime4j.message.DefaultMessageBuilder.parseMessage(DefaultMessageBuilder.java:316)
    at com.mboxanalytics.util.MboxUtil.parseMessage(MboxUtil.java:95)
    ...


I also got UnsupportedEncodingException and IllegalCharsetNameException,
but those are probably correct. Full list:

RecordLevelErrors
 java.io.UnsupportedEncodingException/ISO-8859-8-I=1
 java.io.UnsupportedEncodingException/ansi_x3.110-1983=1
 java.io.UnsupportedEncodingException/csn_369103=1
 java.io.UnsupportedEncodingException/en.UTF-8=3
 java.io.UnsupportedEncodingException/iso-88592=4
 java.io.UnsupportedEncodingException/unicode-1-1-utf-7=3
 java.io.UnsupportedEncodingException/unknown-8bit=1
 java.io.UnsupportedEncodingException/us-as=1
 java.io.UnsupportedEncodingException/windows-1252http-equivCont=1
 java.io.UnsupportedEncodingException/x-utf8utf8=1
 java.lang.NullPointerException/null=1
 java.nio.charset.IllegalCharsetNameException/\"us-ascii\"=2
 java.nio.charset.IllegalCharsetNameException/iso-8859-1 Content=1
 java.nio.charset.IllegalCharsetNameException/windows-1252 chars=1
 org.apache.james.mime4j.MimeIOException/Maximum header length l=13
 org.apache.james.mime4j.MimeIOException/org.apache.james.mime4j=228

Success
 No of message successfully processed=129462
 No of words processed=64267035
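
For the charset errors, something like the following on the caller's side might be enough to get past most of them. It is a rough sketch using only the JDK's java.nio.charset API, not anything from Mime4j: the class name LenientCharsets, the method resolve, and the cleanup rules (strip quotes and backslashes, keep the first whitespace-separated token, otherwise fall back) are my own assumptions, picked from the labels in the list above.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper: maps sloppy charset labels to a Charset, with a fallback. */
public final class LenientCharsets {

    /** Returns the charset named by {@code label} after light cleanup, else {@code fallback}. */
    static Charset resolve(String label, Charset fallback) {
        if (label == null) {
            return fallback;
        }
        // Remove stray quotes and backslashes, e.g. \"us-ascii\" becomes us-ascii.
        String cleaned = label.replace("\"", "").replace("\\", "").trim();
        // Keep only the first token, e.g. "iso-8859-1 Content" becomes "iso-8859-1".
        cleaned = cleaned.split("\\s+")[0];
        if (cleaned.isEmpty()) {
            return fallback;
        }
        try {
            return Charset.forName(cleaned);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Name is syntactically illegal or unknown to this JVM.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset fallback = StandardCharsets.UTF_8;
        String[] labels = {"\\\"us-ascii\\\"", "iso-8859-1 Content", "x-utf8utf8", "en.UTF-8"};
        for (String label : labels) {
            System.out.println(label + " -> " + resolve(label, fallback));
        }
    }
}
```

The obvious trade-off is that a fallback can silently decode text with the wrong charset, so for anything beyond word counting it may be better to keep counting these as errors.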

Hope this is helpful and thanks again for providing this library.

Best,
Mariano

[1] http://james.apache.org/download.cgi#Apache_Mime4J
