Re: Is there a way to process very large header-only mbox files efficiently with mime4j?

Tim Clotworthy Fri, 15 Sep 2023 12:27:42 -0700

________________________________
From: Benoit TELLIER <btell...@linagora.com>
Sent: Friday, September 15, 2023 3:02 AM
To: mime4j-dev@james.apache.org <mime4j-dev@james.apache.org>
Subject: Re: Is there a way to process very large header-only mbox files 
efficiently with mime4j?



Hello Tim.

Which version of MIME4J are you using?

How long is too long?

I think having a JMH micro-benchmark covering your use case might be relevant! 
I also think we have no benchmarks for mbox parsing...

Regarding your solution, wouldn't it be simpler to just ignore these 5 MB 
messages and keep parsing the mbox files nonetheless? (MimeConfig can limit the 
maximum header count.)
Or process these 5 MB messages as a body? (E.g. by prepending \r\n at the 
beginning.) If functionally acceptable, that would be easier than touching the 
heart of mime4j...
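
The second option (prepending \r\n) can be sketched with the JDK alone. This is a minimal illustration, not mime4j code: `BodyOnlyWrapper` and `prependBlankLine` are hypothetical names, and the assumption is that a parser reading the wrapped stream sees an empty header block and treats the rest as body.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of mime4j.
final class BodyOnlyWrapper {
    private static final byte[] CRLF = {'\r', '\n'};

    private BodyOnlyWrapper() {}

    /** Returns a stream that yields CRLF first, then the original bytes unchanged. */
    static InputStream prependBlankLine(InputStream original) {
        return new SequenceInputStream(new ByteArrayInputStream(CRLF), original);
    }

    public static void main(String[] args) throws IOException {
        byte[] headersOnly =
                "Subject: a\r\nX-Tag: b\r\n".getBytes(StandardCharsets.US_ASCII);
        try (InputStream in = prependBlankLine(new ByteArrayInputStream(headersOnly))) {
            byte[] all = in.readAllBytes();
            // Prints how many bytes the wrapper added in front of the message.
            System.out.println(all.length - headersOnly.length);
        }
    }
}
```

The wrapped stream would then be handed to the parser in place of the original one. Nothing is copied, so the cost is independent of message size.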

--


Best regards,



Benoit TELLIER



General manager of Linagora VIETNAM.

Product owner for Team-Mail product.

Chairman of the Apache James project.



Mail: btell...@linagora.com

Tel: (0033) 6 77 26 04 58 (WhatsApp, Signal)

On Sep 15, 2023 1:01 AM, Tim Clotworthy wrote:

Good afternoon,

I have a data parsing challenge related to our use of mime4j. We encounter mbox 
data that is unconventional in structure but that we are required to process 
nonetheless. The particular mbox files we are having issues with are very large 
(some over 5 MB) and are headers-only. Mime4j is likely parsing the files 
properly, but the time it takes is prohibitively long.

We use the MimeTokenStream parser. We don't believe this can be addressed in 
configuration (i.e. via MimeConfig).

An ideal situation would be to be able to specify that, if the number of 
headers processed exceeds maxHeaders, the parser stops, resets the stream 
pointer to the beginning of the input stream, and outputs everything as one 
giant header (or body?) or, probably more realistically, emits the output in 
chunks of whatever size is reasonable for IO parsing of this nature.
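
The "count, rewind, then chunk" idea can be sketched outside the parser with the JDK alone. This is a rough illustration under stated assumptions, not mime4j behavior: `HeaderPrescan`, `exceedsHeaderLimit`, and `forEachChunk` are hypothetical names, and physical lines are counted rather than unfolded header fields.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of mime4j.
final class HeaderPrescan {
    private HeaderPrescan() {}

    /**
     * Returns true if more than maxHeaders non-empty lines appear before the
     * first blank line. Scans at most readLimit bytes, then rewinds the stream
     * to where it started. Folded continuation lines are counted as lines
     * (a simplification).
     */
    static boolean exceedsHeaderLimit(BufferedInputStream in, int maxHeaders, int readLimit)
            throws IOException {
        in.mark(readLimit);
        boolean exceeded = false;
        int lines = 0, lineLength = 0, scanned = 0, b;
        while (scanned < readLimit && (b = in.read()) != -1) {
            scanned++;
            if (b == '\n') {
                if (lineLength == 0) break;          // blank line ends the header block
                lines++;
                lineLength = 0;
                if (lines > maxHeaders) { exceeded = true; break; }
            } else if (b != '\r') {
                lineLength++;
            }
        }
        in.reset();                                   // back to the marked position
        return exceeded;
    }

    /** Reads the stream in chunks of at most chunkSize bytes; returns the chunk count. */
    static int forEachChunk(InputStream in, int chunkSize, Consumer<byte[]> sink)
            throws IOException {
        byte[] buf = new byte[chunkSize];
        int chunks = 0, n;
        while ((n = in.readNBytes(buf, 0, chunkSize)) > 0) {
            sink.accept(Arrays.copyOf(buf, n));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "A: 1\r\nB: 2\r\nC: 3\r\n".getBytes(StandardCharsets.US_ASCII);
        BufferedInputStream in = new BufferedInputStream(new ByteArrayInputStream(data));
        if (exceedsHeaderLimit(in, 2, 1024)) {
            // Too many headers: bypass the MIME parser and emit raw chunks instead.
            forEachChunk(in, 8, chunk -> System.out.println(chunk.length + " bytes"));
        }
    }
}
```

A caller would run the pre-scan first and only hand the stream to the MIME parser when the limit is not exceeded. The readLimit bounds both the scan cost and the buffering needed for the rewind.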

Otherwise, I guess this calls for a custom coding solution? That would 
presumably involve a custom parser that extends or borrows from 
MimeStreamParser or MimeTokenStream, or both. For instance, below is the 
critical area of code from MimeStreamParser where we want to avoid getting 
stuck when processing these 5 MB header-only files.

Grateful for any response. Thanks!


while (true) {
    EntityState state = this.mimeTokenStream.getState();
    switch (state) {
        case T_BODY:
            BodyDescriptor desc = this.mimeTokenStream.getBodyDescriptor();
            InputStream bodyContent;
            if (this.contentDecoding) {
                bodyContent = this.mimeTokenStream.getDecodedInputStream();
            } else {
                bodyContent = this.mimeTokenStream.getInputStream();
            }