[ 
https://issues.apache.org/jira/browse/TIKA-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12915708#action_12915708
 ] 

Julien Nioche commented on TIKA-461:
------------------------------------

Nick, 

Thanks for taking the time to review my patch. 

bq. It'd probably be good to see some more tests with it. For now, just 
checking your basic message should be fine, but I'd suggest we also try to get 
an email with plain text, html, images and similar in to check the more complex 
bits.

Agreed

bq. In terms of the nested parser, I'm tempted to say we do something so that 
plain text comes out without any extra work needed. Anything else gets handled 
via a Parser fetched from the ParseContext if required, much as we're doing for 
container formats like zip, .docx etc. That way, you can throw a simple email 
at it and get the text, but the rest of the parts are available if you want them

I hadn't noticed that you've added org.apache.tika.extractor; it seems an 
elegant way of doing it. I will have a closer look and see how I can leverage 
it in RFC822Parser.
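
To make sure I understand the suggestion, here is a rough sketch of the 
behaviour as I read it. All types below (Context, PartParser, Part, 
MessageSketch) are hypothetical stand-ins written for illustration, not the 
Tika or mime4j API. The assumption is simply that plain text parts are always 
emitted, and that any other part is handed to a delegate parser only if the 
caller has put one in the context.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser that handles non-plain-text parts.
interface PartParser {
    String parse(String mimeType, String body);
}

// Hypothetical stand-in for a context object keyed by class.
final class Context {
    private final Map<Class<?>, Object> entries = new HashMap<>();

    <T> void set(Class<T> key, T value) {
        entries.put(key, value);
    }

    // Returns null when nothing was registered for the key.
    <T> T get(Class<T> key) {
        return key.cast(entries.get(key));
    }
}

// Hypothetical representation of one body part of a message.
final class Part {
    final String mimeType;
    final String body;

    Part(String mimeType, String body) {
        this.mimeType = mimeType;
        this.body = body;
    }
}

final class MessageSketch {
    // Plain text is always extracted; other parts are delegated only when
    // the caller supplied a PartParser in the context, otherwise skipped.
    static String extractText(List<Part> parts, Context context) {
        StringBuilder out = new StringBuilder();
        PartParser delegate = context.get(PartParser.class); // may be null
        for (Part part : parts) {
            if (part.mimeType.equals("text/plain")) {
                out.append(part.body).append('\n');
            } else if (delegate != null) {
                out.append(delegate.parse(part.mimeType, part.body)).append('\n');
            }
        }
        return out.toString();
    }
}
```

With an empty context, a simple email yields only its plain text. Registering 
a delegate makes the remaining parts available without changing the caller's 
code otherwise.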

bq.  Also, the james jars need to be listed in the tika bundle pom so they get 
properly included 

Ok, did not know about that. Thanks

> RFC822 messages not parsed
> --------------------------
>
>                 Key: TIKA-461
>                 URL: https://issues.apache.org/jira/browse/TIKA-461
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 0.7
>            Reporter: Joshua Turner
>            Assignee: Julien Nioche
>         Attachments: TIKA-461.patch
>
>
> Presented with an RFC822 message exported from Thunderbird, AutoDetectParser 
> produces an empty body, and a Metadata containing only one key-value pair: 
> "Content-Type=message/rfc822". Directly calling MboxParser likewise gives an 
> empty body, but with two metadata pairs: "Content-Encoding=us-ascii 
> Content-Type=application/mbox".
> A quick peek at the source of MboxParser shows that the implementation is 
> pretty naive. If the wiring can be sorted out, something like Apache James' 
> mime4j might be a better bet.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
