[ 
https://issues.apache.org/jira/browse/TIKA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15001773#comment-15001773
 ] 

Vjeran Marcinko commented on TIKA-1788:
---------------------------------------

I don't know the James library at all, so I can't say whether this would 
negatively affect some other part of the parser, but...

The thing is that Tika's current RFC822Parser indirectly ends up using James' 
BasicBodyDescriptor instead of MaximalBodyDescriptor, and this is due to the 
way RFC822Parser instantiates James' MimeStreamParser internally. The 
instantiation could instead specify a DefaultBodyDescriptorBuilder:
{code}
MimeStreamParser parser = new MimeStreamParser(
        config, null, new DefaultBodyDescriptorBuilder());
{code}
This way, during James' parsing, a MaximalBodyDescriptor would be created, 
which recognizes the Content-Disposition field, and it could be used in Tika's 
MailContentHandler, say in the body(...) method, if we add:
{code}
    public void body(BodyDescriptor body, InputStream is) throws MimeException,
            IOException {
        // use a different metadata object
        // in order to specify the mime type of the
        // sub part without damaging the main metadata

        Metadata submd = new Metadata();
        submd.set(Metadata.CONTENT_TYPE, body.getMimeType());
        submd.set(Metadata.CONTENT_ENCODING, body.getCharset());
        
        if (body instanceof MaximalBodyDescriptor) {
            MaximalBodyDescriptor maximalBodyDescriptor =
                    (MaximalBodyDescriptor) body;
            String contentDispositionFilename =
                    maximalBodyDescriptor.getContentDispositionFilename();
            if (contentDispositionFilename != null) {
                submd.set(Metadata.RESOURCE_NAME_KEY,
                        contentDispositionFilename);
            }
            }
        }
...
{code}
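
One more note: the header quoted in the issue description below uses the 
extended parameter form (filename*=charset'language'value, as in RFC 2231). 
Whether James' MaximalBodyDescriptor decodes that form by itself is not 
verified here. As a rough, JDK-only sketch of what the decoding involves 
(ExtendedFilenameSketch and decodeExtendedValue are hypothetical names, not 
part of Tika or James, and the UTF-8 fallback is an assumption):

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration only; not part of Tika or James.
class ExtendedFilenameSketch {

    // Decodes an extended parameter value of the form charset'language'value.
    static String decodeExtendedValue(String value) {
        int first = value.indexOf('\'');
        int second = (first < 0) ? -1 : value.indexOf('\'', first + 1);
        if (second < 0) {
            return value; // not in extended form, return unchanged
        }
        // Assumption: fall back to UTF-8 when the charset part is empty.
        String charset = (first == 0) ? "UTF-8" : value.substring(0, first);
        // URLDecoder would turn '+' into a space, so protect literal plus signs.
        String encoded = value.substring(second + 1).replace("+", "%2B");
        try {
            return URLDecoder.decode(encoded, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charset, e);
        }
    }

    public static void main(String[] args) {
        // Value taken from the issue description; prints image001.jpg
        System.out.println(decodeExtendedValue("utf-8''image001.jpg"));
    }
}
```

If getContentDispositionFilename() turns out to return the raw, still-encoded 
value for such headers, something along these lines could be applied before 
setting Metadata.RESOURCE_NAME_KEY.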

> message/rfc822 parser doesn't identify attachment filenames from 
> Content-Disposition header
> -------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1788
>                 URL: https://issues.apache.org/jira/browse/TIKA-1788
>             Project: Tika
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Sergey Tsalkov
>         Attachments: grep_content_disposition.zip
>
>
> rfc822 email files can contain attachments as subparts, and they'll
> generally specify the filename of the attachment in a manner like
> this:
> Content-Disposition: attachment;
>         filename*=utf-8''image001.jpg
> Tika doesn't seem to be grabbing that information at all!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
