[
https://issues.apache.org/jira/browse/TIKA-3768?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17551871#comment-17551871
]
Sam Stephens commented on TIKA-3768:
------------------------------------
Ah, interesting, this is a case of me misunderstanding the product then.
This means that in order to actually get all the text possible out of a file, I
need to examine both the actual text and the metadata (I'm using this for
building a search over documents of many types).
The challenge then is that some fields in the metadata object are sourced from
text in the document (such as {{dc:subject}} and {{Message-From}}) and
should be searchable, while others are not (such as {{Content-Type}} and
{{X-TIKA:Parsed-By}}) and should not be searchable.
Is there any documentation of the set of possible metadata fields? The
constants inherited by
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html]
don't appear to be a complete set, as I don't see {{dc:subject}} amongst them.
It looks to me like I could strip out fields like {{Content-Type}} as listed in
[https://tika.apache.org/2.4.0/api/org/apache/tika/metadata/Metadata.html] and
any fields with names prefixed by {{X-TIKA:}}, and all remaining fields
would be sourced from document text.
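
A minimal sketch of that filtering, to make the idea concrete. It assumes the metadata has already been copied into a plain map of field name to values (for example by looping over {{Metadata.names()}} and {{Metadata.getValues(name)}}). The class name, the method names and the exact deny-list are hypothetical, and the list is surely incomplete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not part of Tika: keeps only metadata values that
// are assumed to come from document text.
final class SearchableMetadataFilter {

    // Assumption: fields that describe the file rather than its text.
    // This deny-list is illustrative and would need extending in practice.
    private static final Set<String> EXCLUDED_NAMES =
            Set.of("Content-Type", "Content-Length", "Content-Encoding");

    // Assumption: everything under this prefix is Tika bookkeeping.
    private static final String EXCLUDED_PREFIX = "X-TIKA:";

    static boolean isSearchable(String name) {
        return !name.startsWith(EXCLUDED_PREFIX) && !EXCLUDED_NAMES.contains(name);
    }

    // Returns the values of all fields that pass the filter, in map iteration order.
    static List<String> searchableValues(Map<String, List<String>> metadata) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : metadata.entrySet()) {
            if (isSearchable(entry.getKey())) {
                out.addAll(entry.getValue());
            }
        }
        return out;
    }
}
```

The returned values would then be indexed alongside the extracted body text. Whether a deny-list like this is safe depends on the answer to the question above about the full set of possible fields.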
> message/rfc822 does not include Headers in extracted text
> ---------------------------------------------------------
>
> Key: TIKA-3768
> URL: https://issues.apache.org/jira/browse/TIKA-3768
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.4.0
> Reporter: Sam Stephens
> Priority: Major
> Attachments: email.txt
>
>
> When running AutoDetectParser on message/rfc822 structured text documents,
> such as the attached [^email.txt], the extracted text does not include any of
> the headers, such as the Subject and From and To lines.
> However these lines contain useful text I'd like to be able to extract. I'm
> surprised it's not there based on the include everything bias I saw on
> https://issues.apache.org/jira/browse/TIKA-3710.
> Interestingly, if I exclude org.apache.tika.parser.mail.RFC822Parser as a
> parser, my debugging appears to show
> org.apache.tika.parser.csv.TextAndCSVParser being used for parsing, and we
> get the full text, but the returned content type is 'message/rfc822;
> charset=windows-1252'.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)