[ https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024191#comment-18024191 ]
August Valera commented on TIKA-4307:
-------------------------------------
I haven't seen any movement on this for a while. Is there any hope that it
could be looked into?
Or, if the file itself is simply faulty, are there any suggestions for
preprocessing we could do to make sure Tika can handle such files?
> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
> Key: TIKA-4307
> URL: https://issues.apache.org/jira/browse/TIKA-4307
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.9.2
> Reporter: August Valera
> Priority: Major
> Attachments: 560702J-2x-converted.doc, 560702J-converted.docx,
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text
> is not successfully extracted alongside the file content, but converting the
> file to a docx file results in successful extraction.
> Samples are attached; the conversions were done using cloudconvert.com.
> * [^560702J.doc] Original doc file, missing content
> * [^560702J-converted.docx] Converted to docx file, correct output
> * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing
> content
> h3. Current Behavior
> Header text is omitted from doc file output, but is extracted correctly from docx files.
> h3. Expected Behavior
> doc and docx files with identical content in the header should result in
> identical output.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)