[jira] [Commented] (TIKA-4307) Text in header not extracted for Microsoft Word doc file

August Valera (Jira) Wed, 01 Oct 2025 18:45:14 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-4307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18024191#comment-18024191
 ]


August Valera commented on TIKA-4307:
-------------------------------------

Haven't seen any movement on this for a bit. Is there any hope that this could 
be looked into?

Or, if it is just a faulty file, are there any suggestions for preprocessing we 
could do to make sure Tika can accept such files?

> Text in header not extracted for Microsoft Word doc file
> --------------------------------------------------------
>
>                 Key: TIKA-4307
>                 URL: https://issues.apache.org/jira/browse/TIKA-4307
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.9.2
>            Reporter: August Valera
>            Priority: Major
>         Attachments: 560702J-2x-converted.doc, 560702J-converted.docx, 
> 560702J-full-output.txt, 560702J.doc, screenshot-1.png
>
>
> We have a Microsoft Word doc file with text in the header. That header text 
> is not successfully extracted alongside the file content, but converting the 
> file to a docx file results in successful extraction.
> Samples are attached; the conversions were done using cloudconvert.com.
>  * [^560702J.doc] Original doc file, missing content
>  * [^560702J-converted.docx] Converted to docx file, correct output
>  * [^560702J-2x-converted.doc] Docx file converted back to doc, again missing 
> content
> h3. Current Behavior
> doc files omit header text. docx files extract header text correctly.
> h3. Expected Behavior
> doc and docx files with identical header content should produce identical 
> output.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
