[ 
https://issues.apache.org/jira/browse/TIKA-3560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17424178#comment-17424178
 ] 

Tim Allison commented on TIKA-3560:
-----------------------------------

I updated the metadata section in our wiki page "migrating to tika 2.x" today.  
I looked into subject, and it looks like we were putting "keywords" into 
subject in 1.x as well as into keywords.  We've kept that behavior in 2.x.  I'm 
not sure why there's an array in 2.x but not in 1.x.  Those should be the same. 

In 2.1.1-SNAPSHOT, I added empty checks for subject, keywords, title and other 
keys in the MSOffice parsers.  Those parsers used to allow an empty string for 
string-based metadata values. 
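
As a rough illustration of what an empty check of this kind looks like (a 
sketch only, not the actual Tika change: SimpleMetadata and addIfNotBlank are 
made-up names standing in for the real Metadata class and parser code, and the 
key names are plain placeholders): 

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EmptyCheckSketch {

    // Hypothetical stand-in for a multi-valued metadata store.
    // This is NOT org.apache.tika.metadata.Metadata.
    public static final class SimpleMetadata {
        private final Map<String, List<String>> values = new LinkedHashMap<>();

        public void add(String key, String value) {
            values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }

        public List<String> get(String key) {
            return values.getOrDefault(key, Collections.emptyList());
        }

        public boolean containsKey(String key) {
            return values.containsKey(key);
        }
    }

    // Hypothetical helper: skip null, empty, and whitespace-only values
    // instead of storing them as metadata.
    public static void addIfNotBlank(SimpleMetadata metadata, String key, String value) {
        if (value == null || value.trim().isEmpty()) {
            return;
        }
        metadata.add(key, value);
    }

    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();

        // An empty title is dropped rather than stored as "".
        addIfNotBlank(metadata, "title", "");

        // The same keywords value is written under both keys,
        // mirroring the behavior described for subject and keywords.
        String keywords = "tika, parsing";
        addIfNotBlank(metadata, "subject", keywords);
        addIfNotBlank(metadata, "keywords", keywords);

        System.out.println("has title: " + metadata.containsKey("title"));
        System.out.println("subject: " + metadata.get("subject"));
        System.out.println("keywords: " + metadata.get("keywords"));
    }
}
```

With a guard like this, a key whose value is blank simply does not appear in 
the output, so a client diffing 1.x against 2.x output would see that key 
missing rather than present with an empty string. 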

> Tika 2.0 (and 2.1) parses doc with less fidelity than using 1.24.1
> ------------------------------------------------------------------
>
>                 Key: TIKA-3560
>                 URL: https://issues.apache.org/jira/browse/TIKA-3560
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.0.0, 2.1.0
>         Environment: Windows 10
>            Reporter: Josh Burchard
>            Priority: Major
>         Attachments: Capture.jpg
>
>
> I'm parsing an old .doc file and I'm sending my request to the /rmeta/text 
> endpoint. I see that some metadata fields that were returned to me from Tika 
> 1.24.1 are no longer returned in 2.0 and above.
> I diffed the output between 1.24.1, 2.0 and 2.1.  Please see the screenshot 
> I've attached. 
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
