[jira] [Commented] (TIKA-2910) Text extraction using Tika command line and Tika server differs

Tim Allison (JIRA) Mon, 12 Aug 2019 10:20:56 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16905402#comment-16905402
 ]


Tim Allison commented on TIKA-2910:
-----------------------------------

No need to wait for me on this one...

The boilerpipe handler isn't relevant here.  What is relevant is that we've 
hardcoded the HTMLParser to handle XML in tika-server, but we don't do that in 
tika-app.

On TIKA-2551, we fixed that in the master branch (Tika 2.0), but we didn't fix 
it in {{branch_1x}} because it would be a change in behavior.

If fellow devs are willing to make this breaking change in the 1.x branch, we 
can do that for 1.23.  Any objections to making this change in 1.x?
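
For anyone who wants to see the discrepancy on their own file before weighing in, here is a minimal sketch that runs the same document through tika-app and tika-server and diffs the two plain-text results. The jar path, the server URL, and all function names are assumptions for illustration, not part of Tika; the only Tika-specific pieces are the {{--text}} option of tika-app and a PUT to the server's {{/tika}} endpoint with {{Accept: text/plain}}.

```python
import difflib
import subprocess
import urllib.request

# Assumptions: adjust both to match the local setup.
TIKA_APP_JAR = "tika-app-1.21.jar"
TIKA_SERVER_URL = "http://localhost:9998/tika"


def extract_with_app(path):
    """Run tika-app with --text and return the extracted plain text."""
    result = subprocess.run(
        ["java", "-jar", TIKA_APP_JAR, "--text", path],
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def extract_with_server(path):
    """PUT the file to a running tika-server and ask for plain text back."""
    with open(path, "rb") as fh:
        request = urllib.request.Request(
            TIKA_SERVER_URL,
            data=fh.read(),
            method="PUT",
            headers={"Accept": "text/plain"},
        )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")


def diff_extractions(app_text, server_text):
    """Return unified-diff lines between the two outputs.

    Blank lines and surrounding whitespace are dropped first, so only
    text that is actually missing or different shows up in the diff.
    """
    def normalize(text):
        return [line.strip() for line in text.splitlines() if line.strip()]

    return list(
        difflib.unified_diff(
            normalize(app_text),
            normalize(server_text),
            fromfile="tika-app",
            tofile="tika-server",
            lineterm="",
        )
    )
```

With a server already running, something like {{print("\n".join(diff_extractions(extract_with_app(f), extract_with_server(f))))}} should list, as {{-}} lines, the text that tika-app extracts but tika-server drops. An empty result means the two outputs agree once blank lines are ignored.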

> Text extraction using Tika command line and Tika server differs
> ---------------------------------------------------------------
>
>                 Key: TIKA-2910
>                 URL: https://issues.apache.org/jira/browse/TIKA-2910
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.21
>            Reporter: Walter
>            Priority: Major
>              Labels: newbie
>         Attachments: CorpusP_25471990.xml
>
>
> When extracting plain text from the same XML file with the Tika command-line 
> utility and with Tika in server mode, the results differ.
> It looks as if PCDATA in more deeply nested XML structures is simply ignored 
> and only an empty line is returned.
> I assume both use the same code base. Are there any default settings that may 
> differ or that can be set?
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
