[jira] [Commented] (TIKA-3969) Inconsistent behavior on EPUB text extraction between tika-server and tika-app

Johan van der Knijff (Jira) Thu, 09 Feb 2023 07:38:07 -0800


    [ https://issues.apache.org/jira/browse/TIKA-3969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17686602#comment-17686602 ]


Johan van der Knijff commented on TIKA-3969:
--------------------------------------------

I'm not sure I have anything to recommend. To me as a user, the inclusion of 
the image refs is unexpected, and it could skew any analyses done on those 
texts. E.g. imagine a researcher who uses Tika to analyse the emergence of 
certain words or phrases through time, using EPUB versions of 19th century 
books. Any alt-text descriptions in such materials would most likely be 
contemporary, and as such they would "pollute" the original "signal" (19th 
century text) with modern language. But perhaps there are valid use cases for 
including them as well?
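
Until this is settled, a user who wants the tika-server output without the 
image refs could filter them out afterwards. Below is a rough sketch in Python, 
not a Tika feature: the function name strip_image_refs and the pattern are my 
own assumptions, based only on the "[image: ...]" lines shown in the issue 
below. It drops lines that consist solely of such a placeholder, so it would 
also drop a genuine line of book text that happens to look exactly like one.

```python
import re

# Assumption: placeholders look like "[image: cover]" and sit alone on a line,
# as in the tika-server output quoted in this issue. Other shapes are not handled.
IMAGE_REF = re.compile(r"^\s*\[image:[^\]]*\]\s*$")


def strip_image_refs(text: str) -> str:
    """Return text without the lines that contain only an image placeholder."""
    kept = [line for line in text.splitlines() if not IMAGE_REF.match(line)]
    return "\n".join(kept)


# Sample taken from the first page of the example EPUB in the issue.
sample = (
    "[image: cover]\n"
    "Aster Berkhof\n"
    "Veel geluk, professor!\n"
    "[image: DBNL]\n"
    "Colofon"
)

print(strip_image_refs(sample))
```

Lines that merely contain a placeholder next to other text are left untouched, 
since it is not clear from the output alone whether those would be safe to edit.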

> Inconsistent behavior on EPUB text extraction between tika-server and tika-app
> ------------------------------------------------------------------------------
>
>                 Key: TIKA-3969
>                 URL: https://issues.apache.org/jira/browse/TIKA-3969
>             Project: Tika
>          Issue Type: Improvement
>          Components: tika-app, tika-server
>    Affects Versions: 2.6.0
>         Environment: I’m using a -smart toaster- PC running Linux Mint 20.1 
> (Ulyssa), MATE edition.
>            Reporter: Johan van der Knijff
>            Priority: Minor
>
> While doing some tests with Tika for text extraction from EPUB, I ran into 
> what looks like an inconsistency between the behavior of Tika-app and 
> Tika-server.
> I’m using the EPUB file below as an example:
> [https://www.dbnl.org/tekst/berk011veel01_01/ebook/berk011veel01_01.epub]
> Using the tika-app JAR, I can extract the text from this EPUB using this 
> command:
> {code:java}
> java -jar ~/tika/tika-app-2.6.0.jar -t berk011veel01_01.epub > berk011veel01_01_tika-app.txt
> {code}
> Output in this case is as expected.
> I then tried the same using the server. After firing up the server, I used this:
> {code:java}
> curl -T berk011veel01_01.epub http://localhost:9998/tika --header "Accept: text/plain" > berk011veel01_01-tika-server.txt
> {code}
> In this case, Tika’s output contains elements (between square brackets) with 
> alt-text descriptions of images. For example (from the first page of this 
> book):
>  
> {noformat}
> [image: cover]
> Aster Berkhof
> Veel geluk, professor!
> [image: DBNL]
> Colofon
> {noformat}
>  
> These image references and alt-text descriptions don't appear in the tika-app 
> output! I'm not sure if this is the intended behavior, or perhaps I’m doing 
> something wrong myself, or I’m missing some obvious option?
> See also this related Tika-python issue (I initially thought this was a 
> Tika-python problem, which on closer inspection it isn't):
> [https://github.com/chrismattmann/tika-python/issues/389]
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
