Re: Wrong language detection in tika server 1.22

Tim Allison Thu, 05 Dec 2019 07:45:03 -0800

Looking at the source code for this (for the first time?)... it looks
like that endpoint expects UTF-8 text. It does not parse the file and then
run language id on the parsed text, so the PDF's raw bytes were likely
treated as text, which would explain the 'th' result.
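
As a possible workaround, here is a minimal sketch that extracts the text
first and only then runs language detection on it. It assumes the same
tika-server from the thread at http://localhost:9998, and that its /tika
endpoint returns extracted plain text when asked with "Accept: text/plain".
The function name detect_language_of_file is hypothetical, purely for
illustration, and not part of Tika.

```shell
# Sketch: extract plain text with tika-server, then detect its language.
# Assumes a tika-server listening at TIKA_URL (default taken from the thread).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# detect_language_of_file is a hypothetical helper name, not part of Tika.
detect_language_of_file() {
  # Step 1: PUT the raw file to /tika and ask for text/plain output.
  # Step 2: pipe that text (read from stdin via @-) to /language/stream.
  curl -s -X PUT --data-binary @"$1" -H "Accept: text/plain" "$TIKA_URL/tika" |
    curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}
```

Calling it as `detect_language_of_file BOE-A-2019-9455.pdf` should print the
language code detected for the extracted text rather than for the raw PDF
bytes.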


On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:

> Hi all,
>
> Since this is my first email, allow me to give some context: my name is Juan
> Elosua and I came across Tika for document parsing in an information
> security project we are working on.
>
> First of all, sorry if this is not the right way to report potential issues,
> but I was unsure how to communicate them.
>
> The potential issue I found concerns tika-server version 1.22 and, more
> precisely, the language detection endpoint.
>
> If I send a PDF document to that endpoint, it returns 'th' (Thai) as the
> detected language, but the PDF document is in Spanish. I converted the
> PDF to a plain text file (using pdftotext) and reran the test, and then the
> language was detected correctly as 'es'.
>
> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
> th
> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
> es
>
> I used a publicly available PDF file to make this easy to reproduce. You can
> find the original document here:
> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
>
> Please let me know the best way to report issues.
>
> I saw the "reporting issues" docs for Tika, but should I create an account in
> order to report issues, or is that something internal to the core team?
>
> Thanks in advance
>
> Juan
>
