Wrong language detection in tika server 1.22

Juan Elosua Thu, 05 Dec 2019 03:44:00 -0800

Hi all,

Since this is my first email, allow me to give some context: my name is Juan
Elosua, and I have come across Tika as a document parser for an information
security project we are working on.


First of all, sorry if this is not the right way to send potential issues
along, but I was unsure how to communicate them.

The potential issue I found concerns tika-server version 1.22 and, more
precisely, the language detector interface.

If I send a PDF document to that endpoint, it returns 'th' (Thai) as the
detected language, although the document is in Spanish. When I convert the
PDF to a plain text file (using pdftotext) and rerun the test, the
language is detected correctly as 'es':

$ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
th
$ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
es
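
For anyone who prefers to script the comparison, here is a minimal Python
sketch of the same two requests. It assumes tika-server is listening on
localhost:9998, as in the curl commands above. The helper names
build_language_request and detect_language are made up for illustration and
are not part of Tika.

```python
import urllib.request

# Assumption: default tika-server address, copied from the curl commands above.
TIKA_LANGUAGE_URL = "http://localhost:9998/language/stream"


def build_language_request(payload: bytes, url: str = TIKA_LANGUAGE_URL) -> urllib.request.Request:
    """Build (but do not send) a PUT request carrying the raw document bytes."""
    return urllib.request.Request(url, data=payload, method="PUT")


def detect_language(path: str, url: str = TIKA_LANGUAGE_URL) -> str:
    """Send the file at `path` to the server and return the language code it reports."""
    with open(path, "rb") as handle:
        request = build_language_request(handle.read(), url)
    with urllib.request.urlopen(request) as response:
        # The endpoint answers with a short plain-text language code.
        return response.read().decode("utf-8").strip()
```

Calling detect_language("BOE-A-2019-9455.pdf") and then
detect_language("BOE-A-2019-9455.txt") against a running server sends the same
two requests as the curl commands.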

I have used a publicly available PDF file to ease replication. You can
find the original document here:
https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf

Please let me know the best way to report issues.

I saw the "reporting issues" docs for Tika, but should I create an account in
order to report issues, or is that something internal to the core team?

Thanks in advance

Juan
