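Looking at the source code for this (for the first time), it looks like the /language/stream endpoint expects UTF-8 text. It does not parse the file and then run language identification on the parsed text, so when you send a PDF the detector sees the raw PDF bytes rather than the document's text.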
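As a possible workaround, the text can be extracted first and only then sent to the language endpoint. The following is a minimal sketch, not a tested recipe: it assumes tika-server is running at http://localhost:9998 (as in the original report), that PUT /tika with "Accept: text/plain" returns the extracted text as UTF-8, and the names http_put and detect_language are hypothetical helpers made up for this example.

```python
import urllib.request

# Base URL taken from the original report; adjust for your setup.
TIKA_URL = "http://localhost:9998"


def http_put(url, body, headers):
    """PUT raw bytes to url and return the response body as bytes."""
    req = urllib.request.Request(url, data=body, headers=headers, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def detect_language(raw, base_url=TIKA_URL, put=http_put):
    """Extract text via /tika, then send that text to /language/stream.

    `put` is injectable so the two-step flow can be exercised without a server.
    """
    # Step 1: let tika-server parse the document and return plain text.
    # Assumption: the response body is UTF-8 encoded text.
    text = put(base_url + "/tika", raw, {"Accept": "text/plain"})
    # Step 2: send the extracted text, not the original file, to the detector.
    lang = put(base_url + "/language/stream", text, {})
    return lang.decode("utf-8").strip()


# Example call (requires a running tika-server):
#   with open("BOE-A-2019-9455.pdf", "rb") as f:
#       print(detect_language(f.read()))
```

This mirrors what the pdftotext test in the report did by hand: the language endpoint only ever receives plain text.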

On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:

> Hi all,
>
> Since this is my first email allow me to give some context: my name is Juan
> Elosua and I have come across tika for document parsing for an information
> security project we are working on.
>
> First of all sorry if this is not the way to send potential issues along
> but I was unsure how to communicate them.
>
> The potential issue I found concerns tika-server version 1.22 and more
> precisely the language detector interface.
>
> If I send a PDF document to that endpoint it returns 'th' (thai) as the
> detected language but the pdf document is in spanish. I have converted the
> pdf to a plain text file (using pdftotext) and rerun the test and then the
> language has been detected correctly as 'es'
>
> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
> th
> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
> es
>
> I have used a publicly available pdf file to ease the replication, you can
> find the original document here:
> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
>
> Please, let me know what's the best way to report issues.
>
> Saw the "reporting issues" docs for tika, but should I create an account in
> order to report the issues or is that something internal to the core team?
>
> Thanks in advance
>
> Juan
