I just updated our wiki. Please let me know if we can improve it further. https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource
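
For anyone who finds this thread later: since the language endpoint expects UTF-8 text and does not parse the file itself, one possible workaround is to extract the text first and then send that text to the language endpoint. Below is a minimal sketch of that two-step flow, not an official client. It assumes the server is on the default http://localhost:9998 used in the commands in this thread, that PUT /tika with "Accept: text/plain" returns the extracted text as UTF-8, and that sending "Content-Type: application/octet-stream" is acceptable for the upload. The function names (http_put, extract_text, detect_language) and the injectable "put" parameter are hypothetical and exist only for this illustration.

```python
import urllib.request

# Assumption: default tika-server address, as used in the curl commands in this thread.
TIKA_URL = "http://localhost:9998"


def http_put(url, data, headers):
    """Send an HTTP PUT with the given body and headers; return the raw response body."""
    req = urllib.request.Request(url, data=data, headers=headers, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return resp.read()


def extract_text(doc_bytes, base_url=TIKA_URL, put=http_put):
    """Step 1: ask /tika to parse the document and return plain text (bytes)."""
    headers = {
        "Accept": "text/plain",
        # Assumption: a generic content type, so the server detects the real type itself.
        "Content-Type": "application/octet-stream",
    }
    return put(base_url + "/tika", doc_bytes, headers)


def detect_language(doc_bytes, base_url=TIKA_URL, put=http_put):
    """Step 2: run language id on the extracted text rather than on the raw file."""
    text = extract_text(doc_bytes, base_url, put)
    # No extra headers here; the endpoint is sent the extracted text as the request body.
    result = put(base_url + "/language/stream", text, {})
    return result.decode("utf-8").strip()
```

With a server running, reading BOE-A-2019-9455.pdf as bytes and passing it to detect_language would send the parsed text, not the PDF bytes, to /language/stream. The "put" parameter is only there so the HTTP call can be swapped out when no server is available.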

On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <[email protected]> wrote:

> In looking at the source code for this (for the first time?)...it looks
> like that endpoint expects UTF-8 text. It does not parse the file and then
> run lang id on the parsed text.
>
> On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:
>
>> Hi all,
>>
>> Since this is my first email, allow me to give some context: my name is
>> Juan Elosua and I have come across Tika for document parsing for an
>> information security project we are working on.
>>
>> First of all, sorry if this is not the way to send potential issues along,
>> but I was unsure how to communicate them.
>>
>> The potential issue I found concerns tika-server version 1.22 and, more
>> precisely, the language detector interface.
>>
>> If I send a PDF document to that endpoint it returns 'th' (Thai) as the
>> detected language, but the PDF document is in Spanish. I have converted the
>> PDF to a plain text file (using pdftotext) and rerun the test, and then the
>> language has been detected correctly as 'es':
>>
>> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
>> th
>> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
>> es
>>
>> I have used a publicly available PDF file to ease the replication; you can
>> find the original document here:
>> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
>>
>> Please let me know what's the best way to report issues.
>>
>> I saw the "reporting issues" docs for Tika, but should I create an account
>> in order to report the issues, or is that something internal to the core
>> team?
>>
>> Thanks in advance
>>
>> Juan
