Hi Tim,

Understood, so the only difference between the /stream and /string endpoints is the bytestream to UTF-8 conversion.
With the change on the wiki it is clearer that the file parsing is limited to that.

Thank you

Cheers

Juan

On Thu, Dec 5, 2019, 17:21 Tim Allison <[email protected]> wrote:

> I just updated our wiki. Please let me know if we can improve it further.
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource
>
> On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <[email protected]> wrote:
>
> > In looking at the source code for this (for the first time?)...it looks
> > like that endpoint expects UTF-8 text. It does not parse the file and then
> > run lang id on the parsed text.
> >
> > On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Since this is my first email allow me to give some context: my name is
> >> Juan Elosua and I have come across Tika for document parsing for an
> >> information security project we are working on.
> >>
> >> First of all, sorry if this is not the way to send potential issues along,
> >> but I was unsure how to communicate them.
> >>
> >> The potential issue I found concerns tika-server version 1.22 and, more
> >> precisely, the language detector interface.
> >>
> >> If I send a PDF document to that endpoint it returns 'th' (Thai) as the
> >> detected language, but the PDF document is in Spanish. I have converted
> >> the PDF to a plain text file (using pdftotext) and rerun the test, and
> >> then the language has been detected correctly as 'es'.
> >>
> >> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
> >> th
> >> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
> >> es
> >>
> >> I have used a publicly available PDF file to ease the replication; you
> >> can find the original document here:
> >> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
> >>
> >> Please let me know what's the best way to report issues.
> >>
> >> I saw the "reporting issues" docs for Tika, but should I create an account
> >> in order to report the issues, or is that something internal to the core
> >> team?
> >>
> >> Thanks in advance
> >>
> >> Juan
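
Since the language endpoint expects UTF-8 text and does not parse the file itself, one possible workaround is to let tika-server extract the text first and then pass that text to /language/stream. The following is only a minimal sketch, assuming the server from the thread is running at http://localhost:9998 and that its /tika endpoint returns plain text when called with `Accept: text/plain`. `TIKA_URL` and `detect_language` are hypothetical names introduced here for illustration; they are not part of Tika.

```shell
#!/bin/sh
# Assumed base URL of a locally running tika-server (same host and port as
# in the commands quoted in the thread); override via the environment.
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper, not part of Tika:
#   step 1: PUT the document to /tika and ask for the extracted plain text
#   step 2: pipe that text to /language/stream, which expects UTF-8 text
detect_language() {
    curl -s -X PUT -H "Accept: text/plain" --data-binary @"$1" "$TIKA_URL/tika" |
        curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}
```

With the server running, `detect_language BOE-A-2019-9455.pdf` would run language detection on the extracted text rather than on the raw PDF bytes.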
