Hi Tim,

Understood, so the only difference between the /stream and /string endpoints
is the bytestream-to-UTF-8 conversion.

With the change on the wiki, it is clearer that the file parsing is limited
to that.
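
A minimal sketch of the two-step workaround this implies, for anyone who
needs the language of a PDF: extract the text first, then send that text to
the language endpoint. It assumes the /tika endpoint returns UTF-8 plain text
when asked with "Accept: text/plain"; the detect_language helper and the
TIKA_URL variable are hypothetical names for illustration, not part of
tika-server.

```shell
# Base URL of the running tika-server (assumed default port from this thread).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# Hypothetical helper: parse the file with /tika, then run language
# detection on the extracted text instead of on the raw file bytes.
detect_language() {
  # Step 1: PUT the file to /tika and request plain text.
  # Step 2: pipe that text (read from stdin via @-) into /language/stream.
  curl -s -T "$1" -H "Accept: text/plain" "$TIKA_URL/tika" |
    curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}

# Usage (requires a running tika-server):
#   detect_language BOE-A-2019-9455.pdf
```

I have not verified the result against the sample PDF, so treat it as a
starting point rather than a confirmed fix.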

Thank you

Cheers

Juan

On Thu, Dec 5, 2019, 17:21 Tim Allison <[email protected]> wrote:

> I just updated our wiki.  Please let me know if we can improve it further.
>
>
> https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource
>
> On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <[email protected]> wrote:
>
> > In looking at the source code for this (for the first time?)...it looks
> > like that endpoint expects UTF-8 text.  It does not parse the file and then
> > run lang id on the parsed text.
> >
> > On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:
> >
> >> Hi all,
> >>
> >> Since this is my first email allow me to give some context: my name is
> >> Juan Elosua and I have come across tika for document parsing for an
> >> information security project we are working on.
> >>
> >> First of all sorry if this is not the way to send potential issues along
> >> but I was unsure how to communicate them.
> >>
> >> The potential issue I found concerns tika-server version 1.22 and more
> >> precisely the language detector interface.
> >>
> >> If I send a PDF document to that endpoint it returns *'th'* (Thai) as the
> >> detected language, but the PDF document is in Spanish. I converted the
> >> PDF to a plain text file (using pdftotext) and reran the test, and then
> >> the language was detected correctly as *'es'*.
> >>
> >> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
> >> th
> >> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
> >> es
> >>
> >> I have used a publicly available pdf file to ease the replication, you
> >> can find the original document here:
> >> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
> >>
> >> Please, let me know what's the best way to report issues.
> >>
> >> Saw the "reporting issues" docs for tika, but should I create an account
> >> in order to report the issues or is that something internal to the core
> >> team?
> >>
> >> Thanks in advance
> >>
> >> Juan
> >>
> >
>
