I just updated our wiki.  Please let me know if we can improve it further.

https://cwiki.apache.org/confluence/display/TIKA/TikaJAXRS#TikaJAXRS-LanguageResource
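
Since the endpoint expects UTF-8 text (see the quoted message below), one possible workaround is to have tika-server extract the text first and then pass that text to the language endpoint. The following is only a sketch under a few assumptions: a server running at http://localhost:9998, the /tika endpoint returning plain text when asked with "Accept: text/plain", and a helper name (detect_language) that is made up for illustration.

```shell
# Base URL of a locally running tika-server (assumed; adjust as needed).
TIKA_URL="${TIKA_URL:-http://localhost:9998}"

# detect_language FILE  (hypothetical helper name, not part of Tika)
# 1. PUT the file to /tika with "Accept: text/plain" to get the extracted text.
# 2. Pipe that text (read from stdin via @-) to /language/stream.
detect_language() {
  curl -s -X PUT --data-binary @"$1" -H "Accept: text/plain" "$TIKA_URL/tika" |
    curl -s -X PUT --data-binary @- "$TIKA_URL/language/stream"
}

# Usage: detect_language BOE-A-2019-9455.pdf
```

This keeps parsing and language detection as two separate requests, which matches how the endpoints appear to behave today.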

On Thu, Dec 5, 2019 at 10:44 AM Tim Allison <[email protected]> wrote:

> Looking at the source code for this (for the first time?), it looks like
> that endpoint expects UTF-8 text.  It does not parse the file and then run
> language identification on the parsed text.
>
> On Thu, Dec 5, 2019 at 6:43 AM Juan Elosua <[email protected]> wrote:
>
>> Hi all,
>>
>> Since this is my first email, allow me to give some context: my name is Juan
>> Elosua and I have come across Tika for document parsing for an information
>> security project we are working on.
>>
>> First of all, sorry if this is not the right way to report potential issues;
>> I was unsure how to communicate them.
>>
>> The potential issue I found concerns tika-server version 1.22 and more
>> precisely the language detector interface.
>>
>> If I send a PDF document to that endpoint, it returns 'th' (Thai) as the
>> detected language, but the PDF document is in Spanish. I converted the PDF
>> to a plain text file (using pdftotext) and reran the test, and then the
>> language was detected correctly as 'es'.
>>
>> $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
>> th
>> $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
>> es
>>
>> I used a publicly available PDF file to ease replication; you can find the
>> original document here:
>> https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf
>>
>> Please let me know the best way to report issues.
>>
>> I saw the "reporting issues" docs for Tika, but should I create an account in
>> order to report issues, or is that something internal to the core team?
>>
>> Thanks in advance
>>
>> Juan
>>
>
