Hi all,

Since this is my first email, allow me to give some context: my name is Juan Elosua, and I came across Tika while looking for a document parser for an information security project we are working on.

First of all, sorry if this is not the right way to send potential issues along; I was unsure how to communicate them.

The potential issue I found concerns tika-server version 1.22, and more precisely the language detector interface. If I send a PDF document to that endpoint, it returns 'th' (Thai) as the detected language, but the PDF document is in Spanish. I converted the PDF to a plain text file (using pdftotext) and reran the test, and this time the language was detected correctly as 'es':

    $ curl -X PUT --data-binary @BOE-A-2019-9455.pdf http://localhost:9998/language/stream
    th
    $ curl -X PUT --data-binary @BOE-A-2019-9455.txt http://localhost:9998/language/stream
    es

I used a publicly available PDF file to ease replication. You can find the original document here:
https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf

Please let me know what the best way to report issues is. I saw the "reporting issues" docs for Tika, but should I create an account in order to report issues, or is that reserved for the core team?

Thanks in advance,
Juan
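
P.S. In case it helps with replication, here is a rough sketch of the comparison as a small shell script. It assumes tika-server is already running on localhost:9998 and that curl and pdftotext are installed. The function names `detect_language` and `compare_detection` and the `TIKA_URL` variable are just names I made up for illustration; they are not part of Tika.

```shell
#!/bin/sh
# Assumption: tika-server is listening on localhost:9998.
# TIKA_URL is a hypothetical override for a different host or port.
TIKA_URL="${TIKA_URL:-http://localhost:9998/language/stream}"

# PUT a file to the language endpoint and print whatever the server returns.
detect_language() {
    curl -s -X PUT --data-binary "@$1" "$TIKA_URL"
}

# Convert the PDF to text with pdftotext, then run detection on both files.
compare_detection() {
    pdf="$1"
    txt="${pdf%.pdf}.txt"
    pdftotext "$pdf" "$txt"
    echo "pdf: $(detect_language "$pdf")"
    echo "txt: $(detect_language "$txt")"
}
```

After downloading the document (for example with `curl -O https://www.boe.es/boe/dias/2019/06/24/pdfs/BOE-A-2019-9455.pdf`), running `compare_detection BOE-A-2019-9455.pdf` should print the two detected codes side by side, which makes it easy to see whether the PDF and its text version disagree.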
