Hey Alex, what version of Tika Python are you using? And, moreover, what version of Tika? I'm CC'ing folks on [email protected]; hope you don't mind.
I took the file you attached, saved it as blah.txt, and ran tika-python (with Tika 1.9) against it:

[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
tika.py: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
[(200, u'text/plain')]
[mattmann-0420740:~] mattmann% tika-python language file blah.txt
[(200, u'en')]
[mattmann-0420740:~] mattmann%

Is that what you would expect? In general, language detection uses N-grams and gets better when there is more text as a sample, but it can get fooled sometimes too. Let me know what you think.

Cheers,
Chris

CC / memex-jpl@

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Alejandro Caceres <[email protected]>
Date: Thursday, July 16, 2015 at 11:53 AM
To: jpluser <[email protected]>
Cc: Amanda Towler <[email protected]>
Subject: Tika Issue?

>Hey Chris,
>
>I was about to submit this as a bug, but figured I'd run it by you first.
>Maybe you've encountered a similar issue.
>
>I'm doing some basic language categorization of websites. I saw that the
>Tika server/tika-python returns content as plain text, which is great to
>send to Tika language categorization (and just generally useful). However,
>it seemed to get very confused with sites that have footers in various
>languages, which is actually really common in the results we've found. For
>example, we have a totally English site and at the bottom are some links
>to the same site in other languages. This page, even though it's mostly
>English, gets categorized as a seemingly random language (like Lithuanian).
>
>As a workaround we tried running the web pages through a text
>summarization algo using lxml-readability, which gives us back a subset
>of the text on a page. My thinking was this would most likely strip
>footers and headers and give us back a decent representative sample of
>text on the page. The results seem to have improved a bit, but we're still
>getting some funky results where English pages are categorized as a
>seemingly random language. In many cases these pages seem pretty obviously
>(to the human eye) to be English.
>
>I wonder if someone at JPL (I don't see anyone from JPL here right now)
>could shed some light on why this might be happening. I've attached a
>couple of samples below. Also let me know if you'd like me to file any
>bugs anywhere to better track this. I just wanted to shoot this to you
>first to see if perhaps I was missing something obvious.
>
>Alex
>
>--
>___
>
>Alejandro Caceres
>Hyperion Gray, LLC
>Owner/CTO
>
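
To illustrate the N-gram point from the reply above, here is a minimal sketch of profile-based language detection. It is not Tika's implementation: the trigram profiles, the cosine scoring, the function names, and the two tiny reference sentences are all assumptions made up for illustration. It scores a short English body on its own, and then again with a repeated non-English footer appended, so the two sets of scores can be compared.

```python
from collections import Counter
from math import sqrt


def trigram_profile(text):
    """Map each character trigram in `text` to its relative frequency."""
    cleaned = " ".join(text.lower().split())  # lowercase, collapse whitespace
    counts = Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))
    total = sum(counts.values())
    return {gram: n / total for gram, n in counts.items()} if total else {}


def similarity(a, b):
    """Cosine similarity between two trigram profiles (0.0 if either is empty)."""
    dot = sum(weight * b.get(gram, 0.0) for gram, weight in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def detect(text, references):
    """Return the language whose reference profile is closest to `text`."""
    sample = trigram_profile(text)
    return max(references, key=lambda lang: similarity(sample, references[lang]))


# Toy reference corpora (hypothetical, far too small for real use).
REFERENCES = {
    "en": trigram_profile(
        "the quick brown fox jumps over the lazy dog "
        "and the cat sleeps near the warm fire"
    ),
    "es": trigram_profile(
        "el rapido zorro marron salta sobre el perro perezoso "
        "y el gato duerme cerca del fuego"
    ),
}

body = "the dog and the fox"
footer = "el perro y el gato " * 5  # stand-in for repeated non-English footer links

# Print per-language scores for the body alone and for body plus footer.
for label, text in [("body only", body), ("body + footer", body + " " + footer)]:
    sample = trigram_profile(text)
    scores = {lang: round(similarity(sample, ref), 3) for lang, ref in REFERENCES.items()}
    print(label, scores)
```

Because every trigram in the sample counts toward the profile, a short body followed by a long block of footer text in other languages contributes relatively few trigrams of its own, which is one plausible way a mostly English page could end up scored as something else.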
