Hey Alex, what version of Tika Python are you using? And, more
importantly, what version of Tika? I'm CC'ing folks on [email protected];
hope you don't mind.

I took the file you attached, saved it as blah.txt, and ran
tika-python (with Tika 1.9) against it:

[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
tika.py: Retrieving http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
[(200, u'text/plain')]
[mattmann-0420740:~] mattmann% tika-python language file blah.txt
[(200, u'en')]
[mattmann-0420740:~] mattmann%

Is that what you would expect? In general the language detection uses
N-grams and gets better when there is more text as a sample, but it can
get fooled sometimes too.
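
To make the N-gram point concrete, here is a toy character-trigram sketch in Python. It is not how Tika implements language identification. The two sample sentences, the names `trigrams`, `cosine`, and `guess_language`, and the two-language profile set are all made up for illustration.

```python
from collections import Counter
import math


def trigrams(text):
    """Count overlapping character trigrams in lowercased, whitespace-normalized text."""
    cleaned = " ".join(text.lower().split())
    return Counter(cleaned[i:i + 3] for i in range(len(cleaned) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical, deliberately tiny training samples; a real detector
# would build its profiles from far larger corpora.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then the dog runs after the fox through the woods",
    "es": "el zorro salta sobre el perro perezoso y luego el perro corre tras el zorro por el bosque",
}
PROFILES = {lang: trigrams(text) for lang, text in SAMPLES.items()}


def guess_language(text):
    """Return (best_language, scores) by comparing the text's trigrams to each profile."""
    observed = trigrams(text)
    scores = {lang: cosine(observed, profile) for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    print(guess_language("the dog runs through the woods"))
    print(guess_language("el perro corre por el bosque"))
```

With only a handful of trigrams to compare, one stray phrase in another language (a footer, say) can shift the scores noticeably, which is roughly why a longer sample tends to give a steadier answer.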

Let me know what you think.

Cheers,
Chris

CC / memex-jpl@

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW:  http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++





-----Original Message-----
From: Alejandro Caceres <[email protected]>
Date: Thursday, July 16, 2015 at 11:53 AM
To: jpluser <[email protected]>
Cc: Amanda Towler <[email protected]>
Subject: Tika Issue?

>Hey Chris,
>
>
>I was about to submit this as a bug, but figured I'd run it by you first.
>Maybe you've encountered a similar issue.
>
>
>I'm doing some basic language categorization of websites, I saw that the
>Tika server/tika-python returns content as plain text, which is great to
>send to Tika language categorization (and just generally useful). However,
>it seemed to get very confused with sites that have footers in various
>languages, this is actually really common in the results we've found. For
>example, we have a totally English site and at the bottom is some links
>to the same site in other languages. This page, even though it's mostly
>English, gets categorized as a seemingly random language (like
>Lithuanian).
>
>
>As a workaround we tried running the web pages through a text
>summarization algo using lxml-readability, which gives us back a subset
>of the text on a page. My thinking was this would most likely strip
>footers and headers and give us back a decent representative sample of
>text on the page. The results seem to have improved a bit, but we're
>still getting some funky results where English pages are categorized as
>a seemingly random language, in many cases these pages seem pretty
>obviously (to the human eye) to be English.
>
>
>I wonder if someone at JPL (I don't see anyone from JPL here right now)
>could shed some light on why this might be happening. I've attached a
>couple of samples below. Also let me know if you'd like me to file any
>bugs anywhere to better track this, I just wanted to shoot this to you
>first to see if perhaps I was missing something obvious.
>
>
>
>Alex
>
>
>-- 
>___
>
>Alejandro Caceres
>Hyperion Gray, LLC
>Owner/CTO
>
