Hey Alex,
You nailed it. In the first 2 examples there is too small a sample of text
for the language categorization to get it right. Feel free to file a Tika
issue for this (http://issues.apache.org/jira/browse/TIKA). There was some
talk about integrating the Google language detector into Tika since it’s
ALv2, but I’m not sure whether it will perform better with smaller samples.
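To make the small-sample point concrete, here is a toy sketch. It is not Tika's actual detector, just an illustration of the N-gram idea mentioned further down the thread: character trigrams are counted, and a short snippet yields very few of them to compare against a language profile. The function name char_ngrams and both sample strings are made up for illustration.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count the character n-grams in text (toy illustration, not Tika's code)."""
    # Lowercase and collapse whitespace so that layout does not affect the counts.
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


# Made-up samples: a tiny snippet versus a full sentence.
short_sample = "Contact us"
longer_sample = ("This page describes how ferrets are bred, raised and cared for, "
                 "with notes on diet, housing and common health problems.")

# The short sample produces far fewer distinct trigrams than the longer one.
print(len(char_ngrams(short_sample)))
print(len(char_ngrams(longer_sample)))
```

With so few trigrams to go on, a couple of unusual letter combinations can tip the guess toward the wrong language.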
As for the latter example, this one is interesting. The main reason is
that it’s not detecting the content as HTML, since it doesn’t have a file
extension and since the MIME magic for that page doesn’t match the
traditional HTML magic (e.g., <html>.. blah). So it’s parsing that with the
TXTParser, which just extracts the characters/text from the stream:
>>> print string_parsed["metadata"]
{u'Content-Encoding': [u'ISO-8859-1'], u'Content-Type': [u'text/plain;
charset=ISO-8859-1'], u'X-TIKA:parse_time_millis': [u'58'],
u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser',
u'org.apache.tika.parser.txt.TXTParser']]}
>>>
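If you want to catch this case in your own code, a minimal sketch along these lines might help. It only inspects a metadata dict shaped like the one printed above; the helper name parsed_as_plain_text is hypothetical and not part of tika-python.

```python
def parsed_as_plain_text(metadata):
    """Return True if the metadata reports a text/plain content type."""
    content_type = metadata.get("Content-Type", "")
    # In the output above the value is a list; also accept a bare string.
    if isinstance(content_type, (list, tuple)):
        content_type = content_type[0] if content_type else ""
    # Drop any "; charset=..." suffix before comparing.
    return content_type.split(";")[0].strip().lower() == "text/plain"


# Values copied from the metadata printed above.
metadata = {
    "Content-Encoding": ["ISO-8859-1"],
    "Content-Type": ["text/plain; charset=ISO-8859-1"],
}
print(parsed_as_plain_text(metadata))
```

If this returns True for a page you know is HTML, the "content" field is probably raw markup and the language guess should not be trusted.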
You can use the from_file method to parse URLs as well. Those URLs will be
downloaded by Tika python to /tmp as files and then parsed from there. If
you use .from_file on the above, it will correctly strip the text out, and
then the language detector works. Try this:
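from tika import parser
from tika import language

# from_file also accepts a URL; tika-python downloads it to /tmp first
parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
# run language detection on the extracted text, not the raw HTML
lang = language.from_buffer(parsed["content"])
print(lang)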
Which should print out:
>>> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
tika.py: Retrieving http://ferretspatternu.ucoz.com/index.html to /tmp/index.html.
>>> lang = language.from_buffer(parsed["content"])
>>> print lang
en
>>>
Note that I had to add /index.html at the end. Our code to create the
tmp file in tika-python needs some cleaning up: if the URL doesn’t end
with an actual file name or extension, it fails. For now you can work around
it that way. If you have time, please file a GitHub issue for tika-python
and we’ll make this better.
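Another possible workaround, sketched below for Python 3: fetch the page yourself, save it to a temp file with an .html extension, and hand that file to from_file. This assumes, per the explanation above, that the extension is enough of a hint for the page to be treated as HTML; I have not verified that for every page. The helper names url_ends_with_file, save_as_html and parse_url are made up, and parse_url needs requests, tika-python and a reachable Tika server.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlparse


def url_ends_with_file(url):
    """True if the URL path ends in a name with an extension, e.g. index.html."""
    name = posixpath.basename(urlparse(url).path)
    return bool(posixpath.splitext(name)[1])


def save_as_html(text, encoding="utf-8"):
    """Write text to a new temp file with an .html extension and return its path."""
    fd, path = tempfile.mkstemp(suffix=".html")
    with os.fdopen(fd, "wb") as handle:
        handle.write(text.encode(encoding))
    return path


def parse_url(url):
    """Parse a URL with Tika, going through a local .html file when needed."""
    # Imported here so the helpers above work without these packages installed.
    import requests
    from tika import parser

    if url_ends_with_file(url):
        return parser.from_file(url)
    path = save_as_html(requests.get(url).text)
    try:
        return parser.from_file(path)
    finally:
        os.remove(path)  # clean up the temp file whether or not parsing worked
```

Usage would be parsed = parse_url("http://ferretspatternu.ucoz.com/"), followed by language.from_buffer(parsed["content"]) as before.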
HTH!
Cheers,
Chris
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-----Original Message-----
From: Alejandro Caceres <[email protected]>
Date: Thursday, July 16, 2015 at 12:56 PM
To: jpluser <[email protected]>
Cc: Amanda Towler <[email protected]>, "[email protected]"
<[email protected]>, "[email protected]"
<[email protected]>
Subject: Re: Tika Issue?
>Gah I messed up the bug story. You're right, that text is categorized as
>en, I screwed up with the file. Here is a better/more accurate summary of
>what I'm seeing, with some examples. Pretend the previous email was all a
>terrible dream.
>
>
>There appear to be two potential issues going on. Let's start with the
>language categorization because I already brought it up. I've attached 3
>files below for reference. language_no_1 and language_no_2 are both picked
>up as Norwegian; I suspect this is because there's a small amount of text.
>language_lt_2 is probably the most interesting to me: this text is picked
>up as Lithuanian and seems to have a good amount of text, but has a footer
>that is in various languages. I suspected that was throwing it off, however
>most of the text is definitely English so perhaps something else is going
>on.
>
>
>The other issue I'm seeing is with the parser, but maybe I've
>misunderstood something. Here is some code:
>
>
>import requests
>from tika import parser
>from tika import language
>
>
>r = requests.get("http://ferretspatternu.ucoz.com/")
>string_parsed = parser.from_buffer(r.text)
>lang = language.from_buffer(string_parsed["content"])
>print string_parsed["content"]
>print lang
>
>
>The language is picked up as Lithuanian, and I can see why: the "content"
>field looks like it is not plain text, but instead raw HTML. In other
>documents this field looks like it contains sanitized text... or am I
>missing something?
>
>
>
>Anyway, hope that's all a little bit clearer! Let me know what you think.
>
>
>
>Alex
>
>
>PS this is with the latest version of tika-python running tika server 1.9
>
>On Thu, Jul 16, 2015 at 2:58 PM, Mattmann, Chris A (3980)
><[email protected]> wrote:
>
>Hey Alex, what version of Tika Python are you using? And moreover
>what version of Tika? I’m CC’ing folks on [email protected] hope you
>don’t mind.
>
>I took the file you attached and saved it as blah.txt and ran
>tika-python (with 1.9 tika) against it:
>
>[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
>tika.py: Retrieving
>http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar
>to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
>[(200, u'text/plain')]
>[mattmann-0420740:~] mattmann% tika-python language file blah.txt
>[(200, u'en')]
>[mattmann-0420740:~] mattmann%
>
>Is that what you would expect? In general the language detection uses
>N-grams and gets better when there is more text as a sample, but it can
>get fooled sometimes too.
>
>Let me know what you think.
>
>Cheers,
>Chris
>
>CC / memex-jpl@
>
>-----Original Message-----
>From: Alejandro Caceres <[email protected]>
>Date: Thursday, July 16, 2015 at 11:53 AM
>To: jpluser <[email protected]>
>Cc: Amanda Towler <[email protected]>
>Subject: Tika Issue?
>
>>Hey Chris,
>>
>>
>>I was about to submit this as a bug, but figured I'd run it by you first.
>>Maybe you've encountered a similar issue.
>>
>>
>>I'm doing some basic language categorization of websites. I saw that the
>>Tika server/tika-python returns content as plain text, which is great to
>>send to Tika language categorization (and just generally useful).
>>However, it seemed to get very confused with sites that have footers in
>>various languages, which is actually really common in the results we've
>>found. For example, we have a totally English site and at the bottom are
>>some links to the same site in other languages. This page, even though
>>it's mostly English, gets categorized as a seemingly random language
>>(like Lithuanian).
>>
>>
>>As a workaround we tried running the web pages through a text
>>summarization algo using lxml-readability, which gives us back a subset
>>of the text on a page. My thinking was this would most likely strip
>>footers and headers and give us back a decent representative sample of
>>text on the page. The results seem to have improved a bit, but we're
>>still getting some funky results where English pages are categorized as a
>>seemingly random language. In many cases these pages seem pretty
>>obviously (to the human eye) to be English.
>>
>>
>>I wonder if someone at JPL (I don't see anyone from JPL here right now)
>>could shed some light on why this might be happening. I've attached a
>>couple of samples below. Also let me know if you'd like me to file any
>>bugs anywhere to better track this. I just wanted to shoot this to you
>>first to see if perhaps I was missing something obvious.
>>
>>
>>
>>Alex
>>
>>
>>--
>>___
>>
>>Alejandro Caceres
>>Hyperion Gray, LLC
>>Owner/CTO
>>
>