Hey Chris,

Awesome! That answers all of my questions in one neat package. Thanks for
the help, I'll file the issues you mentioned.

This is all great stuff btw, tika-python is super clean and easy to use.

Alex

On Thu, Jul 16, 2015 at 4:04 PM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Hey Alex,
>
> You nailed it. For the first 2 examples, the problem is that the sample of
> text is too small for it to get the language categorization right. Feel free
> to file a Tika issue for this (http://issues.apache.org/jira/browse/TIKA).
> There was some talk about integrating the Google language detector into this
> since it’s ALv2; not sure if it will perform better with smaller samples or
> not.
>
> As for the latter example, this one is interesting. The main reason is
> that it’s not detecting the file as HTML, since it doesn’t have a file
> extension and since the MIME magic for that page doesn’t match the
> traditional HTML magic (e.g., <html>.. blah). So, it’s parsing it with the
> TXTParser, which just extracts the characters/text from the stream:
>
> >>> print string_parsed["metadata"]
> {u'Content-Encoding': [u'ISO-8859-1'], u'Content-Type': [u'text/plain;
> charset=ISO-8859-1'], u'X-TIKA:parse_time_millis': [u'58'],
> u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser',
> u'org.apache.tika.parser.txt.TXTParser']]}
> >>>
>
> You can use the from_file method to parse URLs as well. Those URLs will
> be downloaded by tika-python to /tmp as files and then parsed from there.
> If you use .from_file on the above, it will correctly strip the text out,
> and then the language detector works. Try this:
>
> from tika import parser
> from tika import language
>
> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
> lang = language.from_buffer(parsed["content"])
> print lang
>
>
> Which should print out:
>
> >>> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html
> ")
> tika.py: Retrieving http://ferretspatternu.ucoz.com/index.html to
> /tmp/index.html.
> >>> lang = language.from_buffer(parsed["content"])
> >>> print lang
> en
> >>>
>
>
> Note that I had to add /index.html at the end. Our code to create the
> tmp file in tika-python needs some prettying up, so if the URL doesn’t end
> with an actual filename or extension it craps out. For now you can work
> around it that way. If you have time, please file a GitHub issue for
> tika-python and we’ll make this better.
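> In the meantime, here’s a rough sketch of that workaround (the page
> content here is a stand-in, and the tika calls are commented out since
> they need a running Tika server): fetch the page yourself, write it to a
> temp file with an explicit .html suffix so type detection has something
> to key off of, and hand that path to parser.from_file.

```python
# Sketch of the workaround for extension-less URLs: write the page to a
# temp file that ends in .html, then parse that file instead of the URL.
# Only the tempfile handling runs here; the tika calls are illustrative.
import tempfile

html = "<html><body>Hello ferrets</body></html>"  # stand-in for the page

with tempfile.NamedTemporaryFile(mode="w", suffix=".html",
                                 delete=False) as f:
    f.write(html)
    path = f.name

print(path.endswith(".html"))  # the extension Tika keys off of
# parsed = parser.from_file(path)             # needs a running Tika server
# lang = language.from_buffer(parsed["content"])
```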
>
> HTH!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
> -----Original Message-----
> From: Alejandro Caceres <[email protected]>
> Date: Thursday, July 16, 2015 at 12:56 PM
> To: jpluser <[email protected]>
> Cc: Amanda Towler <[email protected]>, "[email protected]"
> <[email protected]>, "[email protected]"
> <[email protected]>
> Subject: Re: Tika Issue?
>
> >Gah I messed up the bug story. You're right, that text is categorized as
> >en, I screwed up with the file. Here is a better/more accurate summary of
> >what I'm seeing, with some examples. Pretend the previous email
> > was all a terrible dream.
> >
> >
> >There appear to be two potential issues going on; let's start with the
> >language categorization, since I already brought it up. I've attached 3
> >files below for reference. language_no_1 and language_no_2 are both picked
> >up as Norwegian; I suspect this is because there's a small amount of text.
> >language_lt_2 is probably the most interesting to me: this text is picked
> >up as Lithuanian and seems to have a good amount of text, but it has a
> >footer that is in various languages. I suspected that was throwing it off;
> >however, most of the text is definitely English, so perhaps something else
> >is going on.
> >
> >
> >The other issue I'm seeing is with the parser, but maybe I've
> >misunderstood something. Here is some code:
> >
> >
> >import requests
> >from tika import parser
> >from tika import language
> >
> >
> >r = requests.get("http://ferretspatternu.ucoz.com/")
> >string_parsed = parser.from_buffer(r.text)
> >lang = language.from_buffer(string_parsed["content"])
> >print string_parsed["content"]
> >print lang
> >
> >
> >The language is picked up as Lithuanian, but I see why: the "content"
> >field looks like it is not plain text but raw HTML. In other documents
> >this field looks like it contains sanitized text... or am I missing
> >something?
> >
> >
> >
> >Anyway, hope that's all a little bit clearer! Let me know what you think.
> >
> >
> >
> >Alex
> >
> >
> >PS this is with the latest version of tika-python running tika server 1.9
> >
> >On Thu, Jul 16, 2015 at 2:58 PM, Mattmann, Chris A (3980)
> ><[email protected]> wrote:
> >
> >Hey Alex, what version of Tika Python are you using? And moreover
> >what version of Tika? I’m CC’ing folks on [email protected] hope you
> >don’t mind.
> >
> >I took the file you attached and saved it as blah.txt and ran
> >tika-python (with 1.9 tika) against it:
> >
> >[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
> >tika.py: Retrieving
> >
> http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server
> >/
> >1.9/tika-server-1.9.jar to
> >/var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
> >[(200, u'text/plain')]
> >[mattmann-0420740:~] mattmann% tika-python language file blah.txt
> >[(200, u'en')]
> >[mattmann-0420740:~] mattmann%
> >
> >Is that what you would expect? In general the language detection uses
> >N-grams and gets better when there is more text as a sample, but it can
> >get fooled sometimes too.
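> >To make the N-gram point concrete, here's a toy from-scratch sketch
> >(NOT Tika's actual detector, just an illustration): build character
> >trigram frequency profiles and compare their overlap. A two-character
> >sample shares essentially no trigrams with an English profile, so there
> >is nothing to match on.

```python
# Toy character-trigram profiler -- an illustration of why short samples
# mislead n-gram language detection, not Tika's real implementation.
from collections import Counter

def trigrams(text):
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def similarity(a, b):
    # overlap in trigram counts, normalized by the smaller profile
    shared = sum((a & b).values())
    return shared / max(1, min(sum(a.values()), sum(b.values())))

english = trigrams("the quick brown fox jumps over the lazy dog "
                   "and then the dog sleeps in the warm sun")
long_sample = trigrams("the dog and the fox ran over the warm lazy sun")
short_sample = trigrams("ok")

# a longer sample overlaps the profile far more than a tiny one does
print(similarity(english, long_sample) > similarity(english, short_sample))
```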
> >
> >Let me know what you think.
> >
> >Cheers,
> >Chris
> >
> >CC / memex-jpl@
> >
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: [email protected]
> >WWW:
> >http://sunset.usc.edu/~mattmann/ <http://sunset.usc.edu/~mattmann/>
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> >-----Original Message-----
> >From: Alejandro Caceres <[email protected]>
> >Date: Thursday, July 16, 2015 at 11:53 AM
> >To: jpluser <[email protected]>
> >Cc: Amanda Towler <[email protected]>
> >Subject: Tika Issue?
> >
> >>Hey Chris,
> >>
> >>
> >>I was about to submit this as a bug, but figured I'd run it by you first.
> >>Maybe you've encountered a similar issue.
> >>
> >>
> >>I'm doing some basic language categorization of websites. I saw that the
> >>Tika server/tika-python returns content as plain text, which is great to
> >>send to Tika language categorization (and just generally useful).
> >>However, it seemed to get very confused with sites that have footers in
> >>various languages; this is actually really common in the results we've
> >>found. For example, we have a totally English site, and at the bottom are
> >>some links to the same site in other languages. This page, even though
> >>it's mostly English, gets categorized as a seemingly random language
> >>(like Lithuanian).
> >>
> >>
> >>As a workaround we tried running the web pages through a text
> >>summarization algo using lxml-readability, which gives us back a subset
> >>of the text on a page. My thinking was this would most likely strip
> >>footers and headers and give us back a decent representative sample of
> >>the text on the page. The results seem to have improved a bit, but we're
> >>still getting some funky results where English pages are categorized as
> >>a seemingly random language; in many cases these pages seem pretty
> >>obviously (to the human eye) to be English.
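> >>One more idea along those lines (a from-scratch sketch, not a
> >>tika-python feature): classify each chunk of a page separately and take
> >>a majority vote, so a short multilingual footer can't flip the label for
> >>the whole page. detect() below is a stand-in for tika's
> >>language.from_buffer, stubbed with a crude English-stopword heuristic so
> >>the sketch runs without a server.

```python
# Sketch: classify each chunk of a page separately and take the majority
# vote, so a multilingual footer can't dominate the whole-page label.
# detect() is a stand-in for a real per-chunk language detector.
from collections import Counter

def detect(chunk):
    # crude stub: call a chunk English if enough words are common stopwords
    en = {"the", "and", "is", "of", "to", "a", "in"}
    words = chunk.lower().split()
    hits = sum(w in en for w in words)
    return "en" if words and hits / len(words) > 0.15 else "other"

def majority_language(text, chunk_size=40):
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    votes = Counter(detect(c) for c in chunks if c.strip())
    return votes.most_common(1)[0][0]

page = ("the ferret site is a guide to the care and feeding of ferrets " * 3
        + "Svetaine Seite Sivusto Pagina")  # multilingual footer at the end
print(majority_language(page))
```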
> >>
> >>
> >>I wonder if someone at JPL (I don't see anyone from JPL here right now)
> >>could shed some light on why this might be happening. I've attached a
> >>couple of samples below. Also let me know if you'd like me to file any
> >>bugs anywhere to better track this; I just wanted to shoot this to you
> >>first to see if perhaps I was missing something obvious.
> >>
> >>
> >>
> >>Alex
> >>
> >>
> >>--
> >>___
> >>
> >>Alejandro Caceres
> >>Hyperion Gray, LLC
> >>Owner/CTO
> >>
> >
> >--
> >___
> >
> >Alejandro Caceres
> >Hyperion Gray, LLC
> >Owner/CTO
> >
>
>


-- 
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
