Hey Chris,

Awesome! That answers all of my questions in one neat package. Thanks for the help; I'll file the issues you mentioned.
This is all great stuff, btw. tika-python is super clean and easy to use.

Alex

On Thu, Jul 16, 2015 at 4:04 PM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Hey Alex,
>
> You nailed it. In the first two examples the text sample is too small for
> the language categorization to get right. Feel free to file a Tika issue
> for this (http://issues.apache.org/jira/browse/TIKA). There was some talk
> about integrating the Google language detector into this since it's ALv2;
> not sure whether it will perform better with smaller samples or not.
>
> As for the latter example, this one is interesting. The main reason is
> that Tika isn't detecting the file as HTML: it doesn't have a file
> extension, and the MIME magic for that page doesn't match the traditional
> HTML magic (e.g., <html>... blah). So it's parsing that with the
> TXTParser, which just extracts the characters/text from the stream:
>
> >>> print string_parsed["metadata"]
> {u'Content-Encoding': [u'ISO-8859-1'], u'Content-Type': [u'text/plain;
> charset=ISO-8859-1'], u'X-TIKA:parse_time_millis': [u'58'],
> u'X-Parsed-By': [[u'org.apache.tika.parser.DefaultParser',
> u'org.apache.tika.parser.txt.TXTParser']]}
> >>>
>
> You can use the from_file method to parse URLs as well. Those URLs are
> downloaded by tika-python to /tmp as files and then parsed from there. If
> you use .from_file on the above, it will correctly strip the text out, and
> then the language detector works. Try this:
>
> from tika import parser
> from tika import language
>
> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
> lang = language.from_buffer(parsed["content"])
> print lang
>
> Which should print out:
>
> >>> parsed = parser.from_file("http://ferretspatternu.ucoz.com/index.html")
> tika.py: Retrieving http://ferretspatternu.ucoz.com/index.html to
> /tmp/index.html.
> >>> lang = language.from_buffer(parsed["content"])
> >>> print lang
> en
> >>>
>
> Note that I had to add /index.html at the end. Our code for creating the
> tmp file in tika-python needs some prettying up: if the URL doesn't end
> with an actual file name or extension, it craps out. For now you can work
> around it that way. If you have time, please file a GitHub issue for
> tika-python and we'll make this better.
>
> HTH!
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Alejandro Caceres <[email protected]>
> Date: Thursday, July 16, 2015 at 12:56 PM
> To: jpluser <[email protected]>
> Cc: Amanda Towler <[email protected]>, "[email protected]"
> <[email protected]>, "[email protected]" <[email protected]>
> Subject: Re: Tika Issue?
>
> >Gah, I messed up the bug story. You're right, that text is categorized as
> >en; I screwed up with the file. Here is a better/more accurate summary of
> >what I'm seeing, with some examples. Pretend the previous email was all a
> >terrible dream.
> >
> >There appear to be two potential issues going on; let's start with the
> >language categorization because I already brought it up. I've attached
> >three files below for reference. language_no_1 and language_no_2 are both
> >picked up as Norwegian; I suspect this is because there's a small amount
> >of text.
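Chris's point above about URLs that don't end in a file name can be illustrated with a few lines of standard-library Python. This is a toy sketch of the failure mode only, not tika-python's actual download code (the `tmp_name` helper is made up for illustration):

```python
from os.path import basename
from urllib.parse import urlparse

def tmp_name(url):
    """Derive a local file name from a URL the way a naive downloader might."""
    return basename(urlparse(url).path)

# A URL ending in a real file name yields a usable tmp name ...
print(tmp_name("http://ferretspatternu.ucoz.com/index.html"))

# ... but a bare site URL yields an empty name, which would break
# creating the file under /tmp.
print(tmp_name("http://ferretspatternu.ucoz.com/"))
```

That empty-string result is why appending /index.html works around the problem.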
> >language_lt_2 is probably the most interesting to me: this text is
> >picked up as Lithuanian, and it seems to have a good amount of text, but
> >it has a footer that is in various languages. I suspected that was
> >throwing it off; however, most of the text is definitely English, so
> >perhaps something else is going on.
> >
> >The other issue I'm seeing is with the parser, but maybe I've
> >misunderstood something. Here is some code:
> >
> >import requests
> >from tika import parser
> >from tika import language
> >
> >r = requests.get("http://ferretspatternu.ucoz.com/")
> >string_parsed = parser.from_buffer(r.text)
> >lang = language.from_buffer(string_parsed["content"])
> >print string_parsed["content"]
> >print lang
> >
> >The language is picked up as Lithuanian, though I see why: the "content"
> >field looks like it is not plain text but raw HTML. In other documents
> >this field looks like it contains sanitized text... or am I missing
> >something?
> >
> >Anyway, hope that's all a little bit clearer! Let me know what you think.
> >
> >Alex
> >
> >PS: this is with the latest version of tika-python running Tika server 1.9.
> >
> >On Thu, Jul 16, 2015 at 2:58 PM, Mattmann, Chris A (3980)
> ><[email protected]> wrote:
> >
> >Hey Alex, what version of tika-python are you using? And moreover, what
> >version of Tika? I'm CC'ing folks on [email protected]; hope you don't
> >mind.
> >
> >I took the file you attached, saved it as blah.txt, and ran tika-python
> >(with Tika 1.9) against it:
> >
> >[mattmann-0420740:~] mattmann% tika-python detect type blah.txt
> >tika.py: Retrieving
> >http://search.maven.org/remotecontent?filepath=org/apache/tika/tika-server/1.9/tika-server-1.9.jar
> >to /var/folders/05/5qw82z2d77q16fhxxhwt22tr0000gq/T/tika-server.jar.
> >[(200, u'text/plain')]
> >[mattmann-0420740:~] mattmann% tika-python language file blah.txt
> >[(200, u'en')]
> >[mattmann-0420740:~] mattmann%
> >
> >Is that what you would expect? In general the language detection uses
> >N-grams and gets better when there is more text as a sample, but it can
> >get fooled sometimes too.
> >
> >Let me know what you think.
> >
> >Cheers,
> >Chris
> >
> >CC / memex-jpl@
> >
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Chris Mattmann, Ph.D.
> >Chief Architect
> >Instrument Software and Science Data Systems Section (398)
> >NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> >Office: 168-519, Mailstop: 168-527
> >Email: [email protected]
> >WWW: http://sunset.usc.edu/~mattmann/
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California, Los Angeles, CA 90089 USA
> >++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >-----Original Message-----
> >From: Alejandro Caceres <[email protected]>
> >Date: Thursday, July 16, 2015 at 11:53 AM
> >To: jpluser <[email protected]>
> >Cc: Amanda Towler <[email protected]>
> >Subject: Tika Issue?
> >
> >>Hey Chris,
> >>
> >>I was about to submit this as a bug, but figured I'd run it by you first.
> >>Maybe you've encountered a similar issue.
> >>
> >>I'm doing some basic language categorization of websites. I saw that the
> >>Tika server/tika-python returns content as plain text, which is great to
> >>send to Tika language categorization (and just generally useful).
> >>However, it seemed to get very confused with sites that have footers in
> >>various languages, which is actually really common in the results we've
> >>found. For example, we have a totally English site, and at the bottom
> >>are some links to the same site in other languages.
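Chris's point above about N-gram detection improving with more sample text can be shown with a toy character-trigram comparison. This is purely illustrative (Tika's actual language identifier is more sophisticated than this); the helper names are made up:

```python
from collections import Counter

def trigram_profile(text):
    """Character-trigram frequency profile of a text, padded with spaces."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def overlap(sample, reference):
    """Fraction of the sample's trigram mass also present in the reference."""
    total = sum(sample.values())
    shared = sum(n for gram, n in sample.items() if gram in reference)
    return shared / total if total else 0.0

english = trigram_profile("the quick brown fox jumps over the lazy dog "
                          "and then the dog chases the fox over the hill")
short = trigram_profile("ok")  # almost no trigrams: an unreliable signal
longer = trigram_profile("the fox and the dog run over the hill")

# The longer sample shares far more trigram mass with the English profile,
# which is why tiny samples are easy to misclassify.
print(overlap(short, english), overlap(longer, english))
```

With only a couple of trigrams to go on, a short sample can match almost any language profile about equally well, which is consistent with the Norwegian misfires on the tiny files.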
> >>This page, even though it's mostly English, gets categorized as a
> >>seemingly random language (like Lithuanian).
> >>
> >>As a workaround we tried running the web pages through a text
> >>summarization algorithm using lxml-readability, which gives us back a
> >>subset of the text on a page. My thinking was that this would most
> >>likely strip footers and headers and give us back a decent
> >>representative sample of the text on the page. The results seem to have
> >>improved a bit, but we're still getting some funky results where English
> >>pages are categorized as a seemingly random language; in many cases
> >>these pages seem pretty obviously (to the human eye) to be English.
> >>
> >>I wonder if someone at JPL (I don't see anyone from JPL here right now)
> >>could shed some light on why this might be happening. I've attached a
> >>couple of samples below. Also, let me know if you'd like me to file any
> >>bugs anywhere to better track this; I just wanted to shoot this to you
> >>first to see if perhaps I was missing something obvious.
> >>
> >>Alex
> >>
> >>--
> >>___
> >>
> >>Alejandro Caceres
> >>Hyperion Gray, LLC
> >>Owner/CTO

--
___

Alejandro Caceres
Hyperion Gray, LLC
Owner/CTO
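Both issues raised in this thread (raw HTML leaking into the content field, and short multilingual footer links skewing detection) can be mitigated with a pre-processing pass before language detection. A crude standard-library sketch, not a replacement for lxml-readability or Tika's own HTML parser; the class and helper names here are invented for illustration:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes from HTML, skipping script/style contents."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0
    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

def main_text(html, min_len=30):
    """Strip tags, then keep only chunks long enough to look like body text,
    dropping short link-like fragments such as footer language links."""
    extractor = TextExtractor()
    extractor.feed(html)
    return "\n".join(c for c in extractor.chunks if len(c) >= min_len)

page = ("<html><body><p>This paragraph is the actual English body text of "
        "the page and is comfortably long.</p>"
        "<div>Deutsch</div><div>Lietuviu</div><div>Norsk</div>"
        "</body></html>")
print(main_text(page))  # keeps the paragraph, drops the footer language links
```

Feeding the filtered text to language detection avoids both the raw-markup trigrams and the footer's foreign-language tokens; the length threshold is a blunt heuristic and would need tuning on real pages.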
