Huh... is anybody interested in this? Normally I wouldn't be so pushy, but it seems to me that Nutch dies if it meets a Word document which can't be parsed. This seems like a serious issue to me. Or did I overlook something important or fundamental?

Lukas
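One way to narrow this down, as a sketch only: the fetcher.parse property quoted in the thread below can be overridden in nutch-site.xml so that the fetcher does not parse content while fetching. Only the property name, its description, and its default of true come from this thread. Whether switching it off avoids the failure has not been tested here, and the enclosing root element of the file is left out on purpose.

```xml
<!-- Place inside the existing root element of nutch-site.xml. -->
<!-- Assumption: this overrides the default (true) from nutch-default.xml. -->
<!-- Untested as a workaround for the unparseable .doc file. -->
<property>
  <name>fetcher.parse</name>
  <value>false</value>
  <description>If true, fetcher will parse content.</description>
</property>
```

If the crawl then gets past the fetch step, that would point at the parsing stage rather than the fetch itself.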
On 1/6/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I found the reason for that exception!
> If you look into my crawl.log carefully, you will notice these lines:
>
> 060104 213608 Parsing [http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with [EMAIL PROTECTED]
> 060104 213609 Unable to successfully parse content http://220.000.000.001/otd_04_Detailed_Design_Document.doc of type application/msword
> 060104 213609 Error parsing: http://220.000.000.001/otd_04_Detailed_Design_Document.doc: notparsed(0,0)
> 060104 213609 Using Signature impl: org.apache.nutch.crawl.MD5Signature
>
> I have one Word document which cannot be parsed properly. This causes
> the issue. If I remove this document, then Nutch finishes correctly.
> But if the list of files contains this Word document (or even if this
> is the only document to be crawled), then I always receive that
> exception.
>
> Can anybody look at this issue?
> If anybody is interested in that Word document, I can send it (but I
> really wouldn't like to see it becoming a regular part of the Nutch test
> package; you know, it is some kind of internal document [though it does
> not contain any useful information] :-)
>
> But in general I think there should be some test in Nutch to ensure
> that it can cope with such documents.
>
> Regards,
> Lukas
>
> On 1/5/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > Hi Andrzej,
> >
> > This is what sets the Fetcher's parsing to true or false, right?
> >
> > <property>
> >   <name>fetcher.parse</name>
> >   <value>true</value>
> >   <description>If true, fetcher will parse content.</description>
> > </property>
> >
> > I don't have my nutch-default and nutch-site files with me right now,
> > but I am about 95% sure I didn't change this value in my
> > nutch-site (and I didn't change nutch-default at all).
> >
> > So the answer is YES, the Fetcher is in parsing mode (with ~95% confidence).
> >
> > I am running Nutch against my local Apache (not visible to you). But
> > you may have noticed that I used depth=2, so only a few pages (16 to be
> > exact) are crawled. If you are interested, I can send you all of them so
> > that you can upload this content to any server you need for your
> > tests.
> >
> > Look into the crawl.log file (attached to the previous email, sent at 8:21am
> > today) for details.
> >
> > I will try to simulate this issue with one or two arbitrary HTML
> > pages. If that reproduces the issue, I can send them to you.
> >
> > Lukas
> >
> > On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > > Lukas Vlcek wrote:
> > >
> > > >How can I learn that?
> > > >What I do is run the regular one-step command [/bin/nutch crawl]
> > > >
> > >
> > > In that case your nutch-default.xml / nutch-site.xml decides; there is a
> > > boolean option there. If you didn't change this, then it defaults to
> > > true (i.e. your fetcher is parsing the content).
> > >
> > > Would it be easy to reproduce this if I knew the seed URLs? If that's the
> > > case, please send me the seed URLs (contact me off the list if it's
> > > sensitive).
> > >
> > > --
> > > Best regards,
> > > Andrzej Bialecki <><
> > > ___. ___ ___ ___ _ _ __________________________________
> > > [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> > > ___|||__|| \| || | Embedded Unix, System Integration
> > > http://www.sigram.com Contact: info at sigram dot com