Hi, I found the reason for that exception! If you look at my crawl.log carefully, you will notice these lines:
060104 213608 Parsing [http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with [EMAIL PROTECTED]
060104 213609 Unable to successfully parse content http://220.000.000.001/otd_04_Detailed_Design_Document.doc of type application/msword
060104 213609 Error parsing: http://220.000.000.001/otd_04_Detailed_Design_Document.doc: notparsed(0,0)
060104 213609 Using Signature impl: org.apache.nutch.crawl.MD5Signature

I have one Word document which cannot be parsed properly, and this is what causes the issue. If I remove this document then Nutch finishes correctly, but if the list of files contains this Word document (or even if it is the only document to be crawled) then I always receive that exception.

Can anybody look at this issue? If anybody is interested in that Word document I can send it (but I really wouldn't like to see it becoming a regular part of the Nutch test package; you know, it is a kind of internal document, though it does not contain any useful information :-). In general, though, I think there should be a test in Nutch to make sure it can withstand such documents.

Regards,
Lukas

On 1/5/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
>
> This is what sets Fetcher to parse to true or false, right?
>
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content.</description>
> </property>
>
> I don't have my nutch-default and nutch-site files with me right now,
> but I am fairly sure I didn't change this value in my nutch-site
> (and I didn't change nutch-default at all).
>
> So the answer is YES, Fetcher is in parsing mode (with ~95% confidence).
>
> I am running Nutch against my local Apache server (not visible to you).
> But you may have noticed that I used depth=2, so only a few pages
> (16 to be exact) are crawled. If you are interested I can send them all
> to you so that you can upload this content to any server you need for
> your tests.
> Look into the crawl.log file (attached to the previous email sent at
> 8:21am today) for details.
>
> I will try to simulate this issue with one or two arbitrary HTML
> pages. If that reproduces the issue then I can send them to you.
>
> Lukas
>
> On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Lukas Vlcek wrote:
> >
> > > How can I learn that?
> > > What I do is run the regular one-step command [bin/nutch crawl].
> >
> > In that case your nutch-default.xml / nutch-site.xml decides; there is
> > a boolean option there. If you didn't change it, then it defaults to
> > true (i.e. your fetcher is parsing the content).
> >
> > Is it easy to reproduce this if I knew the seed urls? If that's the
> > case, please send me the seed urls (contact me off the list, if it's
> > sensitive).
> >
> > --
> > Best regards,
> > Andrzej Bialecki <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
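[Editor's note] The robustness test Lukas asks for could look roughly like the following sketch. This is hypothetical and does not use Nutch's real parsing API; `parse` is a stand-in for any content parser that may throw on malformed input (like the unparseable .doc). The point it demonstrates is that a per-document parse failure should be recorded (here mirroring the "notparsed(0,0)" log line) instead of propagating and aborting the whole crawl.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RobustParseSketch {

    // Stand-in parser (hypothetical): throws on anything that does not
    // start with '<', simulating a document the real parser chokes on.
    static String parse(byte[] content) {
        if (content.length == 0 || content[0] != '<') {
            throw new IllegalArgumentException("failed to parse");
        }
        return new String(content);
    }

    // Parse every document, catching per-document failures so one bad
    // file cannot kill the whole run.
    static Map<String, String> parseAll(Map<String, byte[]> docs) {
        Map<String, String> results = new LinkedHashMap<>();
        for (Map.Entry<String, byte[]> e : docs.entrySet()) {
            try {
                results.put(e.getKey(), parse(e.getValue()));
            } catch (RuntimeException ex) {
                // Mark the failure and move on, as the log suggests
                // Nutch should: notparsed(0,0).
                results.put(e.getKey(), "notparsed(0,0)");
            }
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("good.html", "<html>ok</html>".getBytes());
        // Two bytes of OLE2 magic: unparseable by the stand-in parser.
        docs.put("bad.doc", new byte[] { (byte) 0xD0, (byte) 0xCF });

        Map<String, String> out = parseAll(docs);
        System.out.println(out.get("good.html")); // parsed normally
        System.out.println(out.get("bad.doc"));   // recorded as notparsed(0,0)
    }
}
```

A test built on this idea would assert that a crawl over a mixed set of good and bad documents completes and that every document ends up with a status, rather than the run throwing.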
