Huh...
Is anybody interested in this?
Normally I wouldn't be so pushy, but it seems to me that Nutch dies if it
encounters a Word document that can't be parsed. This looks like a serious
issue to me.
Or did I overlook something important or fundamental?
Lukas
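
The behaviour asked for here, a crawl that survives one unparseable document, can be sketched as a per-document try/catch. This is only an illustration under stated assumptions: `DocParser`, `BatchResult`, and `parseAll` are hypothetical names invented for the sketch, not Nutch classes or the actual Nutch fix, and the URLs are placeholders.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TolerantParseSketch {

    /** Hypothetical stand-in for a document parser; NOT the Nutch parser API. */
    interface DocParser {
        String parse(String url, byte[] content) throws Exception;
    }

    /** Outcome of a batch: extracted text per URL, plus the URLs that failed. */
    static final class BatchResult {
        final Map<String, String> parsed = new LinkedHashMap<>();
        final List<String> failed = new ArrayList<>();
    }

    /** Parses every document; a failure is recorded and the loop continues. */
    static BatchResult parseAll(Map<String, byte[]> docs, DocParser parser) {
        BatchResult result = new BatchResult();
        for (Map.Entry<String, byte[]> entry : docs.entrySet()) {
            try {
                String text = parser.parse(entry.getKey(), entry.getValue());
                if (text == null) {
                    // Treat "no result" like a parse failure instead of crashing later.
                    result.failed.add(entry.getKey());
                } else {
                    result.parsed.put(entry.getKey(), text);
                }
            } catch (Exception ex) {
                // One bad document must not abort the whole batch.
                result.failed.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, byte[]> docs = new LinkedHashMap<>();
        docs.put("http://example.invalid/good.html",
                 "hello".getBytes(StandardCharsets.UTF_8));
        docs.put("http://example.invalid/broken.doc", new byte[0]);

        // Toy parser (assumption): rejects empty content, otherwise decodes UTF-8.
        DocParser parser = (url, content) -> {
            if (content.length == 0) {
                throw new IllegalArgumentException("cannot parse " + url);
            }
            return new String(content, StandardCharsets.UTF_8);
        };

        BatchResult result = parseAll(docs, parser);
        System.out.println("parsed=" + result.parsed.keySet()
                + " failed=" + result.failed);
    }
}
```

The point of the design is that a failed parse becomes data (an entry in `failed`) rather than an exception that ends the run, so later steps can skip or report the document.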

On 1/6/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I found the reason for that exception!
> If you look carefully at my crawl.log, you will notice these lines:
>
> 060104 213608 Parsing
> [http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with
> [EMAIL PROTECTED]
> 060104 213609 Unable to successfully parse content
> http://220.000.000.001/otd_04_Detailed_Design_Document.doc of type
> application/msword
> 060104 213609 Error parsing:
> http://220.000.000.001/otd_04_Detailed_Design_Document.doc:
> notparsed(0,0)
> 060104 213609 Using Signature impl: org.apache.nutch.crawl.MD5Signature
>
> I have one Word document that cannot be parsed properly, and it causes
> the issue. If I remove this document, Nutch finishes correctly.
> But if the list of files contains this Word document (even if it
> is the only document to be crawled), I always get that
> exception.
>
> Can anybody look at this issue?
> If anybody is interested in that Word document, I can send it (but I
> really wouldn't like to see it become a regular part of the Nutch test
> package; it is an internal document, though it does
> not contain any useful information) :-)
>
> In general, though, I think Nutch should have a test to ensure
> that it can cope with such documents.
>
> Regards,
> Lukas
>
> On 1/5/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> > Hi Andrzej,
> >
> > This is the property that controls whether Fetcher parses content, right?
> >
> > <property>
> >   <name>fetcher.parse</name>
> >   <value>true</value>
> >   <description>If true, fetcher will parse content.</description>
> > </property>
> >
> > I don't have my nutch-default and nutch-site files with me right now,
> > but I am about 95% sure that I didn't change this value in my
> > nutch-site (and I didn't change nutch-default at all).
> >
> > So the answer is YES, Fetcher is in parsing mode (with ~95% confidence).
> >
> > I am running Nutch against my local Apache server (not visible to you). But
> > you may have noticed that I used depth=2, so only a few pages (16 to be
> > exact) are crawled. If you are interested, I can send you all of them so
> > that you can upload the content to any server you need for your
> > tests.
> >
> > Look into the crawl.log file (attached to my previous email, sent at 8:21am
> > today) for details.
> >
> > I will try to reproduce this issue with one or two arbitrary HTML
> > pages. If that triggers the issue, I can send them to you.
> >
> > Lukas
> >
> > On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > > Lukas Vlcek wrote:
> > >
> > > >How can I learn that?
> > > >What I do is run the regular one-step command [/bin/nutch crawl]
> > > >
> > > >
> > >
> > > In that case your nutch-default.xml / nutch-site.xml decides; there is a
> > > boolean option there. If you didn't change this, then it defaults to
> > > true (i.e. your fetcher is parsing the content).
> > >
> > > Would it be easy to reproduce this if I knew the seed URLs? If that's the
> > > case, please send me the seed urls (contact me off the list, if it's
> > > sensitive).
> > >
> > > --
> > > Best regards,
> > > Andrzej Bialecki     <><
> > >  ___. ___ ___ ___ _ _   __________________________________
> > > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > > http://www.sigram.com  Contact: info at sigram dot com
> > >
> > >
> > >
> >
>
