Hi,

I found the reason for that exception!
If you look at my crawl.log carefully, you will notice these lines:

060104 213608 Parsing
[http://220.000.000.001/otd_04_Detailed_Design_Document.doc] with
[EMAIL PROTECTED]
060104 213609 Unable to successfully parse content
http://220.000.000.001/otd_04_Detailed_Design_Document.doc of type
application/msword
060104 213609 Error parsing:
http://220.000.000.001/otd_04_Detailed_Design_Document.doc:
notparsed(0,0)
060104 213609 Using Signature impl: org.apache.nutch.crawl.MD5Signature

I have one Word document which cannot be parsed properly, and this
causes the issue. If I remove this document, Nutch finishes correctly.
But if the list of files contains this Word document (or even if it is
the only document to be crawled), I always receive that exception.

Can anybody look at this issue?
If anybody is interested in that Word document I can send it (but I
really wouldn't like to see it becoming a regular part of the Nutch test
package; you know, it is some kind of internal document [though it does
not contain any useful information] :-)

But in general I think there should be a test in Nutch to ensure
that it can withstand such documents.
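To illustrate what such a test would check, here is a minimal sketch of the defensive pattern: a parser that reports failure through a result object instead of letting the exception escape, so one bad document cannot abort the whole fetch. Note this is NOT Nutch's actual API; the class and method names below are hypothetical, made up purely for illustration.

```java
// Hypothetical sketch (not Nutch's real API): a parse attempt that
// converts any parser exception into a failure result, so the caller
// can log "Unable to successfully parse content" and keep crawling.
public class SafeParseDemo {

    /** Result of a parse attempt: either extracted text or a failure reason. */
    static final class ParseResult {
        final boolean success;
        final String textOrReason;
        ParseResult(boolean success, String textOrReason) {
            this.success = success;
            this.textOrReason = textOrReason;
        }
    }

    /** Pretends to parse a document; corrupt input yields a failure result. */
    static ParseResult parse(byte[] content) {
        try {
            if (content == null || content.length == 0) {
                // Stand-in for a real format-specific parser blowing up.
                throw new IllegalArgumentException("empty document");
            }
            return new ParseResult(true, new String(content));
        } catch (Exception e) {
            // Swallow the parser error and report it as data, not a crash.
            return new ParseResult(false, "notparsed: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        ParseResult good = parse("hello".getBytes());
        ParseResult bad  = parse(new byte[0]);
        System.out.println(good.success);        // true
        System.out.println(bad.success);         // false
        System.out.println(bad.textOrReason);    // notparsed: empty document
    }
}
```

A regression test would then feed a known-corrupt .doc through this path and assert that the crawl run still completes.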

Regards,
Lukas

On 1/5/06, Lukas Vlcek <[EMAIL PROTECTED]> wrote:
> Hi Andrzej,
>
> This is the property that sets Fetcher parsing to true or false, right?
>
> <property>
>   <name>fetcher.parse</name>
>   <value>true</value>
>   <description>If true, fetcher will parse content.</description>
> </property>
>
> I don't have my nutch-default.xml and nutch-site.xml files with me right
> now, but I would say with 95% certainty that I didn't change this value
> in my nutch-site.xml (and I didn't change nutch-default.xml at all).
>
> So the answer is YES, Fetcher is in parsing mode (with ~95% confidence).
>
> I am running Nutch against my local Apache server (not visible to you).
> But you may have noticed that I used depth=2, so only a few pages (16 to
> be exact) are crawled. If you are interested, I can send them all to you
> so that you can upload this content to any server you need for your
> tests.
>
> Look in the crawl.log file (attached to the previous email, sent at
> 8:21am today) for details.
>
> I will try to reproduce this issue with one or two arbitrary HTML
> pages. If that produces the issue, I can send them to you.
>
> Lukas
>
> On 1/5/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> > Lukas Vlcek wrote:
> >
> > >How can I learn that?
> > >What I do is running regular one-step command [/bin/nutch crawl]
> > >
> > >
> >
> > In that case your nutch-default.xml / nutch-site.xml decides, there is a
> > boolean option there. If you didn't change this, then it defaults to
> > true (i.e. your fetcher is parsing the content).
> >
> > Would it be easy to reproduce this if I knew the seed URLs? If that's
> > the case, please send me the seed URLs (contact me off the list if
> > they're sensitive).
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
> >
> >
>
