Sorry, I'm not sure whether the issue is resolved in 0.8.
I don't think so, since none of the Nutch core developers have paid
attention to the bug so far.
I tried the latest 0.8-dev version, but it doesn't run at the moment:
060218 175839 parsing file:/D:/dev_soft/nutch08dev_20060217/conf/hadoop-site.xml
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml, mapred-default.xml, \tmp\hadoop\mapred\local\localRunner\job_lrdugz.xml
final: hadoop-site.xml
    at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84)
    at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70)
060218 175840 map 0% reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
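For what it's worth, the "No input directories specified" error usually means the crawl tool was started without a seed-URL directory argument, rather than indicating a broken build. A minimal sketch of preparing one (the directory name, file name, URL, and crawl options below are illustrative placeholders, not taken from this thread):

```shell
# Create a seed directory containing a text file with one URL per line.
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Then pass the directory as the input argument to the crawl tool,
# e.g. (assuming a Nutch 0.8-dev checkout; shown as a comment since it
# requires a local Nutch installation):
#   bin/nutch crawl urls -dir crawl -depth 3
```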
On 2/16/06, Franz Werfel <[EMAIL PROTECTED]> wrote:
> Hey, thanks a lot, that worked!! ;-)
> BTW, is the issue resolved in 0.8?
> Thanks again,
> Frank.
>
>
> On 2/16/06, mos <[EMAIL PROTECTED]> wrote:
> > Try increasing the value of this parameter:
> >
> > <property>
> > <name>fetcher.threads.per.host</name>
> > <value>1</value>
> > </property>
> >
> > This can help if you are crawling pages from a single host and running
> > into time-outs.
> >
> > By the way:
> > It's important to avoid time-outs, because Nutch 0.7.1 has a bug that
> > prevents the crawler from refetching those pages. See:
> > http://issues.apache.org/jira/browse/NUTCH-205
> > (at the moment the Apache JIRA is unavailable)
> >
> >
> >
> >
> >
> > On 2/16/06, Franz Werfel <[EMAIL PROTECTED]> wrote:
> > > Hello, when trying to fetch pages from a specific web site, I end up
> > > with 80% of the fetches timing out. Those 80% are always the same URLs
> > > (not random), and they time out no matter which limits I set in
> > > fetcher.server.delay and http.max.delays (retries).
> > > However, those same pages load fine when retrieved from a browser, and
> > > they use no redirects, etc. In fact, they seem no different from the
> > > pages that do not time out (although they must be different in some way?).
> > > I am at a loss to understand what is going on. In what direction
> > > should one go to investigate this problem?
> > > Thanks,
> > > Frank.
> > >
> >
>
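The suggestion quoted above, raising fetcher.threads.per.host, would go in conf/nutch-site.xml so it overrides the shipped default of 1. A sketch (the value 4 is illustrative, not a recommendation from this thread):

```xml
<!-- conf/nutch-site.xml: allow more than one fetcher thread to hit the
     same host at once, so threads don't give up after exceeding
     http.max.delays while waiting their turn. -->
<property>
  <name>fetcher.threads.per.host</name>
  <value>4</value>
</property>
```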