Sorry, I'm not sure whether the issue is resolved in 0.8.
I don't think so, because none of the Nutch core developers have paid
attention to the bug so far.
I tried the latest 0.8-dev version, but it doesn't run at the moment:
060218 175839 parsing file:/D:/dev_soft/nutch08dev_20060217/conf/hadoop-site.xml
java.io.IOException: No input directories specified in: Configuration:
defaults: hadoop-default.xml, mapred-default.xml,
\tmp\hadoop\mapred\local\localRunner\job_lrdugz.xml
final: hadoop-site.xml
at org.apache.hadoop.mapred.InputFormatBase.listFiles(InputFormatBase.java:84)
at org.apache.hadoop.mapred.InputFormatBase.getSplits(InputFormatBase.java:94)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:70)
060218 175840 map 0% reduce 0%
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:310)
at org.apache.nutch.crawl.Injector.inject(Injector.java:114)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:104)
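In case it helps anyone hitting the same trace: "No input directories
specified" from the Injector usually means the seed-URL directory given to
the crawl command does not exist or is empty. A minimal sketch, assuming a
Nutch 0.8-dev checkout; the names urls, seeds.txt, and crawldir are
illustrative, not from my setup:

```shell
# Illustrative only: create a seed directory containing a plain-text
# file with one URL per line, then point the crawl command at it.
mkdir -p urls
echo "http://www.example.com/" > urls/seeds.txt
bin/nutch crawl urls -dir crawldir -depth 3
```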
On 2/16/06, Franz Werfel <[EMAIL PROTECTED]> wrote:
> Hey, thanks a lot, that worked!! ;-)
> BTW, is the issue resolved in 0.8?
> Thanks again,
> Frank.
>
>
> On 2/16/06, mos <[EMAIL PROTECTED]> wrote:
> > Try increasing the value of the parameter
> >
> > <property>
> > <name>fetcher.threads.per.host</name>
> > <value>1</value>
> > </property>
> >
> > This can help if you are crawling pages from a single host and running
> > into time-outs.
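> > A minimal override sketch for conf/nutch-site.xml; the value 2 is only
> > illustrative, tune it for the host you are crawling:

```xml
> > <property>
> >   <name>fetcher.threads.per.host</name>
> >   <value>2</value>
> > </property>
```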
> >
> > By the way:
> > It's important to avoid time-outs because in Nutch 0.7.1 there is a bug that
> > prevents the crawler from refetching those pages. See:
> > http://issues.apache.org/jira/browse/NUTCH-205
> > (At the moment the Apache JIRA is unavailable.)
> >
> > On 2/16/06, Franz Werfel <[EMAIL PROTECTED]> wrote:
> > > Hello, When trying to fetch pages from a specific web site, I end up
> > > with 80% of the fetches timing out. Those 80% are always the same URLs
> > > (not random) and time out no matter what limits I set for
> > > fetcher.server.delay and http.max.delays (retries).
> > > However, those same pages load fine when retrieved from a browser, and
> > > involve no redirects, etc. In fact, they seem no different from the pages
> > > that do not time out (although they must differ in some way?).
> > > I am at a loss to understand what is going on. In what direction
> > > should one go to investigate this problem?
> > > Thanks,
> > > Frank.
> > >
> >
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general