Hi Mike,
Thanks for your advice.
However, thinking about it: the problem happens at level two and not at level one, which means that you successfully fetched the link you mentioned, but you couldn't fetch the links it points to.

So you actually have to find the link at the second level that causes the problem; something like the little helper below can list the candidates. Anyway, as you mentioned, removing the js parser alone is not enough; I still have the same problem.
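To narrow it down, a quick standalone helper (hypothetical, plain JDK, nothing Nutch-specific) can print the outlinks of the depth-1 page so that each depth-2 URL can be tested on its own:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ListOutlinks {
        public static void main(String[] args) throws Exception {
            // The depth-1 page from this thread; swap in any failing page.
            URL page = new URL("http://www.globalmedlaw.com/Canadam.html");
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(page.openStream()));
            StringBuilder html = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                html.append(line).append('\n');
            }
            in.close();
            // Crude href extraction; enough to enumerate depth-2 candidates.
            Matcher m = Pattern.compile("href\\s*=\\s*[\"']([^\"']+)[\"']",
                    Pattern.CASE_INSENSITIVE).matcher(html);
            while (m.find()) {
                System.out.println(m.group(1));
            }
        }
    }

Feeding the printed URLs one at a time into a depth-1 test crawl should point at the exact page that trips the parser.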
Thanks again,
Rafit


From: Mike Smith <[EMAIL PROTECTED]>
Reply-To: nutch-user@lucene.apache.org
To: nutch-user@lucene.apache.org
Subject: Re: Problems with MapRed-
Date: Wed, 1 Feb 2006 17:52:17 -0800

Hi Andrzej,

I repeated the crawl with the JS parser plugged in and the problem happened again, but with the JS parser removed everything goes smoothly. I am using a single machine and everything is running locally, but over NDFS. Have you tried that URL to see if you can crawl it to depth 2?
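For reference, I kick the crawl off with the one-shot crawl tool, roughly like this (the seed-list file name is a guess; the -dir value matches the segment path in the log below):

    bin/nutch crawl urls -dir c_tac -depth 2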
In the tasktracker log, at the end of the reduce phase of the fetcher, I get these exceptions:


060201 170138 task_r_1x07jh 1.0% reduce > reduce
060201 170138 Task task_r_1x07jh is done.
060201 170138 Server connection on port 50050 from 164.67.195.201: exiting
060201 171130 Task task_r_lhf8xv timed out.  Killing.
060201 171131 Server connection on port 50050 from 164.67.195.201: exiting
060201 171131 task_r_lhf8xv Child Error
java.io.IOException: Task process exit with nonzero status.
        at org.apache.nutch.mapred.TaskRunner.runChild(TaskRunner.java:139)
        at org.apache.nutch.mapred.TaskRunner.run(TaskRunner.java:92)



Then it keeps giving these errors:



060201 171136 task_r_lhf8xv  Error running child
060201 171136 task_r_lhf8xv java.io.IOException: Cannot create file /user/nima/c_tac/segments/20060201170028/crawl_fetch/part-00000/data on client NDFSClient_-52973852
060201 171136 task_r_lhf8xv     at org.apache.nutch.ipc.Client.call(Client.java:294)
060201 171136 task_r_lhf8xv     at org.apache.nutch.ipc.RPC$Invoker.invoke(RPC.java:127)
060201 171136 task_r_lhf8xv     at $Proxy1.create(Unknown Source)
060201 171136 task_r_lhf8xv     at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.nextBlockOutputStream(NDFSClient.java:546)
060201 171136 task_r_lhf8xv     at org.apache.nutch.ndfs.NDFSClient$NDFSOutputStream.<init>(NDFSClient.java:521)
060201 171136 task_r_lhf8xv     at org.apache.nutch.ndfs.NDFSClient.create(NDFSClient.java:83)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fs.NDFSFileSystem.createRaw(NDFSFileSystem.java:71)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fs.NFSDataOutputStream$Summer.<init>(NFSDataOutputStream.java:41)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fs.NFSDataOutputStream.<init>(NFSDataOutputStream.java:129)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:187)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fs.NutchFileSystem.create(NutchFileSystem.java:174)
060201 171136 task_r_lhf8xv     at org.apache.nutch.io.SequenceFile$Writer.<init>(SequenceFile.java:94)
060201 171136 task_r_lhf8xv     at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:108)
060201 171136 task_r_lhf8xv     at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:76)
060201 171136 task_r_lhf8xv     at org.apache.nutch.fetcher.FetcherOutputFormat.getRecordWriter(FetcherOutputFormat.java:50)
060201 171136 task_r_lhf8xv     at org.apache.nutch.mapred.ReduceTask.run(ReduceTask.java:242)
060201 171136 task_r_lhf8xv     at org.apache.nutch.mapred.TaskTracker$Child.main(TaskTracker.java:603)
060201 171137 Server connection on port 50050 from 164.67.195.201: exiting


I am still crawling a larger set; I will update as soon as it finishes.

Thanks, Mike.
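
P.S. For reference, a sketch of how that plugin.includes setting sits in conf/nutch-site.xml as a complete property element (the description text here is only illustrative):

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      <description>Plugin list with parse-js left out; changing the parse
      entry to parse-(text|html|js) brings the failing reduce back.</description>
    </property>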





On 2/1/06, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>
> Mike Smith wrote:
> > I finally found out why this problem happens: there seems to be a problem
> > with the JS parser, because I used this:
> >
> > <name>plugin.includes</name>
> >
> > <value>protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> >
> > instead of the default one, which has JS in it, and I could fetch
> > http://www.globalmedlaw.com/Canadam.html to depth 2. But when I use
> >
> > <name>plugin.includes</name>
> >
> > <value>protocol-http|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)</value>
> >
> > the reduce will fail at the end of fetching. I came up with this solution
> > because that page was using a redirected JS page for some dynamic content,
> > but with the JS plugin removed it worked fine. Now I am going to run a
> > larger crawl over 100,000 seed URLs to see if this really solved the
> > problem.
> >
> > Do you have any problems with the JS parser?
> >
>
> That's an interesting observation. Could you perhaps check what exception
> (if any) the JS parser throws when it's failing? It could be emitted into
> one of the tasktracker logs.
>
> --
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
>
