You may be running into regex filter stalls during URL filtering. Try removing the regex filter (urlfilter-regex) from the plugin.includes property in nutch-site.xml. I had similar problems until I switched to using only the prefix and suffix filters, as below. I have attached my prefix and suffix URL filter files, which go in conf. I am only indexing http files, so you may need to modify them. Hope this helps.

<property>
 <name>plugin.includes</name>
 <value>protocol-http|urlfilter-(suffix|prefix)|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic</value>
 <description>Regular expression naming plugin directory names to
 include.  Any plugin not matching this expression is excluded.
 In any case you need at least include the nutch-extensionpoints plugin. By
 default Nutch includes crawling just HTML and plain text via HTTP,
 and basic indexing and search plugins.
 </description>
</property>
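
To see which plugins a plugin.includes value like the one above keeps, here is a minimal Python sketch. It assumes the expression has to match the whole plugin directory name, which is how I read the description; I have not checked it against the Nutch source. The list of candidate directories is made up for illustration.

```python
import re

# Value copied from the plugin.includes property above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-(suffix|prefix)|parse-(text|html|js)|"
    "index-basic|query-(basic|site|url)|summary-basic|scoring-opic"
)


def included_plugins(plugin_dirs, includes=PLUGIN_INCLUDES):
    """Return the directory names that match the expression in full.

    Assumption: a whole-name match is required, not a substring match.
    """
    pattern = re.compile(includes)
    return [name for name in plugin_dirs if pattern.fullmatch(name)]


# Hypothetical listing of a plugins directory, for illustration only.
candidates = [
    "protocol-http", "urlfilter-regex", "urlfilter-prefix",
    "urlfilter-suffix", "parse-html", "parse-pdf", "scoring-opic",
]

print(included_plugins(candidates))
```

With this value, urlfilter-regex and parse-pdf drop out, while the prefix and suffix filters stay in.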

Dennis

Vishal Shah wrote:
Hi,
I've been trying to get the nutch fetcher to work for a couple of
days, but it always hangs on one of the reduce processes, and the job is
aborted. I am using numFetchers=24 during generate, 24 map tasks and 6
reduce tasks during fetch on a 3-machine cluster. The task that failed
was tried at least 3 times before the job was aborted.
I looked into the logs on one of the machines with the failed tasks,
and I see these errors:
1) 2006-09-08 18:04:03,294 INFO mapred.TaskTracker - task_0003_r_000004_3: Task failed to report status for 608 seconds. Killing.
2) java.lang.IllegalStateException
        at org.mortbay.jetty.servlet.ServletHttpResponse.getWriter(ServletHttpResponse.java:561)
        at org.apache.jasper.runtime.JspWriterImpl.initOut(JspWriterImpl.java:122)
        at org.apache.jasper.runtime.JspWriterImpl.flushBuffer(JspWriterImpl.java:115)
        at org.apache.jasper.runtime.PageContextImpl.release(PageContextImpl.java:190)
        at org.apache.jasper.runtime.JspFactoryImpl.internalReleasePageContext(JspFactoryImpl.java:115)
        at org.apache.jasper.runtime.JspFactoryImpl.releasePageContext(JspFactoryImpl.java:75)
        at org.apache.hadoop.mapred.getMapOutput_jsp._jspService(getMapOutput_jsp.java:100)
        at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:94)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:802)
        at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:427)
        at org.mortbay.jetty.servlet.WebApplicationHandler.dispatch(WebApplicationHandler.java:475)
        at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:567)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1565)
        at org.mortbay.jetty.servlet.WebApplicationContext.handle(WebApplicationContext.java:635)
        at org.mortbay.http.HttpContext.handle(HttpContext.java:1517)
        at org.mortbay.http.HttpServer.service(HttpServer.java:954)
        at org.mortbay.http.HttpConnection.service(HttpConnection.java:814)
        at org.mortbay.http.HttpConnection.handleNext(HttpConnection.java:981)
        at org.mortbay.http.HttpConnection.handle(HttpConnection.java:831)
        at org.mortbay.http.SocketListener.handleConnection(SocketListener.java:244)
        at org.mortbay.util.ThreadedServer.handle(ThreadedServer.java:357)
        at org.mortbay.util.ThreadPool$PoolThread.run(ThreadPool.java:534)
Any idea where the problem is, and how to rectify it? Regards, -vishal.

http
# config file for urlfilter-suffix plugin

# case-insensitive, allow unknown suffixes
+I

# prohibit these
.gif
.jpg
.jpeg
.bmp
.png
.ico
.css
.sit
.eps
.wmf
.zip
.ppt
.mpg
.xls
.gz
.tar
.rpm
.rm
.tgz
.mov
.exe
.vid
.ai
.pdf
.txt
.psd
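
For anyone adapting these two files, here is a rough Python sketch of how I understand the prefix and suffix checks to combine. It is not the Nutch implementation. It assumes a URL has to start with one of the listed prefixes, that a "+" mode line means unlisted suffixes are allowed and listed ones rejected (with "-" meaning the reverse), that "I" means case-insensitive, and that the suffix is compared against the end of the full URL string. The function names and sample URLs are made up for illustration.

```python
def load_suffix_rules(text):
    """Parse suffix filter text: an optional mode line, then one suffix per line."""
    allow_unknown, ignore_case, suffixes = True, False, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line[0] in "+-":
            # Assumed meaning: "+" allows unlisted suffixes, "-" rejects them.
            allow_unknown = line[0] == "+"
            ignore_case = "I" in line[1:]
            continue
        suffixes.append(line)
    if ignore_case:
        suffixes = [s.lower() for s in suffixes]
    return allow_unknown, ignore_case, tuple(suffixes)


def accept(url, prefixes, suffix_rules):
    """Return True if the URL passes the prefix check and then the suffix check."""
    allow_unknown, ignore_case, suffixes = suffix_rules
    if not url.startswith(tuple(prefixes)):
        return False
    candidate = url.lower() if ignore_case else url
    listed = candidate.endswith(suffixes)
    return not listed if allow_unknown else listed


# Shortened copy of the suffix file above, plus the one-line prefix file.
SUFFIX_FILE = """# case-insensitive, allow unknown suffixes
+I
.gif
.pdf
.txt
"""
PREFIXES = ["http"]

rules = load_suffix_rules(SUFFIX_FILE)
for url in ("http://example.com/index.html",
            "http://example.com/logo.GIF",
            "ftp://example.com/index.html"):
    print(url, accept(url, PREFIXES, rules))
```

Both checks are plain string comparisons, so neither one can backtrack the way a regular expression can.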
_______________________________________________
Nutch-general mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-general
