Hi, I do not have access to the other website's server, so I cannot see what is going on on that side. All I know is that when I open the website in a browser, the pages and documentation load and the site appears to be running fine.
Is there a way to tell Nutch NOT to use the http proxy host and port at all, since the machine I crawl from and the website I am crawling are on the same network segment? Or does Nutch already ignore these parameters when they are null? I did find a difference in the log file depending on whether I delete the host and port parameters from nutch-site.xml entirely or put just "any" host and port in there.
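For reference, this is roughly what I put into nutch-site.xml for the second run below; the values are just placeholders I made up (they happen to be the host and port of the site itself, not a real proxy), so treat this snippet as an illustration only:

<property>
  <name>http.proxy.host</name>
  <value>v4</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>10000</value>
</property>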
Log when Nutch uses the default host and port, i.e. when I entirely delete the host and port parameters from nutch-site.xml:
----------------------------------------------------------------------------------------------------------------

Fetcher: starting
Fetcher: segment: crawl/segments/20080620183434
Fetcher: threads: 10
fetching http://v4:10000/lib
http.proxy.host = null          <------------------------
http.proxy.port = 8080          <------------------------
http.timeout = 300000
http.content.limit = 262144
http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
java.net.SocketException: Connection reset          <------------------------
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080620183434]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080620183408
LinkDb: adding segment: crawl/segments/20080620183424
LinkDb: adding segment: crawl/segments/20080620183434
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080620183408
Indexer: adding segment: crawl/segments/20080620183424
Indexer: adding segment: crawl/segments/20080620183434
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Log when Nutch uses the host and port parameters from nutch-site.xml, where I just placed placeholder values:
----------------------------------------------------------------------------------------------------------------

Fetcher: starting
Fetcher: segment: crawl/segments/20080620184021
Fetcher: threads: 10
fetching http://iweb.corp.ebay.com/
fetching http://v4:10000/lib
http.proxy.host = v4            <------------------------
http.proxy.port = 10000         <------------------------
http.timeout = 300000
http.content.limit = 262144
http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
(the same settings block and "Configured Client" line are printed a second time for the second fetcher thread)
java.net.SocketException: Connection reset          <------------------------
(this identical trace is printed twice, interleaved, once for each failing fetch)
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
fetch of http://iweb.corp.ebay.com/ failed with: java.net.SocketException: Connection reset
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080620184021]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080620184000
LinkDb: adding segment: crawl/segments/20080620184010
LinkDb: adding segment: crawl/segments/20080620184021
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080620184000
Indexer: adding segment: crawl/segments/20080620184010
Indexer: adding segment: crawl/segments/20080620184021
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
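Since both runs fail the same way and the stack traces go through org.apache.commons.httpclient (i.e. the protocol-httpclient plugin), my next step is a small stand-alone test against the same URL with that same library, to see whether the connection reset happens outside Nutch as well. This is only a sketch I put together myself (the class name, timeouts and the commented-out HTTP/1.0 line are my own choices, not Nutch code):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpVersion;
import org.apache.commons.httpclient.methods.GetMethod;

public class ConnectionResetProbe {
    public static void main(String[] args) throws Exception {
        // Same URL the fetcher chokes on; can be overridden on the command line.
        String url = args.length > 0 ? args[0] : "http://v4:10000/lib";

        HttpClient client = new HttpClient();
        // Mirror the generous timeout Nutch reports (http.timeout = 300000 ms).
        client.getHttpConnectionManager().getParams().setConnectionTimeout(300000);
        client.getHttpConnectionManager().getParams().setSoTimeout(300000);

        GetMethod get = new GetMethod(url);
        // Uncomment to force HTTP/1.0 and test the VMware HTTP 1.0 vs. 1.1 theory
        // mentioned further down in this thread:
        // get.getParams().setVersion(HttpVersion.HTTP_1_0);
        try {
            int status = client.executeMethod(get);
            System.out.println("Status: " + status);
            System.out.println(get.getResponseBodyAsString());
        } finally {
            get.releaseConnection();
        }
    }
}

If this stand-alone test also gets "Connection reset", the problem would seem to sit between this machine and the v4 server rather than in the Nutch configuration itself.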
Thanks,
Ann Del Rio

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 19, 2008 10:54 PM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?

Hi Ann,

Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't even seem to be able to connect to your server. It never gets to see the HTML and the frames in it. Perhaps there is something useful in the logs not on the Nutch side, but on that v4 server.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, June 19, 2008 6:54:15 PM
> Subject: RE: how does nutch connect to urls internally?
>
> Hello,
>
> I tried this simple JUnit program before I try the Nutch classes for HTTP:
>
> import java.io.BufferedInputStream;
> import java.io.StringWriter;
> import java.net.URL;
>
> import junit.framework.TestCase;
>
> public class BinDoxTest extends TestCase {
>     public void testHttp() {
>         try {
>             URL url = new URL("http://v4:10000/lib");
>             StringWriter writer = new StringWriter();
>             BufferedInputStream in = new BufferedInputStream(url.openStream());
>             for (int c = in.read(); c != -1; c = in.read()) {
>                 writer.write(c);
>             }
>             System.out.println(writer);
>         } catch (Exception e) {
>             // TODO: handle exception
>         }
>     }
> }
>
> And I got the following output, which is the same as what I get if I issue a wget
> from a Linux shell (it is a frameset page; the opening tags were stripped by the
> mail archive, only the attribute fragments survive):
>
> "http://www.w3.org/TR/html4/loose.dtd">
> Bindox Library>
> href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico" type="image/vnd.microsoft.icon">
> border="4" frameborder="1" scrolling="no">
> marginwidth="0" marginheight="0" scrolling="no" frameborder="1" resize=yes>
> src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" resize=yes>
>
> Can you please help me figure out whether there is something funky with this
> starting page of the website that makes Nutch give me a "SocketException:
> Connection reset" error when I run Nutch to start indexing from the page above?
> Can Nutch index "frames"?
>
> I will also try HTTP/1.1 next, as our network admin said it might be an issue
> with VMware freezing or timing out on HTTP 1.0 but not HTTP 1.1.
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: Susam Pal [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 9:48 AM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
>
> Hi,
>
> It depends on which protocol plugin is enabled in your 'conf/nutch-site.xml'.
> The property to look for is 'plugin.includes' in the XML file. If this is not
> present in 'conf/nutch-site.xml', it means you are using the default
> 'plugin.includes' of 'conf/nutch-default.xml'.
>
> If protocol-http is enabled, then you have to go through the code in:
>
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
>
> If protocol-httpclient is enabled, then you have to go through:
>
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
>
> Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues about
> the problems. The logs are written to 'logs/hadoop.log'. To enable the DEBUG
> logs for a particular package, say, the httpclient package, you can open
> 'conf/log4j.properties' and add the following line:
>
> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>
> Regards,
> Susam Pal
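Following up in-line on the plugin question: judging from the stack traces at the top of this mail (org.apache.nutch.protocol.httpclient.*), it is protocol-httpclient that is being used in my setup. For reference, the plugin.includes property would look something like the snippet below; I am only sketching it from the nutch-default.xml default with protocol-http swapped for protocol-httpclient, so the exact plugin list on a given install may differ:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>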
> On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > Good morning,
> >
> > Can you please point me to the Nutch documentation where I can find how
> > Nutch connects to the web pages when it crawls? I think it is through HTTP,
> > but I would like to confirm and get more details so I can write a very small
> > test Java program to connect to one of the web pages I am having trouble
> > connecting to / crawling. I bought Lucene in Action and am halfway through
> > the book, and so far there is very little about Nutch.
> >
> > Thanks,
> > Ann Del Rio
