Hi Ann,

No, the v4 box doesn't have to support telnet. You are connecting to port 10000 on v4, which is supposed to be some kind of HTTP server, and it looks like that server closes the connection on you. I assume that "Connection closed...." happens immediately, ja?

Curl either has something smart in it to keep this connection from closing, or v4 somehow knows who/what connected and likes curl more than telnet.
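One way to tell which of the two it is: a throwaway probe that speaks HTTP over a raw socket and sends a complete request, including a User-Agent header. This is only an untested sketch; the class name and the User-Agent value are made up for the experiment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class RawHttpProbe {
    public static void main(String[] args) throws Exception {
        // Connect the way telnet does, but then send a complete HTTP
        // request the way curl does, so the two cases can be separated.
        Socket socket = new Socket("v4", 10000);
        Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
        out.write("GET /lib HTTP/1.0\r\n");
        out.write("Host: v4:10000\r\n");
        // Made-up User-Agent value; vary or drop it to see if the server cares.
        out.write("User-Agent: curl/7.18.0\r\n");
        out.write("\r\n");
        out.flush();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "US-ASCII"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        socket.close();
    }
}

If this prints a normal response, the server is simply dropping clients that sit idle without sending a request (which is what the bare telnet test does). If it still gets cut off, the server is inspecting the request itself.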
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 23, 2008 12:30:49 PM
> Subject: RE: how does nutch connect to urls internally?
>
> 1. Should I ask our network folks to enable the telnet service on the
> v4 box? Nutch uses HTTP, but does it also use telnet? I ask because I
> tried the following:
>
> -bash-2.05b$ telnet v4 10000
> Trying 10.254.231.40...
> Connected to vm-v4dev01.arch.ebay.com (10.254.231.40).
> Escape character is '^]'.
> Connection closed by foreign host.
>
> 2. This, on the other hand, worked:
>
> -bash-2.05b$ curl http://v4:10000/lib
> [HTML output mangled by the list archive: an HTML 4 transitional page
> (loose.dtd) titled "Bindox Library", with a favicon link and a nested
> frameset whose frames load pages such as
> /com/ebay/content/sharedcontent/topic/ContentFrame.jsp]
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 20, 2008 10:55 PM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
>
> That proxy port does look a little suspicious. I can't check my
> installation to tell you with certainty whether that proxy port should
> be printed like that or should be null, too.
>
> Not sure if we went through this already, but can you try:
>
> $ telnet v4 10000
> GET /lib HTTP/1.0
> (hit enter a few times here)
>
> What happens?
>
> Or:
> curl http://v4:10000/lib ?
>
> Or, if you have libwww:
> GET -UsSed http://v4:10000/lib
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: "Del Rio, Ann"
> > To: [email protected]
> > Sent: Friday, June 20, 2008 9:53:03 PM
> > Subject: RE: how does nutch connect to urls internally?
> >
> > Hi,
> >
> > I do not have access to the other website's server, so I cannot see
> > what is going on on that side; all I know is that when I open the
> > website in a browser, I see the pages and documentation, and the
> > website runs fine.
> >
> > Is there a way to tell Nutch NOT to use the HTTP proxy host and port,
> > since the server and the website I am crawling are on the same
> > network segment? Or does Nutch ignore these parameters when they are
> > null? I found a difference in the log file when I put just "any"
> > proxy host and port in the nutch-site.xml file.
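> > For reference, these are the two properties in question as they
> > would appear in conf/nutch-site.xml; the values here are only
> > placeholders, not a recommendation. Note that in the first log below
> > the port default (8080) is printed even though the host is null,
> > which suggests the port is ignored unless a proxy host is actually
> > set:
> >
> > <property>
> >   <name>http.proxy.host</name>
> >   <value>proxy.example.com</value>
> > </property>
> > <property>
> >   <name>http.proxy.port</name>
> >   <value>8080</value>
> > </property>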
> >
> > Log when Nutch uses the default host and port, i.e. when I entirely
> > delete the host and port parameters from nutch-site.xml:
> > ------------------------------------------------------------------------
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20080620183434
> > Fetcher: threads: 10
> > fetching http://v4:10000/lib
> > http.proxy.host = null          <------------------------
> > http.proxy.port = 8080          <------------------------
> > http.timeout = 300000
> > http.content.limit = 262144
> > http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> > protocol.plugin.check.blocking = true
> > protocol.plugin.check.robots = true
> > fetcher.server.delay = 1000
> > http.max.delays = 100
> > Configured Client
> > java.net.SocketException: Connection reset          <------------------------
> >   at java.net.SocketInputStream.read(SocketInputStream.java:168)
> >   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> >   at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> >   at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> >   at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> >   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> >   at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> >   at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> >   at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> >   at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> >   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> >   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> >   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> >   at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> >   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> >   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> >   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20080620183434]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20080620183408
> > LinkDb: adding segment: crawl/segments/20080620183424
> > LinkDb: adding segment: crawl/segments/20080620183434
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20080620183408
> > Indexer: adding segment: crawl/segments/20080620183424
> > Indexer: adding segment: crawl/segments/20080620183434
> > Optimizing index.
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Exception in thread "main" java.io.IOException: Job failed!
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >
> >
> > Log when Nutch uses the host and port parameters from nutch-site.xml,
> > where I just put anything:
> > ------------------------------------------------------------------------
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20080620184021
> > Fetcher: threads: 10
> > fetching http://iweb.corp.ebay.com/
> > fetching http://v4:10000/lib
> > http.proxy.host = v4          <------------------------
> > http.proxy.port = 10000          <------------------------
> > http.timeout = 300000
> > http.content.limit = 262144
> > http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> > protocol.plugin.check.blocking = true
> > protocol.plugin.check.robots = true
> > fetcher.server.delay = 1000
> > http.max.delays = 100
> > Configured Client
> > [the same settings and a second "Configured Client" line are printed
> > again by the second fetcher thread]
> > java.net.SocketException: Connection reset          <------------------------
> > java.net.SocketException: Connection reset          <------------------------
> > [two identical stack traces follow, interleaved line by line, one per
> > failed fetch; each is the same commons-httpclient trace shown in the
> > first log]
> > fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> > fetch of http://iweb.corp.ebay.com/ failed with: java.net.SocketException: Connection reset
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20080620184021]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20080620184000
> > LinkDb: adding segment: crawl/segments/20080620184010
> > LinkDb: adding segment: crawl/segments/20080620184021
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20080620184000
> > Indexer: adding segment: crawl/segments/20080620184010
> > Indexer: adding segment: crawl/segments/20080620184021
> > Optimizing index.
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Exception in thread "main" java.io.IOException: Job failed!
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >
> > Thanks,
> > Ann Del Rio
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 10:54 PM
> > To: [email protected]
> > Subject: Re: how does nutch connect to urls internally?
> >
> > Hi Ann,
> >
> > Regarding frames - that is not the problem here (with Nutch), as
> > Nutch doesn't even seem to be able to connect to your server. It
> > never gets to see the HTML and the frames in it. Perhaps there is
> > something useful in the logs, not on the Nutch side, but on that v4
> > server.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > ----- Original Message ----
> > > From: "Del Rio, Ann"
> > > To: [email protected]
> > > Sent: Thursday, June 19, 2008 6:54:15 PM
> > > Subject: RE: how does nutch connect to urls internally?
> > >
> > > Hello,
> > >
> > > Before trying the Nutch HTTP classes, I tried this simple JUnit
> > > program:
> > >
> > > import java.io.BufferedInputStream;
> > > import java.io.StringWriter;
> > > import java.net.URL;
> > > import junit.framework.TestCase;
> > >
> > > public class BinDoxTest extends TestCase {
> > >     public void testHttp() {
> > >         try {
> > >             URL url = new URL("http://v4:10000/lib");
> > >             StringWriter writer = new StringWriter();
> > >             BufferedInputStream in =
> > >                     new BufferedInputStream(url.openStream());
> > >             for (int c = in.read(); c != -1; c = in.read()) {
> > >                 writer.write(c);
> > >             }
> > >             System.out.println(writer);
> > >         } catch (Exception e) {
> > >             // Swallowing the exception would mean the test can
> > >             // never fail, so at least print it.
> > >             e.printStackTrace();
> > >         }
> > >     }
> > > }
> > >
> > > It produced the following output, which is the same as what I get
> > > from a wget in a Linux shell:
> > >
> > > [HTML output mangled by the list archive: the same HTML 4
> > > transitional "Bindox Library" frameset page shown above, with
> > > frames loading /com/ebay/content/sharedcontent/topic/ContentFrame.jsp]
> > >
> > > Can you please help shed some light on whether there is something
> > > funky with this starting page of the website, given that Nutch
> > > gives me a "SocketException: Connection reset" when it starts
> > > indexing from it? Can Nutch index "frames"?
> > >
> > > Next I will try the Nutch HTTP classes, as our network admin said
> > > it might be an issue with VMware freezing or timing out with HTTP
> > > 1.0 but not HTTP 1.1.
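> > > For what it's worth, the HTTP 1.0-vs-1.1 theory can be tested with
> > > the same commons-httpclient 3.x library that appears in the Nutch
> > > stack trace; if java.net.URL succeeds where this fails, the
> > > difference lies in the request HttpClient sends. This is an
> > > untested sketch (the class name is made up, and the HTTP/1.0 line
> > > is only the experiment, not something Nutch itself sets):
> > >
> > > import org.apache.commons.httpclient.HttpClient;
> > > import org.apache.commons.httpclient.HttpVersion;
> > > import org.apache.commons.httpclient.methods.GetMethod;
> > >
> > > public class HttpClientProbe {
> > >     public static void main(String[] args) throws Exception {
> > >         HttpClient client = new HttpClient();
> > >         GetMethod get = new GetMethod("http://v4:10000/lib");
> > >         // Force HTTP/1.0; comment this out to test HTTP/1.1,
> > >         // which is the commons-httpclient default.
> > >         get.getParams().setVersion(HttpVersion.HTTP_1_0);
> > >         try {
> > >             int status = client.executeMethod(get);
> > >             System.out.println("Status: " + status);
> > >             System.out.println(get.getResponseBodyAsString());
> > >         } finally {
> > >             get.releaseConnection();
> > >         }
> > >     }
> > > }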
> > > Thanks,
> > > Ann Del Rio
> > >
> > > -----Original Message-----
> > > From: Susam Pal [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, June 16, 2008 9:48 AM
> > > To: [email protected]
> > > Subject: Re: how does nutch connect to urls internally?
> > >
> > > Hi,
> > >
> > > It depends on which protocol plugin is enabled in your
> > > 'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
> > > in the XML file. If it is not present in 'conf/nutch-site.xml', it
> > > means you are using the default 'plugin.includes' from
> > > 'conf/nutch-default.xml'.
> > >
> > > If protocol-http is enabled, then you have to go through the code in:
> > >
> > > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> > > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
> > >
> > > If protocol-httpclient is enabled, then you have to go through:
> > >
> > > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> > > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> > >
> > > Enabling DEBUG logs in 'conf/log4j.properties' will also give you
> > > clues about the problem. The logs are written to 'logs/hadoop.log'.
> > > To enable DEBUG logs for a particular package, say, the httpclient
> > > package, you can open 'conf/log4j.properties' and add the following
> > > line:
> > >
> > > log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
> > >
> > > Regards,
> > > Susam Pal
> > >
> > > On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > > > Good morning,
> > > >
> > > > Can you please point me to the Nutch documentation that describes
> > > > how Nutch connects to webpages when it crawls? I think it is
> > > > through HTTP, but I would like to confirm and get more details so
> > > > I can write a very small test Java program that connects to one
> > > > of the webpages I am having trouble connecting to / crawling. I
> > > > bought Lucene in Action and am halfway through the book, and so
> > > > far there is very little about Nutch.
> > > >
> > > > Thanks,
> > > > Ann Del Rio
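> > > To make the 'plugin.includes' suggestion above concrete: a
> > > hypothetical entry in conf/nutch-site.xml that selects
> > > protocol-httpclient (the plugin visible in the stack traces earlier
> > > in this thread) could look like the following. The rest of the
> > > value is illustrative only, not the stock default list:
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > > </property>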
