Hi Ann,

No, the v4 box doesn't have to support telnet. You are connecting to port 10000 on v4, which is supposed to be some kind of HTTP server, and it looks like that server closes the connection on you. I assume that "Connection closed...." happens immediately, ja?

Curl either has something smart in it to keep this connection from closing, or v4 somehow knows who/what connected and likes curl more than telnet.
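One way to tell which of the two it is: a throwaway probe that speaks HTTP over a raw socket and sends a complete request, including a User-Agent header. This is only an untested sketch; the class name and the User-Agent value are made up for the experiment:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.Socket;

public class RawHttpProbe {
    public static void main(String[] args) throws Exception {
        // Connect the way telnet does, but then send a complete HTTP
        // request the way curl does, so the two cases can be separated.
        Socket socket = new Socket("v4", 10000);
        Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
        out.write("GET /lib HTTP/1.0\r\n");
        out.write("Host: v4:10000\r\n");
        // Made-up User-Agent value; vary or drop it to see if the server cares.
        out.write("User-Agent: curl/7.18.0\r\n");
        out.write("\r\n");
        out.flush();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(socket.getInputStream(), "US-ASCII"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        socket.close();
    }
}

If this prints a normal response, the server is simply dropping clients that sit idle without sending a request (which is what the bare telnet test does). If it still gets cut off, the server is inspecting the request itself.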
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Monday, June 23, 2008 12:30:49 PM
> Subject: RE: how does nutch connect to urls internally?
>
> 1. Should I ask our network folks to enable the telnet service on the
> v4 box? Nutch uses HTTP, but does it also use telnet? I ask because I
> tried the following:
>
> -bash-2.05b$ telnet v4 10000
> Trying 10.254.231.40...
> Connected to vm-v4dev01.arch.ebay.com (10.254.231.40).
> Escape character is '^]'.
> Connection closed by foreign host.
>
> 2. This, on the other hand, worked:
>
> -bash-2.05b$ curl http://v4:10000/lib
> [HTML output mangled by the list archive: an HTML 4 transitional page
> (loose.dtd) titled "Bindox Library", with a favicon link and a nested
> frameset whose frames load pages such as
> /com/ebay/content/sharedcontent/topic/ContentFrame.jsp]
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Friday, June 20, 2008 10:55 PM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
>
> That proxy port does look a little suspicious. I can't check my
> installation to tell you with certainty whether that proxy port should
> be printed like that or should be null, too.
>
> Not sure if we went through this already, but can you try:
>
> $ telnet v4 10000
> GET /lib HTTP/1.0
> (hit enter a few times here)
>
> What happens?
>
> Or:
> curl http://v4:10000/lib ?
>
> Or, if you have libwww:
> GET -UsSed http://v4:10000/lib
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: "Del Rio, Ann"
> > To: [email protected]
> > Sent: Friday, June 20, 2008 9:53:03 PM
> > Subject: RE: how does nutch connect to urls internally?
> >
> > Hi,
> >
> > I do not have access to the other website's server, so I cannot see
> > what is going on on that side; all I know is that when I open the
> > website in a browser, I see the pages and documentation, and the
> > website runs fine.
> >
> > Is there a way to tell Nutch NOT to use the HTTP proxy host and port,
> > since the server and the website I am crawling are on the same
> > network segment? Or does Nutch ignore these parameters when they are
> > null? I found a difference in the log file when I put just "any"
> > proxy host and port in the nutch-site.xml file.
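> > For reference, these are the two properties in question as they
> > would appear in conf/nutch-site.xml; the values here are only
> > placeholders, not a recommendation. Note that in the first log below
> > the port default (8080) is printed even though the host is null,
> > which suggests the port is ignored unless a proxy host is actually
> > set:
> >
> > <property>
> >   <name>http.proxy.host</name>
> >   <value>proxy.example.com</value>
> > </property>
> > <property>
> >   <name>http.proxy.port</name>
> >   <value>8080</value>
> > </property>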
> >
> > Log when Nutch uses the default host and port, i.e. when I entirely
> > delete the host and port parameters from nutch-site.xml:
> > ------------------------------------------------------------------------
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20080620183434
> > Fetcher: threads: 10
> > fetching http://v4:10000/lib
> > http.proxy.host = null          <------------------------
> > http.proxy.port = 8080          <------------------------
> > http.timeout = 300000
> > http.content.limit = 262144
> > http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> > protocol.plugin.check.blocking = true
> > protocol.plugin.check.robots = true
> > fetcher.server.delay = 1000
> > http.max.delays = 100
> > Configured Client
> > java.net.SocketException: Connection reset          <------------------------
> >   at java.net.SocketInputStream.read(SocketInputStream.java:168)
> >   at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> >   at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> >   at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> >   at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> >   at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> >   at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> >   at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> >   at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> >   at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> >   at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> >   at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> >   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> >   at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> >   at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> >   at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> >   at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> >   at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20080620183434]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20080620183408
> > LinkDb: adding segment: crawl/segments/20080620183424
> > LinkDb: adding segment: crawl/segments/20080620183434
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20080620183408
> > Indexer: adding segment: crawl/segments/20080620183424
> > Indexer: adding segment: crawl/segments/20080620183434
> > Optimizing index.
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Exception in thread "main" java.io.IOException: Job failed!
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >
> >
> > Log when Nutch uses the host and port parameters from nutch-site.xml,
> > where I just put anything:
> > ------------------------------------------------------------------------
> > Fetcher: starting
> > Fetcher: segment: crawl/segments/20080620184021
> > Fetcher: threads: 10
> > fetching http://iweb.corp.ebay.com/
> > fetching http://v4:10000/lib
> > http.proxy.host = v4          <------------------------
> > http.proxy.port = 10000          <------------------------
> > http.timeout = 300000
> > http.content.limit = 262144
> > http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> > protocol.plugin.check.blocking = true
> > protocol.plugin.check.robots = true
> > fetcher.server.delay = 1000
> > http.max.delays = 100
> > Configured Client
> > [the same settings and a second "Configured Client" line are printed
> > again by the second fetcher thread]
> > java.net.SocketException: Connection reset          <------------------------
> > java.net.SocketException: Connection reset          <------------------------
> > [two identical stack traces follow, interleaved line by line, one per
> > failed fetch; each is the same commons-httpclient trace shown in the
> > first log]
> > fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> > fetch of http://iweb.corp.ebay.com/ failed with: java.net.SocketException: Connection reset
> > Fetcher: done
> > CrawlDb update: starting
> > CrawlDb update: db: crawl/crawldb
> > CrawlDb update: segments: [crawl/segments/20080620184021]
> > CrawlDb update: additions allowed: true
> > CrawlDb update: URL normalizing: true
> > CrawlDb update: URL filtering: true
> > CrawlDb update: Merging segment data into db.
> > CrawlDb update: done
> > LinkDb: starting
> > LinkDb: linkdb: crawl/linkdb
> > LinkDb: URL normalize: true
> > LinkDb: URL filter: true
> > LinkDb: adding segment: crawl/segments/20080620184000
> > LinkDb: adding segment: crawl/segments/20080620184010
> > LinkDb: adding segment: crawl/segments/20080620184021
> > LinkDb: done
> > Indexer: starting
> > Indexer: linkdb: crawl/linkdb
> > Indexer: adding segment: crawl/segments/20080620184000
> > Indexer: adding segment: crawl/segments/20080620184010
> > Indexer: adding segment: crawl/segments/20080620184021
> > Optimizing index.
> > Indexer: done
> > Dedup: starting
> > Dedup: adding indexes in: crawl/indexes
> > Exception in thread "main" java.io.IOException: Job failed!
> >   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> >   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> >   at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> >
> > Thanks,
> > Ann Del Rio
> >
> > -----Original Message-----
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> > Sent: Thursday, June 19, 2008 10:54 PM
> > To: [email protected]
> > Subject: Re: how does nutch connect to urls internally?
> >
> > Hi Ann,
> >
> > Regarding frames - that is not the problem here (with Nutch), as
> > Nutch doesn't even seem to be able to connect to your server. It
> > never gets to see the HTML and the frames in it. Perhaps there is
> > something useful in the logs, not on the Nutch side, but on that v4
> > server.
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> >
> > ----- Original Message ----
> > > From: "Del Rio, Ann"
> > > To: [email protected]
> > > Sent: Thursday, June 19, 2008 6:54:15 PM
> > > Subject: RE: how does nutch connect to urls internally?
> > >
> > > Hello,
> > >
> > > Before trying the Nutch HTTP classes, I tried this simple JUnit
> > > program:
> > >
> > > import java.io.BufferedInputStream;
> > > import java.io.StringWriter;
> > > import java.net.URL;
> > > import junit.framework.TestCase;
> > >
> > > public class BinDoxTest extends TestCase {
> > >     public void testHttp() {
> > >         try {
> > >             URL url = new URL("http://v4:10000/lib");
> > >             StringWriter writer = new StringWriter();
> > >             BufferedInputStream in =
> > >                     new BufferedInputStream(url.openStream());
> > >             for (int c = in.read(); c != -1; c = in.read()) {
> > >                 writer.write(c);
> > >             }
> > >             System.out.println(writer);
> > >         } catch (Exception e) {
> > >             // Swallowing the exception would mean the test can
> > >             // never fail, so at least print it.
> > >             e.printStackTrace();
> > >         }
> > >     }
> > > }
> > >
> > > It produced the following output, which is the same as what I get
> > > from a wget in a Linux shell:
> > >
> > > [HTML output mangled by the list archive: the same HTML 4
> > > transitional "Bindox Library" frameset page shown above, with
> > > frames loading /com/ebay/content/sharedcontent/topic/ContentFrame.jsp]
> > >
> > > Can you please help shed some light on whether there is something
> > > funky with this starting page of the website, given that Nutch
> > > gives me a "SocketException: Connection reset" when it starts
> > > indexing from it? Can Nutch index "frames"?
> > >
> > > Next I will try the Nutch HTTP classes, as our network admin said
> > > it might be an issue with VMware freezing or timing out with HTTP
> > > 1.0 but not HTTP 1.1.
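> > > For what it's worth, the HTTP 1.0-vs-1.1 theory can be tested with
> > > the same commons-httpclient 3.x library that appears in the Nutch
> > > stack trace; if java.net.URL succeeds where this fails, the
> > > difference lies in the request HttpClient sends. This is an
> > > untested sketch (the class name is made up, and the HTTP/1.0 line
> > > is only the experiment, not something Nutch itself sets):
> > >
> > > import org.apache.commons.httpclient.HttpClient;
> > > import org.apache.commons.httpclient.HttpVersion;
> > > import org.apache.commons.httpclient.methods.GetMethod;
> > >
> > > public class HttpClientProbe {
> > >     public static void main(String[] args) throws Exception {
> > >         HttpClient client = new HttpClient();
> > >         GetMethod get = new GetMethod("http://v4:10000/lib");
> > >         // Force HTTP/1.0; comment this out to test HTTP/1.1,
> > >         // which is the commons-httpclient default.
> > >         get.getParams().setVersion(HttpVersion.HTTP_1_0);
> > >         try {
> > >             int status = client.executeMethod(get);
> > >             System.out.println("Status: " + status);
> > >             System.out.println(get.getResponseBodyAsString());
> > >         } finally {
> > >             get.releaseConnection();
> > >         }
> > >     }
> > > }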
> > > Thanks,
> > > Ann Del Rio
> > >
> > > -----Original Message-----
> > > From: Susam Pal [mailto:[EMAIL PROTECTED]
> > > Sent: Monday, June 16, 2008 9:48 AM
> > > To: [email protected]
> > > Subject: Re: how does nutch connect to urls internally?
> > >
> > > Hi,
> > >
> > > It depends on which protocol plugin is enabled in your
> > > 'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
> > > in the XML file. If it is not present in 'conf/nutch-site.xml', it
> > > means you are using the default 'plugin.includes' from
> > > 'conf/nutch-default.xml'.
> > >
> > > If protocol-http is enabled, then you have to go through the code in:
> > >
> > > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> > > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
> > >
> > > If protocol-httpclient is enabled, then you have to go through:
> > >
> > > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> > > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> > >
> > > Enabling DEBUG logs in 'conf/log4j.properties' will also give you
> > > clues about the problem. The logs are written to 'logs/hadoop.log'.
> > > To enable DEBUG logs for a particular package, say, the httpclient
> > > package, you can open 'conf/log4j.properties' and add the following
> > > line:
> > >
> > > log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
> > >
> > > Regards,
> > > Susam Pal
> > >
> > > On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > > > Good morning,
> > > >
> > > > Can you please point me to the Nutch documentation that describes
> > > > how Nutch connects to webpages when it crawls? I think it is
> > > > through HTTP, but I would like to confirm and get more details so
> > > > I can write a very small test Java program that connects to one
> > > > of the webpages I am having trouble connecting to / crawling. I
> > > > bought Lucene in Action and am halfway through the book, and so
> > > > far there is very little about Nutch.
> > > >
> > > > Thanks,
> > > > Ann Del Rio
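> > > To make the 'plugin.includes' suggestion above concrete: a
> > > hypothetical entry in conf/nutch-site.xml that selects
> > > protocol-httpclient (the plugin visible in the stack traces earlier
> > > in this thread) could look like the following. The rest of the
> > > value is illustrative only, not the stock default list:
> > >
> > > <property>
> > >   <name>plugin.includes</name>
> > >   <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> > > </property>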
