That proxy port does look a little suspicious.  I can't check my own
installation right now, so I can't tell you with certainty whether that proxy
port should be printed like that or whether it should be null as well.
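
If you want to rule the proxy out completely, you could also try setting the
proxy properties explicitly in nutch-site.xml rather than deleting them.  A
rough sketch only (the property names are the ones printed in your fetcher
log; I haven't verified the empty-value behaviour against 0.9's
nutch-default.xml, so treat it as a guess):

    <!-- hypothetical nutch-site.xml override: an empty http.proxy.host
         should mean "connect directly, no proxy" -->
    <property>
      <name>http.proxy.host</name>
      <value></value>
    </property>
    <property>
      <name>http.proxy.port</name>
      <value></value>
    </property>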

Not sure if we went through this already, but can you:

$ telnet v4 10000
GET /lib HTTP/1.0
(hit enter a few times here)

What happens?

 
Or:
curl http://v4:10000/lib

Or, if you have libwww:
GET -UsSed http://v4:10000/lib
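
And if none of those tools are handy on that box, here is a rough Java
equivalent of the telnet test - just a sketch, assuming the same v4:10000
host and port that show up in your logs:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.OutputStreamWriter;
    import java.io.Writer;
    import java.net.Socket;

    // Sends the same raw HTTP/1.0 request the telnet test would send
    // and dumps whatever the server returns.
    public class RawGetCheck {
        public static void main(String[] args) throws Exception {
            Socket socket = new Socket("v4", 10000);
            Writer out = new OutputStreamWriter(socket.getOutputStream(), "US-ASCII");
            out.write("GET /lib HTTP/1.0\r\n\r\n");
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "US-ASCII"));
            for (String line = in.readLine(); line != null; line = in.readLine()) {
                System.out.println(line);
            }
            socket.close();
        }
    }

If that prints an HTTP status line and some HTML, the server is reachable and
the problem is on the Nutch/client side.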

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, June 20, 2008 9:53:03 PM
> Subject: RE: how does nutch connect to urls internally?
> 
> Hi,
> 
> I do not have access to the other website's server, so I cannot see what is
> going on on that side. All I know is that when I open the website in a
> browser, I can see the pages and documentation, and their website is running
> fine.
> 
> Is there a way to tell Nutch NOT to use an http proxy host and port, since
> the server and the website I am crawling are on the same network segment? Or
> does Nutch ignore these parameters when they are null? I do see a difference
> in the log file when I put arbitrary values for the http proxy host and port
> in the nutch-site.xml file.
> 
> Log when Nutch uses the default host and port (i.e. when I delete the proxy
> host and port parameters from nutch-site.xml entirely):
> ----------------------------------------------------------------------------
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080620183434
> Fetcher: threads: 10
> fetching http://v4:10000/lib
> http.proxy.host = null                    <------------------------
> http.proxy.port = 8080                    <------------------------
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> java.net.SocketException: Connection reset          <------------------------
>     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
>     at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
>     at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
>     at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
>     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
>     at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
>     at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
>     at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
>     at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
>     at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080620183434]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080620183408
> LinkDb: adding segment: crawl/segments/20080620183424
> LinkDb: adding segment: crawl/segments/20080620183434
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080620183408
> Indexer: adding segment: crawl/segments/20080620183424
> Indexer: adding segment: crawl/segments/20080620183434
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> 
> 
> 
> 
> Log when Nutch uses the host and port parameters from nutch-site.xml, where
> I just put in arbitrary values:
> ----------------------------------------------------------------------------
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080620184021
> Fetcher: threads: 10
> fetching http://iweb.corp.ebay.com/
> fetching http://v4:10000/lib
> http.proxy.host = v4                      <------------------------
> http.proxy.port = 10000                   <------------------------
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> http.proxy.host = v4
> http.proxy.port = 10000
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> java.net.SocketException: Connection reset          <------------------------
> java.net.SocketException: Connection reset          <------------------------
> (the two fetcher threads print the same stack trace; it is shown once below)
>     at java.net.SocketInputStream.read(SocketInputStream.java:168)
>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
>     at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
>     at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
>     at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
>     at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
>     at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
>     at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
>     at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
>     at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
>     at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
>     at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
>     at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
>     at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
>     at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> fetch of http://iweb.corp.ebay.com/ failed with: java.net.SocketException: Connection reset
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080620184021]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080620184000
> LinkDb: adding segment: crawl/segments/20080620184010
> LinkDb: adding segment: crawl/segments/20080620184021
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080620184000
> Indexer: adding segment: crawl/segments/20080620184010
> Indexer: adding segment: crawl/segments/20080620184021
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>     at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
> 
> 
> 
> Thanks, 
> Ann Del Rio
> 
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED] 
> Sent: Thursday, June 19, 2008 10:54 PM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
> 
> Hi Ann,
> 
> Regarding frames - that is not the problem here (with Nutch), as Nutch does
> not even seem to be able to connect to your server; it never gets as far as
> seeing the HTML and the frames in it.  Perhaps there is something useful in
> the logs on that v4 server, rather than on the Nutch side.
> 
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: "Del Rio, Ann" 
> > To: [email protected]
> > Sent: Thursday, June 19, 2008 6:54:15 PM
> > Subject: RE: how does nutch connect to urls internally?
> > 
> > 
> > Hello,
> > 
> > I tried this simple JUnit program before trying the Nutch HTTP classes:
> > 
> >     import java.io.BufferedInputStream;
> >     import java.io.StringWriter;
> >     import java.net.URL;
> > 
> >     import junit.framework.TestCase;
> > 
> >     // Fetch the same URL Nutch fails on, using plain java.net.URL
> >     public class BinDoxTest extends TestCase {
> >         public void testHttp() {
> >             try {
> >                 URL url = new URL("http://v4:10000/lib");
> >                 StringWriter writer = new StringWriter();
> >                 BufferedInputStream in = new BufferedInputStream(url.openStream());
> >                 for (int c = in.read(); c != -1; c = in.read()) {
> >                     writer.write(c);
> >                 }
> >                 in.close();
> >                 System.out.println(writer);
> >             } catch (Exception e) {
> >                 e.printStackTrace();
> >                 fail(e.toString());
> >             }
> >         }
> >     }
> > 
> > And I got the following output, which is the same as what I get if I issue
> > a wget from the Linux shell:
> > 
> > 
> > "http://www.w3.org/TR/html4/loose.dtd";>
> > 
> > 
> > 
> Bindox Library> 
> > 
> > href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico"
> > type="image/vnd.microsoft.icon">
> > 
> > 
> > 
> > 
> > 
> > border="4"  frameborder="1"   scrolling="no">
> > 
> >        
> > marginwidth="0" marginheight="0" scrolling="no" frameborder="1"
> > resize=yes>
> > 
> >        
> > src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp'
> > marginwidth="0" marginheight="0" scrolling="no" frameborder="0"
> > resize=yes>
> > 
> > 
> > 
> > 
> > 
> > Can you please help me understand whether there is something funky with
> > this starting page of the website that would cause Nutch to fail with a
> > "SocketException: Connection reset" when I start indexing from it? Can
> > Nutch index frames?
> > 
> > I will try HTTP directly next, as our network admin said it might be an
> > issue with VMware freezing or timing out on HTTP/1.0 requests but not on
> > HTTP/1.1.
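> > 
> > For reference, here is a rough sketch of such a test, using the same
> > commons-httpclient 3.x library that the protocol-httpclient plugin uses
> > (the class names are taken from the stack trace above; forcing HTTP/1.0 is
> > only an assumption, to check the admin's theory):
> > 
> >     import org.apache.commons.httpclient.HttpClient;
> >     import org.apache.commons.httpclient.HttpVersion;
> >     import org.apache.commons.httpclient.methods.GetMethod;
> > 
> >     // Minimal commons-httpclient 3.x test against the same URL,
> >     // forcing HTTP/1.0 to see whether the protocol version matters.
> >     public class HttpClientCheck {
> >         public static void main(String[] args) throws Exception {
> >             HttpClient client = new HttpClient();
> >             GetMethod get = new GetMethod("http://v4:10000/lib");
> >             get.getParams().setVersion(HttpVersion.HTTP_1_0);
> >             try {
> >                 int status = client.executeMethod(get);
> >                 System.out.println("Status: " + status);
> >                 System.out.println(get.getResponseBodyAsString());
> >             } finally {
> >                 get.releaseConnection();
> >             }
> >         }
> >     }
> > 
> > If this reproduces the reset while the plain java.net.URL test above
> > works, the difference is on the client side rather than on the server.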
> > 
> > Thanks,
> > Ann Del Rio
> > 
> > -----Original Message-----
> > From: Susam Pal [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 16, 2008 9:48 AM
> > To: [email protected]
> > Subject: Re: how does nutch connect to urls internally?
> > 
> > Hi,
> > 
> > It depends on which protocol plugin is enabled in your
> > 'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
> > in the XML file. If it is not present in 'conf/nutch-site.xml', it
> > means you are using the default 'plugin.includes' from
> > 'conf/nutch-default.xml'.
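> > 
> > As an illustration only (the value below is abbreviated and made up; copy
> > the real default from 'conf/nutch-default.xml' and change just the
> > protocol-* entry), an override in 'conf/nutch-site.xml' looks roughly like:
> > 
> >     <!-- sketch: switch the fetch protocol plugin to protocol-httpclient -->
> >     <property>
> >       <name>plugin.includes</name>
> >       <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)</value>
> >     </property>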
> > 
> > If protocol-http is enabled, then you have to go through the code in:
> > 
> > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
> > 
> > If protocol-httpclient is enabled, then you have to go through:
> > 
> > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> > 
> > Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
> > about the problem. The logs are written to 'logs/hadoop.log'. To enable
> > DEBUG logs for a particular package, say, the httpclient package, you can
> > open 'conf/log4j.properties' and add the following line:
> > 
> > log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
> > 
> > Regards,
> > Susam Pal
> > 
> > On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > > Good morning,
> > >
> > > Can you please point me to Nutch documentation where I can find out how
> > > Nutch connects to web pages when it crawls? I think it is through HTTP,
> > > but I would like to confirm and get more details, so that I can write a
> > > very small test Java program to connect to one of the web pages I am
> > > having trouble connecting to / crawling. I bought Lucene in Action and I
> > > am halfway through the book, and so far there is very little about Nutch.
> > >
> > > Thanks,
> > > Ann Del Rio
