Hi, I do not have access to the other website's server, so I cannot see what is going on on that side. All I know is that when I open the website in a browser, the pages and documentation load and the site appears to be running fine.
Is there a way to tell Nutch NOT to use the http proxy host and port at all, since the machine I crawl from and the website I am crawling are on the same network segment? Or does Nutch already ignore these parameters when they are null? I did find a difference in the log file depending on whether I delete the host and port parameters from nutch-site.xml entirely or put just "any" host and port in there.
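For reference, this is roughly what I put into nutch-site.xml for the second run below; the values are just placeholders I made up (they happen to be the host and port of the site itself, not a real proxy), so treat this snippet as an illustration only:

<property>
  <name>http.proxy.host</name>
  <value>v4</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>10000</value>
</property>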
Log when Nutch uses the default host and port, i.e. when I entirely delete the host and port parameters from nutch-site.xml:
----------------------------------------------------------------------------------------------------------------

Fetcher: starting
Fetcher: segment: crawl/segments/20080620183434
Fetcher: threads: 10
fetching http://v4:10000/lib
http.proxy.host = null          <------------------------
http.proxy.port = 8080          <------------------------
http.timeout = 300000
http.content.limit = 262144
http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
java.net.SocketException: Connection reset          <------------------------
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080620183434]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080620183408
LinkDb: adding segment: crawl/segments/20080620183424
LinkDb: adding segment: crawl/segments/20080620183434
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080620183408
Indexer: adding segment: crawl/segments/20080620183424
Indexer: adding segment: crawl/segments/20080620183434
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)

Log when Nutch uses the host and port parameters from nutch-site.xml, where I just placed placeholder values:
----------------------------------------------------------------------------------------------------------------

Fetcher: starting
Fetcher: segment: crawl/segments/20080620184021
Fetcher: threads: 10
fetching http://iweb.corp.ebay.com/
fetching http://v4:10000/lib
http.proxy.host = v4            <------------------------
http.proxy.port = 10000         <------------------------
http.timeout = 300000
http.content.limit = 262144
http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
protocol.plugin.check.blocking = true
protocol.plugin.check.robots = true
fetcher.server.delay = 1000
http.max.delays = 100
Configured Client
(the same settings block and "Configured Client" line are printed a second time for the second fetcher thread)
java.net.SocketException: Connection reset          <------------------------
(this identical trace is printed twice, interleaved, once for each failing fetch)
        at java.net.SocketInputStream.read(SocketInputStream.java:168)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
        at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
        at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
        at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
        at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
        at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
        at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
        at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
        at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
        at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
        at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
        at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
fetch of http://iweb.corp.ebay.com/ failed with: java.net.SocketException: Connection reset
Fetcher: done
CrawlDb update: starting
CrawlDb update: db: crawl/crawldb
CrawlDb update: segments: [crawl/segments/20080620184021]
CrawlDb update: additions allowed: true
CrawlDb update: URL normalizing: true
CrawlDb update: URL filtering: true
CrawlDb update: Merging segment data into db.
CrawlDb update: done
LinkDb: starting
LinkDb: linkdb: crawl/linkdb
LinkDb: URL normalize: true
LinkDb: URL filter: true
LinkDb: adding segment: crawl/segments/20080620184000
LinkDb: adding segment: crawl/segments/20080620184010
LinkDb: adding segment: crawl/segments/20080620184021
LinkDb: done
Indexer: starting
Indexer: linkdb: crawl/linkdb
Indexer: adding segment: crawl/segments/20080620184000
Indexer: adding segment: crawl/segments/20080620184010
Indexer: adding segment: crawl/segments/20080620184021
Optimizing index.
Indexer: done
Dedup: starting
Dedup: adding indexes in: crawl/indexes
Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
        at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
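Since both runs fail the same way and the stack traces go through org.apache.commons.httpclient (i.e. the protocol-httpclient plugin), my next step is a small stand-alone test against the same URL with that same library, to see whether the connection reset happens outside Nutch as well. This is only a sketch I put together myself (the class name, timeouts and the commented-out HTTP/1.0 line are my own choices, not Nutch code):

import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.HttpVersion;
import org.apache.commons.httpclient.methods.GetMethod;

public class ConnectionResetProbe {
    public static void main(String[] args) throws Exception {
        // Same URL the fetcher chokes on; can be overridden on the command line.
        String url = args.length > 0 ? args[0] : "http://v4:10000/lib";

        HttpClient client = new HttpClient();
        // Mirror the generous timeout Nutch reports (http.timeout = 300000 ms).
        client.getHttpConnectionManager().getParams().setConnectionTimeout(300000);
        client.getHttpConnectionManager().getParams().setSoTimeout(300000);

        GetMethod get = new GetMethod(url);
        // Uncomment to force HTTP/1.0 and test the VMware HTTP 1.0 vs. 1.1 theory
        // mentioned further down in this thread:
        // get.getParams().setVersion(HttpVersion.HTTP_1_0);
        try {
            int status = client.executeMethod(get);
            System.out.println("Status: " + status);
            System.out.println(get.getResponseBodyAsString());
        } finally {
            get.releaseConnection();
        }
    }
}

If this stand-alone test also gets "Connection reset", the problem would seem to sit between this machine and the v4 server rather than in the Nutch configuration itself.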
Thanks,
Ann Del Rio

-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 19, 2008 10:54 PM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?

Hi Ann,

Regarding frames - this is not the problem here (with Nutch), as Nutch doesn't even seem to be able to connect to your server. It never gets to see the HTML and the frames in it. Perhaps there is something useful in the logs not on the Nutch side, but on that v4 server.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, June 19, 2008 6:54:15 PM
> Subject: RE: how does nutch connect to urls internally?
>
> Hello,
>
> I tried this simple JUnit program before I try the Nutch classes for HTTP:
>
> import java.io.BufferedInputStream;
> import java.io.StringWriter;
> import java.net.URL;
>
> import junit.framework.TestCase;
>
> public class BinDoxTest extends TestCase {
>     public void testHttp() {
>         try {
>             URL url = new URL("http://v4:10000/lib");
>             StringWriter writer = new StringWriter();
>             BufferedInputStream in = new BufferedInputStream(url.openStream());
>             for (int c = in.read(); c != -1; c = in.read()) {
>                 writer.write(c);
>             }
>             System.out.println(writer);
>         } catch (Exception e) {
>             // TODO: handle exception
>         }
>     }
> }
>
> And I got the following output, which is the same as what I get if I issue a wget
> from a Linux shell (it is a frameset page; the opening tags were stripped by the
> mail archive, only the attribute fragments survive):
>
> "http://www.w3.org/TR/html4/loose.dtd">
> Bindox Library>
> href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico" type="image/vnd.microsoft.icon">
> border="4" frameborder="1" scrolling="no">
> marginwidth="0" marginheight="0" scrolling="no" frameborder="1" resize=yes>
> src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" resize=yes>
>
> Can you please help me figure out whether there is something funky with this
> starting page of the website that makes Nutch give me a "SocketException:
> Connection reset" error when I run Nutch to start indexing from the page above?
> Can Nutch index "frames"?
>
> I will also try HTTP/1.1 next, as our network admin said it might be an issue
> with VMware freezing or timing out on HTTP 1.0 but not HTTP 1.1.
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: Susam Pal [mailto:[EMAIL PROTECTED]
> Sent: Monday, June 16, 2008 9:48 AM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
>
> Hi,
>
> It depends on which protocol plugin is enabled in your 'conf/nutch-site.xml'.
> The property to look for is 'plugin.includes' in the XML file. If this is not
> present in 'conf/nutch-site.xml', it means you are using the default
> 'plugin.includes' of 'conf/nutch-default.xml'.
>
> If protocol-http is enabled, then you have to go through the code in:
>
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
>
> If protocol-httpclient is enabled, then you have to go through:
>
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
>
> Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues about
> the problems. The logs are written to 'logs/hadoop.log'. To enable the DEBUG
> logs for a particular package, say, the httpclient package, you can open
> 'conf/log4j.properties' and add the following line:
>
> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
>
> Regards,
> Susam Pal
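Following up in-line on the plugin question: judging from the stack traces at the top of this mail (org.apache.nutch.protocol.httpclient.*), it is protocol-httpclient that is being used in my setup. For reference, the plugin.includes property would look something like the snippet below; I am only sketching it from the nutch-default.xml default with protocol-http swapped for protocol-httpclient, so the exact plugin list on a given install may differ:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(text|html|js)|index-basic|query-(basic|site|url)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>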
> On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > Good morning,
> >
> > Can you please point me to the Nutch documentation where I can find how
> > Nutch connects to the web pages when it crawls? I think it is through HTTP,
> > but I would like to confirm and get more details so I can write a very small
> > test Java program to connect to one of the web pages I am having trouble
> > connecting to / crawling. I bought Lucene in Action and am halfway through
> > the book, and so far there is very little about Nutch.
> >
> > Thanks,
> > Ann Del Rio
