1. Should I ask our network folks to enable the telnet service on the v4 box?
Nutch uses HTTP; does Nutch use telnet? I ask because I tried the following:
-bash-2.05b$ telnet v4 10000
Trying 10.254.231.40...
Connected to vm-v4dev01.arch.ebay.com (10.254.231.40).
Escape character is '^]'.
Connection closed by foreign host.
2. On the other hand, I was able to do this and got the following:
-bash-2.05b$ curl http://v4:10000/lib
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bindox Library</title>
<link rel="icon" href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico" type="image/vnd.microsoft.icon">
<script language="JavaScript">
function topicLoaded(href, title) {
ContentFrame.ContentToolbarFrame.setTitle(title);
}
var maximizeListeners=new Object();
function registerMaximizeListener(name, listener){
maximizeListeners[name]=listener;
}
function notifyMaximizeListeners(name, maximizedNotRestored){
maximizeListeners[name](maximizedNotRestored);
}
var leftCols = "29.5%";
var rightCols = "70.5%";
// called from *Toolbar pages
function toggleFrame(title)
{
var frameset = document.getElementById("BindoxFrameset");
var navFrameSize = frameset.getAttribute("cols");
var comma = navFrameSize.indexOf(',');
var left = navFrameSize.substring(0,comma);
var right = navFrameSize.substring(comma+1);
if (left == "*" || right == "*") {
// restore frames
frameset.frameSpacing="3";
frameset.setAttribute("border", "6");
frameset.setAttribute("cols", leftCols+","+rightCols);
notifyMaximizeListeners(title, false);
} else {
// the "cols" attribute is not always accurate, especially after resizing.
// offsetWidth is also not accurate, so we do a combination of both and
// should get a reasonable behavior
var leftSize = NavFrame.document.body.offsetWidth;
var rightSize = ContentFrame.document.body.offsetWidth;
leftCols = leftSize * 100 / (leftSize + rightSize);
rightCols = 100 - leftCols;
// maximize the frame.
//leftCols = left;
//rightCols = right;
if (title == "Contents") // this is the content toolbar
frameset.setAttribute("cols", "*,100%");
else // this is the left side for left-to-right rendering
frameset.setAttribute("cols", "100%,*");
frameset.frameSpacing="0";
frameset.setAttribute("border", "1");
notifyMaximizeListeners(title, true);
}
}
</script>
</head>
<frameset id="BindoxFrameset" cols="29.5%,70.5%" framespacing="4" border="4" frameborder="1" scrolling="no">
<frame class="nav" name="NavFrame" title="Layout frame: NavFrame" src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null' marginwidth="0" marginheight="0" scrolling="no" frameborder="1" resize=yes>
<frame class="content" name="ContentFrame" title="Layout frame: ContentFrame" src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" resize=yes>
</frameset>
</HTML>
Thanks,
Ann Del Rio
-----Original Message-----
From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
Sent: Friday, June 20, 2008 10:55 PM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?
That proxy port does look a little suspicious. I can't check my
installation to tell you with certainty if that proxy port should be
printed like that or should be null, too.
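If no proxy should be used at all, it may be worth checking what nutch-site.xml ends up with for those two properties — left empty rather than set to stray values. A sketch of the relevant fragment (property names as they appear in nutch-default.xml; please verify against your Nutch version):

```xml
<!-- nutch-site.xml: blank proxy settings so the fetcher connects directly -->
<property>
  <name>http.proxy.host</name>
  <value></value>
</property>
<property>
  <name>http.proxy.port</name>
  <value></value>
</property>
```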
Not sure if we went through this already, but can you:
$ telnet v4 10000
GET /lib HTTP/1.0
(hit enter a few times here)
What happens?
Or:
curl http://v4:10000/lib ?
Or, if you have libwww:
GET -UsSed http://v4:10000/lib
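All three of the above send essentially the same raw request. For reference, a non-interactive version of the telnet probe might look like this (a sketch; it assumes nc, i.e. netcat, is installed on the box):

```shell
#!/bin/bash
# Raw HTTP/1.0 request, built the same way the interactive telnet session types it.
REQ=$'GET /lib HTTP/1.0\r\n\r\n'

# Show the exact bytes that would go over the wire.
printf '%s' "$REQ"

# To actually send it, uncomment once v4:10000 is reachable from this machine:
# printf '%s' "$REQ" | nc v4 10000
```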
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, June 20, 2008 9:53:03 PM
> Subject: RE: how does nutch connect to urls internally?
>
> Hi,
>
> I do not have access on the other website's server so I could not see
> what is going on the other side, all I know is that when I run the
> website on a browser, I see the pages and documentation and their
> website is running fine.
>
> Is there a way to tell Nutch NOT to use the http host and port because
> the server and website I am crawling are on the same network segment?
> Or does Nutch ignore these parameters when they are null? I found a
> difference in the log file when I place just "any" http host and port
> in the nutch-site.xml file.
>
> Log when Nutch uses the default host and port, i.e. when I entirely
> delete the host and port parameters from nutch-site.xml:
> --------------------------------------------------------------------------------
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080620183434
> Fetcher: threads: 10
> fetching http://v4:10000/lib
> http.proxy.host = null          <------------------------
> http.proxy.port = 8080          <------------------------
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> java.net.SocketException: Connection reset          <------------------------
> at java.net.SocketInputStream.read(SocketInputStream.java:168)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> fetch of http://v4:10000/lib failed with: java.net.SocketException:
> Connection reset
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080620183434]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080620183408
> LinkDb: adding segment: crawl/segments/20080620183424
> LinkDb: adding segment: crawl/segments/20080620183434
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080620183408
> Indexer: adding segment: crawl/segments/20080620183424
> Indexer: adding segment: crawl/segments/20080620183434
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>
>
> Log when Nutch uses the host and port parameters from nutch-site.xml,
> where I just placed arbitrary values:
> --------------------------------------------------------------------------------
> Fetcher: starting
> Fetcher: segment: crawl/segments/20080620184021
> Fetcher: threads: 10
> fetching http://iweb.corp.ebay.com/
> fetching http://v4:10000/lib
> http.proxy.host = v4            <------------------------
> http.proxy.port = 10000         <------------------------
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> http.proxy.host = v4
> http.proxy.port = 10000
> http.timeout = 300000
> http.content.limit = 262144
> http.agent = QuickSearch/Nutch-0.9 (khtan at ebay dot com)
> protocol.plugin.check.blocking = true
> protocol.plugin.check.robots = true
> fetcher.server.delay = 1000
> http.max.delays = 100
> Configured Client
> java.net.SocketException: Connection reset          <------------------------
> java.net.SocketException: Connection reset          <------------------------
> at java.net.SocketInputStream.read(SocketInputStream.java:168)
> at java.net.SocketInputStream.read(SocketInputStream.java:168)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> fetch of http://v4:10000/lib failed with: java.net.SocketException:
> Connection reset
> fetch of http://iweb.corp.ebay.com/ failed with:
> java.net.SocketException: Connection reset
> Fetcher: done
> CrawlDb update: starting
> CrawlDb update: db: crawl/crawldb
> CrawlDb update: segments: [crawl/segments/20080620184021]
> CrawlDb update: additions allowed: true
> CrawlDb update: URL normalizing: true
> CrawlDb update: URL filtering: true
> CrawlDb update: Merging segment data into db.
> CrawlDb update: done
> LinkDb: starting
> LinkDb: linkdb: crawl/linkdb
> LinkDb: URL normalize: true
> LinkDb: URL filter: true
> LinkDb: adding segment: crawl/segments/20080620184000
> LinkDb: adding segment: crawl/segments/20080620184010
> LinkDb: adding segment: crawl/segments/20080620184021
> LinkDb: done
> Indexer: starting
> Indexer: linkdb: crawl/linkdb
> Indexer: adding segment: crawl/segments/20080620184000
> Indexer: adding segment: crawl/segments/20080620184010
> Indexer: adding segment: crawl/segments/20080620184021
> Optimizing index.
> Indexer: done
> Dedup: starting
> Dedup: adding indexes in: crawl/indexes
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
> at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:135)
>
>
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> Sent: Thursday, June 19, 2008 10:54 PM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
>
> Hi Ann,
>
> Regarding frames - this is not the problem here (with Nutch), as Nutch
> doesn't even seem to be able to connect to your server. It never gets
> to see the HTML and frames in it. Perhaps there is something useful
> in the logs not on the Nutch side, but on that v4 server.
>
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
> ----- Original Message ----
> > From: "Del Rio, Ann"
> > To: [email protected]
> > Sent: Thursday, June 19, 2008 6:54:15 PM
> > Subject: RE: how does nutch connect to urls internally?
> >
> >
> > Hello,
> >
> > I tried this simple JUnit program before trying the Nutch classes
> > for HTTP:
> >
> > import java.io.BufferedInputStream;
> > import java.io.StringWriter;
> > import java.net.URL;
> > import junit.framework.TestCase;
> >
> > public class BinDoxTest extends TestCase {
> >     public void testHttp() {
> >         try {
> >             URL url = new URL("http://v4:10000/lib");
> >             StringWriter writer = new StringWriter();
> >             BufferedInputStream in = new BufferedInputStream(url.openStream());
> >             for (int c = in.read(); c != -1; c = in.read()) {
> >                 writer.write(c);
> >             }
> >             in.close();
> >             System.out.println(writer);
> >         } catch (Exception e) {
> >             e.printStackTrace(); // surface connection errors instead of swallowing them
> >         }
> >     }
> > }
> >
> > And I got the following output, which is the same as what I get from
> > wget in a Linux shell.
> >
> >
> > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
> > <HTML>
> > <head>
> > <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> > <title>Bindox Library</title>
> > <link rel="icon" href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico" type="image/vnd.microsoft.icon">
> > </head>
> > <frameset id="BindoxFrameset" cols="29.5%,70.5%" framespacing="4" border="4" frameborder="1" scrolling="no">
> > <frame class="nav" name="NavFrame" title="Layout frame: NavFrame" src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null' marginwidth="0" marginheight="0" scrolling="no" frameborder="1" resize=yes>
> > <frame class="content" name="ContentFrame" title="Layout frame: ContentFrame" src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp' marginwidth="0" marginheight="0" scrolling="no" frameborder="0" resize=yes>
> > </frameset>
> > </HTML>
> >
> > Can you please help me figure out whether there is something funky
> > with the starting page of this website, given that Nutch gives me a
> > "SocketException: Connection reset" error when I start indexing from
> > the page above? Can Nutch index "frames"?
> >
> > I will try HTTP next, as our network admin said it might be an issue
> > with VMware freezing or timing out for HTTP/1.0 but not HTTP/1.1.
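To compare the two protocol versions directly, a raw-socket probe could be used. This is only a sketch, not Nutch code; the host, port, and path are the ones from this thread, and the class name is made up for illustration:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

public class HttpVersionProbe {

    // Build a minimal GET request for the given HTTP version ("1.0" or "1.1").
    static String buildRequest(String path, String host, String version) {
        StringBuilder req = new StringBuilder();
        req.append("GET ").append(path).append(" HTTP/").append(version).append("\r\n");
        req.append("Host: ").append(host).append("\r\n");
        if ("1.1".equals(version)) {
            // HTTP/1.1 defaults to keep-alive; ask the server to close the
            // connection so the read loop below terminates.
            req.append("Connection: close\r\n");
        }
        req.append("\r\n");
        return req.toString();
    }

    // Send the request over a raw socket and return whatever comes back.
    static String probe(String host, int port, String version) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest("/lib", host + ":" + port, version).getBytes("US-ASCII"));
            out.flush();
            InputStream in = socket.getInputStream();
            StringBuilder response = new StringBuilder();
            for (int c = in.read(); c != -1; c = in.read()) {
                response.append((char) c);
            }
            return response.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        // Show the exact HTTP/1.0 request bytes.
        System.out.println(buildRequest("/lib", "v4:10000", "1.0"));
        // Uncomment when v4:10000 is reachable from your machine:
        // System.out.println(probe("v4", 10000, "1.0"));
        // System.out.println(probe("v4", 10000, "1.1"));
    }
}
```

If the 1.1 probe returns a status line while the 1.0 probe gets the connection reset, that would support the network admin's theory.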
> >
> > Thanks,
> > Ann Del Rio
> >
> > -----Original Message-----
> > From: Susam Pal [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 16, 2008 9:48 AM
> > To: [email protected]
> > Subject: Re: how does nutch connect to urls internally?
> >
> > Hi,
> >
> > It depends on which protocol plugin is enabled in your
> > 'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
> > in the XML file. If this is not present in 'conf/nutch-site.xml', it
> > means you are using the default 'plugin.includes' of
> > 'conf/nutch-default.xml'.
> >
> > If protocol-http is enabled, then you have to go through the code
> > in:-
> >
> >
> > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> > src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
> >
> > If protocol-httpclient is enabled, then you have to go through:-
> >
> > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> > src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> >
> > Enabling DEBUG logs in 'conf/log4j.properties' will also give you
> > clues about the problems. The logs are written to 'logs/hadoop.log'.
> > To enable the DEBUG logs for a particular package, say, the httpclient
> > package, you can open 'conf/log4j.properties' and add the following
> > line:
> >
> > log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
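For completeness, a fragment covering both protocol plugins discussed above would look like this; keep only the line for the plugin you actually have enabled:

```properties
# conf/log4j.properties: verbose logging for the fetch protocol plugins
log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
```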
> >
> > Regards,
> > Susam Pal
> >
> > On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > > Good morning,
> > >
> > > Can you please point me to the Nutch documentation where I can find
> > > how Nutch connects to webpages when it crawls? I think it is through
> > > HTTP, but I would like to confirm and get more details so I can
> > > write a very small test Java program to connect to one of the
> > > webpages I am having trouble connecting to / crawling. I bought
> > > Lucene in Action and am halfway through the book, and so far there
> > > is very little about Nutch.
> > >
> > > Thanks,
> > > Ann Del Rio