The "java.net.SocketException: Connection reset" in that stack trace tells
you and your network people that this is a networking problem, not a
parsing one.
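
If it helps to narrow this down, here is a minimal sketch (Python, standard
library only) that sends one plain HTTP GET over a raw socket and reports
whether the server answered, closed the connection quietly, or reset it.
The function name probe, the default User-Agent string, and the result
labels are all made up for illustration; nothing here is part of Nutch.

```python
import socket


def probe(host, port, path="/", user_agent="probe/0.1", timeout=10.0):
    """Send one HTTP/1.0 GET over a raw TCP socket and classify the outcome.

    Returns the HTTP status line on success, otherwise a short label
    describing how the attempt failed. The labels are illustrative only.
    """
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}:{port}\r\n"
        f"User-Agent: {user_agent}\r\n"
        "\r\n"
    ).encode("ascii")

    # Step 1: TCP connect. A failure here (DNS, refused, filtered) is a
    # different problem from a reset that arrives after the request is sent.
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError as exc:
        return f"connect failed: {exc}"

    # Step 2: send the request and wait for the first bytes of the response.
    try:
        with sock:
            sock.sendall(request)
            data = sock.recv(4096)
    except ConnectionResetError:
        # The peer (or a device in between) sent a TCP RST.
        return "connection reset"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"io error: {exc}"

    if not data:
        # Orderly close with no bytes sent back.
        return "closed without response"

    # First line of the response, e.g. "HTTP/1.0 200 OK".
    return data.split(b"\r\n", 1)[0].decode("latin-1")
```

For example, probe("v4", 10000, "/lib") repeats roughly what wget did, and
passing user_agent= with the agent string your crawler is configured to send
would show whether the server, or something in between, treats crawler
traffic differently. That is only a guess at the cause, so your network
people are still the ones to confirm it.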

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 5:37:48 PM
> Subject: RE: Indexing XML-based document format per DITA standard
> 
> 
> Yes, I can reproduce it, and it happens every time.
> 
> Apparently, it only happens with this website, which is why I was wondering
> if it has something to do with the way the pages are formatted or fetched.
> All the other internal websites that I am crawling are fine, the
> difference is that the other URLs do not have port numbers and are more
> of static pages instead of a DITA framework that fetches and redirects
> the pages from a servlet.
> 
> At the same time, I am also checking with network security if it is a
> firewall issue or a port that they need to open for crawler-type
> traffic.
> 
> Thanks,
> Ann Del Rio
> 
> 
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
> Sent: Friday, May 30, 2008 2:16 PM
> To: [email protected]
> Subject: Re: Indexing XML-based document format per DITA standard
> 
> It looks like you can indeed connect to that v4 machine from the machine
> running Nutch.  I can't tell from here why you got the error you
> originally reported.  Does it happen every time you try running Nutch?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: "Del Rio, Ann" 
> > To: [email protected]
> > Sent: Friday, May 30, 2008 3:23:00 PM
> > Subject: RE: Indexing XML-based document format per DITA standard
> > 
> > Thank you for your response and help Otis!
> > I greatly appreciate it and am sure others will.
> > 
> > 
> > I did a wget from the machine where I was running Nutch and got the 
> > following...
> > 
> > -bash-2.05b$ wget http://v4:10000/lib
> > --10:37:52--  http://v4:10000/lib
> >            => `lib.1'
> > Resolving v4... done.
> > Connecting to v4:10000... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 2,717 [text/html]
> > 100%[====================================>] 2,717          2.59M/s
> > ETA 00:00
> > 10:37:52 (2.59 MB/s) - `lib.1' saved [2717/2717]
> > 
> > Then I tried to telnet too and got a connection closed.
> > 
> > -bash-2.05b$ telnet
> > telnet> open
> > (to) v4 10000
> > Trying xxx.xxx.231.40...
> > Connected to xxxx.ebay.com (xxx.xxx.231.40).
> > Escape character is '^]'.
> > Connection closed by foreign host.
> > 
> > Doesn't telnet service/ports need to be enabled on the other end's 
> > server first before we can telnet to it? Does the nutch crawler use 
> > telnet to fetch the URL?
> > 
> > Apparently, we do not use proxy hosts and ports here at eBay in any of
> > our APIs, so I am not sure how to get those. But I will still ask 
> > around if they know what proxy hosts and ports we are using.
> > 
> > Also, when I browse the URL it is fine, so I checked my IE browser 
> > options and looked in the LAN Settings for a proxy address and port, 
> > and we are not using any there either.
> > 
> > 
> > Thanks,
> > Ann Del Rio
> > 
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 30, 2008 10:17 AM
> > To: [email protected]
> > Subject: Re: Indexing XML-based document format per DITA standard
> > 
> > Can you connect to it (telnet to it, for example) directly from the
> > machine(s) where you are running Nutch?
> > (this is a network issue, nothing to do with XML/parsing)
> > 
> > 
> > Maybe you need to go through some eBay proxy?
> > 
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> > 
> > 
> > ----- Original Message ----
> > > From: "Del Rio, Ann" 
> > > To: [email protected]
> > > Sent: Friday, May 30, 2008 6:24:01 PM
> > > Subject: Indexing XML-based document format per DITA standard
> > > 
> > > > I added a new URL to index which is in an XML-based document format
> > > > per the DITA standard, and I get the following error.
> > > 
> > > java.net.SocketException: Connection reset
> > > > 2008-05-27 17:56:58 ERROR Http                 at java.net.SocketInputStream.read(SocketInputStream.java:168)
> > > > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> > > > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > > > 2008-05-27 17:56:58 INFO  Fetcher              fetch of http://v4:10000/lib   failed with: java.net.SocketException: Connection reset
> > > 
> > > i googled and found no solution so far...
> > > 
> > > do i need to setup some config / host file to specify the ports?
> > > the URL is an internal website.
> > > 
> > > any response will be appreciated.
> > > 
> > > Thanks,
> > > Ann Del Rio
> > > Senior Developer
> > > eBay, Inc
