Yes, I can reproduce it, and it happens every time.

Apparently it only happens with this website, which is why I was
wondering whether it has something to do with the way the pages are
formatted or fetched. All the other internal websites that I am
crawling are fine. The difference is that the other URLs do not have
port numbers and are mostly static pages, whereas this one uses a DITA
framework that fetches and redirects the pages from a servlet.

At the same time, I am also checking with network security whether it
is a firewall issue or a port that they need to open for crawler-type
traffic.
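
Since wget gets a 200 OK from the same machine while the Nutch fetch is
reset, one way to narrow this down outside Nutch is to send a bare HTTP
request over a plain socket and vary only the request headers. The
sketch below is a hypothetical standalone probe, not part of Nutch: the
class name, the method names, and the two User-Agent strings are
assumptions made up for illustration. Only the host v4, port 10000, and
path /lib come from this thread.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical probe class; nothing here is Nutch or HttpClient code.
class RawHttpProbe {

    // Builds a minimal HTTP/1.0 GET request with a caller-supplied User-Agent.
    static String buildRequest(String host, int port, String path, String userAgent) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + ":" + port + "\r\n"
             + "User-Agent: " + userAgent + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";
    }

    // Sends the request and returns the first response line (the status line).
    // If the server drops the connection, this throws an IOException such as
    // java.net.SocketException: Connection reset.
    static String fetchStatusLine(String host, int port, String path, String userAgent)
            throws IOException {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 10_000); // 10 s connect timeout
            socket.setSoTimeout(10_000);                               // 10 s read timeout
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, port, path, userAgent)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            StringBuilder line = new StringBuilder();
            int b;
            while ((b = in.read()) != -1 && b != '\n') {
                if (b != '\r') {
                    line.append((char) b);
                }
            }
            return line.toString();
        }
    }

    public static void main(String[] args) {
        // Assumed agent strings: one wget-like, one crawler-like.
        String[] agents = {"Wget/1.8", "crawler-probe"};
        for (String agent : agents) {
            try {
                System.out.println(agent + " -> " + fetchStatusLine("v4", 10000, "/lib", agent));
            } catch (IOException e) {
                System.out.println(agent + " -> " + e);
            }
        }
    }
}
```

If one agent string gets a status line and the other gets a reset, that
would point at the servlet or something in front of it filtering on
request headers rather than at a closed port. If both succeed, the
difference is more likely in how Nutch itself builds the request.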

Thanks,
Ann Del Rio


-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] 
Sent: Friday, May 30, 2008 2:16 PM
To: [email protected]
Subject: Re: Indexing XML-based document format per DITA standard

It looks like you can indeed connect to that v4 machine from the machine
running Nutch.  I can't tell from here why you got the error you
originally reported.  Does it happen every time you try running Nutch?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 3:23:00 PM
> Subject: RE: Indexing XML-based document format per DITA standard
> 
> Thank you for your response and help Otis!
> I greatly appreciate it and am sure others will.
> 
> 
> I did a wget from the machine where I was running Nutch and got the 
> following...
> 
> -bash-2.05b$ wget http://v4:10000/lib
> --10:37:52--  http://v4:10000/lib
>            => `lib.1'
> Resolving v4... done.
> Connecting to v4:10000... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 2,717 [text/html]
> 100%[====================================>] 2,717          2.59M/s    ETA 00:00
> 10:37:52 (2.59 MB/s) - `lib.1' saved [2717/2717]
> 
> Then I tried telnet as well, and the connection was closed.
> 
> -bash-2.05b$ telnet
> telnet> open
> (to) v4 10000
> Trying xxx.xxx.231.40...
> Connected to xxxx.ebay.com (xxx.xxx.231.40).
> Escape character is '^]'.
> Connection closed by foreign host.
> 
> Doesn't telnet service/ports need to be enabled on the other end's 
> server first before we can telnet to it? Does the nutch crawler use 
> telnet to fetch the URL?
> 
> Apparently, we do not use proxy hosts and ports here at eBay in any of
> our APIs, so I am not sure how to get those. But I will still ask 
> around if they know what proxy hosts and ports we are using.
> 
> Also, when I browse the URL it is fine, so I checked my IE browser 
> options and checked on the LAN Settings to look for the proxy address 
> and port and we are not using any as well.
> 
> 
> Thanks,
> Ann Del Rio
> 
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 30, 2008 10:17 AM
> To: [email protected]
> Subject: Re: Indexing XML-based document format per DITA standard
> 
> Can you connect to it (telnet to it, for example) directly from the
> machine(s) where you are running Nutch?
> (this is a network issue, nothing to do with XML/parsing)
> 
> 
> Maybe you need to go through some eBay proxy?
> 
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> 
> ----- Original Message ----
> > From: "Del Rio, Ann" 
> > To: [email protected]
> > Sent: Friday, May 30, 2008 6:24:01 PM
> > Subject: Indexing XML-based document format per DITA standard
> > 
> > I added a new URL to index which is in an XML-based document format
> > per the DITA standard, and I get the following error.
> > 
> > java.net.SocketException: Connection reset
> > 2008-05-27 17:56:58 ERROR Http                 at java.net.SocketInputStream.read(SocketInputStream.java:168)
> > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > 2008-05-27 17:56:58 ERROR Http                 at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> > 2008-05-27 17:56:58 ERROR Http                 at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > 2008-05-27 17:56:58 INFO  Fetcher              fetch of http://v4:10000/lib   failed with: java.net.SocketException: Connection reset
> > 
> > I googled and have found no solution so far.
> > 
> > Do I need to set up some config or hosts file to specify the ports?
> > The URL is an internal website.
> > 
> > Any response will be appreciated.
> > 
> > Thanks,
> > Ann Del Rio
> > Senior Developer
> > eBay, Inc
