The "java.net.SocketException: Connection reset" in that error tells you and your network people that this is a networking problem, not a parsing one.
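One way to narrow this down from the Nutch machine is to send the same kind of GET yourself and see how the server reacts. What follows is only a rough Python sketch, not anything Nutch does internally: probe_http and the default User-Agent value are made-up names for illustration, and v4:10000/lib is simply the URL from this thread. Since wget got a 200 OK while the crawler's connection was reset, trying a few different User-Agent strings might show whether something in between treats crawler-type traffic differently.

```python
import socket


def probe_http(host, port, path="/", user_agent="probe/0.1", timeout=10.0):
    """Send one plain HTTP GET and describe how the server reacted.

    Hypothetical helper for illustration only; it is not part of Nutch.
    """
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}:{port}\r\n"
        f"User-Agent: {user_agent}\r\n"
        "\r\n"
    ).encode("ascii")
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            data = sock.recv(4096)
    except ConnectionResetError:
        # The peer (or a device in between) answered with a TCP RST.
        return "connection reset by peer"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        # DNS failure, connection refused, unreachable network, etc.
        return f"connect/read failed: {exc}"
    if not data:
        # Orderly close (FIN) before any response bytes arrived.
        return "connection closed without a response"
    status_line = data.split(b"\r\n", 1)[0].decode("latin-1")
    return f"status line: {status_line}"


# Example call (host, port and path taken from the thread; the agent
# string is an assumption, use whatever your crawler is configured to send):
# print(probe_http("v4", 10000, "/lib", user_agent="my-crawler-agent"))
```

If the result changes from a status line to a reset when only the User-Agent changes, that is something concrete to hand to the network security people.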

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 5:37:48 PM
> Subject: RE: Indexing XML-based document format per DITA standard
>
> Yes, I can reproduce it and it happens everytime.
>
> Apparently, it only happens to this website, that is why I was wondering
> it has something to do with the way the pages are formatted or fetched.
> All the other internal websites that I am crawling are fine, the
> difference is that the other URLs do not have port numbers and are more
> of static pages instead of a DITA framework that fetches and redirects
> the pages from a servlet.
>
> At the same time, I am also checking with network security if it is a
> firewall issue or a port that they need to open for crawler-type
> traffic.
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 30, 2008 2:16 PM
> To: [email protected]
> Subject: Re: Indexing XML-based document format per DITA standard
>
> It looks like you can indeed connect to that v4 machine from the machine
> running Nutch. I can't tell from here why you got the error you
> originally reported. Does it happen every time you try running Nutch?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: "Del Rio, Ann"
> > To: [email protected]
> > Sent: Friday, May 30, 2008 3:23:00 PM
> > Subject: RE: Indexing XML-based document format per DITA standard
> >
> > Thank you for your response and help Otis!
> > I greatly appreciate it and am sure others will.
> >
> > I did a wget from the machine where I was running Nutch and got the
> > following...
> >
> > -bash-2.05b$ wget http://v4:10000/lib
> > --10:37:52-- http://v4:10000/lib
> > => `lib.1'
> > Resolving v4... done.
> > Connecting to v4:10000... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 2,717 [text/html]
> >
> > 100%[====================================>] 2,717 2.59M/s ETA 00:00
> >
> > 10:37:52 (2.59 MB/s) - `lib.1' saved [2717/2717]
> >
> > Then I tried to telnet too and got a connection closed.
> >
> > -bash-2.05b$ telnet
> > telnet> open
> > (to) v4 10000
> > Trying xxx.xxx.231.40...
> > Connected to xxxx.ebay.com (xxx.xxx.231.40).
> > Escape character is '^]'.
> > Connection closed by foreign host.
> >
> > Doesn't telnet service/ports need to be enabled on the other end's
> > server first before we can telnet to it? Does the nutch crawler use
> > telnet to fetch the URL?
> >
> > Apparently, we do not use proxy hosts and ports here at eBay in any of
> > our APIs, so I am not sure how to get those. But I will still ask
> > around if they know what proxy hosts and ports we are using.
> >
> > Also, when I browse the URL it is fine, so I checked my IE browser
> > options and checked on the LAN Settings to look for the proxy address
> > and port and we are not using any as well.
> >
> > Thanks,
> > Ann Del Rio
> >
> > -----Original Message-----
> > From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> > Sent: Friday, May 30, 2008 10:17 AM
> > To: [email protected]
> > Subject: Re: Indexing XML-based document format per DITA standard
> >
> > Can you connect to it (telnet to it, for example) directly from the
> > machine(s) where you are running Nutch?
> > (this is a network issue, nothing to do with XML/parsing)
> >
> > Maybe you need to go through some eBay proxy?
> >
> > Otis
> > --
> > Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> >
> > ----- Original Message ----
> > > From: "Del Rio, Ann"
> > > To: [email protected]
> > > Sent: Friday, May 30, 2008 6:24:01 PM
> > > Subject: Indexing XML-based document format per DITA standard
> > >
> > > I added a new URL to index which is in a XML-based document format
> > > per DITA standard and I get the following error.
> > >
> > > java.net.SocketException: Connection reset
> > > 2008-05-27 17:56:58 ERROR Http at java.net.SocketInputStream.read(SocketInputStream.java:168)
> > > 2008-05-27 17:56:58 ERROR Http at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > > 2008-05-27 17:56:58 ERROR Http at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> > > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > > 2008-05-27 17:56:58 INFO Fetcher fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> > >
> > > i googled and found no solution so far...
> > >
> > > do i need to setup some config / host file to specify the ports?
> > > the URL is an internal website.
> > >
> > > any response will be appreciated.
> > >
> > > Thanks,
> > > Ann Del Rio
> > > Senior Developer
> > > eBay, Inc
