Yes, I can reproduce it, and it happens every time. Apparently it only happens with this website, which is why I was wondering whether it has something to do with the way the pages are formatted or fetched. All the other internal websites that I am crawling are fine. The difference is that the other URLs do not have port numbers and are mostly static pages, whereas this one uses a DITA framework that fetches and redirects the pages from a servlet.
At the same time, I am also checking with network security if it is a firewall issue or a port that they need to open for crawler-type traffic.

Thanks,
Ann Del Rio

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Friday, May 30, 2008 2:16 PM
To: [email protected]
Subject: Re: Indexing XML-based document format per DITA standard

It looks like you can indeed connect to that v4 machine from the machine running Nutch. I can't tell from here why you got the error you originally reported. Does it happen every time you try running Nutch?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, May 30, 2008 3:23:00 PM
> Subject: RE: Indexing XML-based document format per DITA standard
>
> Thank you for your response and help Otis!
> I greatly appreciate it and am sure others will.
>
> I did a wget from the machine where I was running Nutch and got the
> following...
>
> -bash-2.05b$ wget http://v4:10000/lib
> --10:37:52-- http://v4:10000/lib
> => `lib.1'
> Resolving v4... done.
> Connecting to v4:10000... connected.
> HTTP request sent, awaiting response... 200 OK
> Length: 2,717 [text/html]
> 100%[====================================>] 2,717 2.59M/s ETA 00:00
> 10:37:52 (2.59 MB/s) - `lib.1' saved [2717/2717]
>
> Then I tried to telnet too and got a connection closed.
>
> -bash-2.05b$ telnet
> telnet> open
> (to) v4 10000
> Trying xxx.xxx.231.40...
> Connected to xxxx.ebay.com (xxx.xxx.231.40).
> Escape character is '^]'.
> Connection closed by foreign host.
>
> Doesn't telnet service/ports need to be enabled on the other end's
> server first before we can telnet to it? Does the nutch crawler use
> telnet to fetch the URL?
>
> Apparently, we do not use proxy hosts and ports here at eBay in any of
> our APIs, so I am not sure how to get those.
> But I will still ask
> around if they know what proxy hosts and ports we are using.
>
> Also, when I browse the URL it is fine, so I checked my IE browser
> options and checked on the LAN Settings to look for the proxy address
> and port and we are not using any as well.
>
> Thanks,
> Ann Del Rio
>
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
> Sent: Friday, May 30, 2008 10:17 AM
> To: [email protected]
> Subject: Re: Indexing XML-based document format per DITA standard
>
> Can you connect to it (telnet to it, for example) directly from the
> machine(s) where you are running Nutch?
> (this is a network issue, nothing to do with XML/parsing)
>
> Maybe you need to go through some eBay proxy?
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: "Del Rio, Ann"
> > To: [email protected]
> > Sent: Friday, May 30, 2008 6:24:01 PM
> > Subject: Indexing XML-based document format per DITA standard
> >
> > I added a new URL to index which is in a XML-based document format per
> > DITA standard and I get the following error.
> >
> > java.net.SocketException: Connection reset
> > 2008-05-27 17:56:58 ERROR Http at java.net.SocketInputStream.read(SocketInputStream.java:168)
> > 2008-05-27 17:56:58 ERROR Http at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > 2008-05-27 17:56:58 ERROR Http at java.io.BufferedInputStream.read(BufferedInputStream.java:235)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpParser.readRawLine(HttpParser.java:77)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpParser.readLine(HttpParser.java:105)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpConnection.readLine(HttpConnection.java:1115)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.MultiThreadedHttpConnectionManager$HttpConnectionAdapter.readLine(MultiThreadedHttpConnectionManager.java:1373)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.readStatusLine(HttpMethodBase.java:1832)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.readResponse(HttpMethodBase.java:1590)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodBase.execute(HttpMethodBase.java:995)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:397)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:170)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:396)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:324)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.httpclient.HttpResponse.<init>(HttpResponse.java:96)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:99)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:219)
> > 2008-05-27 17:56:58 ERROR Http at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:145)
> > 2008-05-27 17:56:58 INFO Fetcher fetch of http://v4:10000/lib failed with: java.net.SocketException: Connection reset
> >
> > i googled and found no solution so far...
> >
> > do i need to setup some config / host file to specify the ports?
> > the URL is an internal website.
> >
> > any response will be appreciated.
> >
> > Thanks,
> > Ann Del Rio
> > Senior Developer
> > eBay, Inc
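
For anyone following this thread with the same symptom (wget gets a 200 OK while the crawler sees "Connection reset"), one possible next step is to send the same request by hand with different request headers and compare the outcomes. The following is only a minimal sketch in Python using the standard library; the function name `probe`, its parameters, and the default User-Agent string are made up for illustration, and the idea that the server might treat clients differently based on their headers is an assumption, not something established in this thread.

```python
import socket


def probe(host, port, path="/", user_agent="probe/0.1", timeout=10.0):
    """Send one HTTP/1.0 GET over a plain TCP socket.

    Returns the first line of the response (the status line), or a short
    description of how the connection failed. Hypothetical helper, not
    part of Nutch or any other tool mentioned in the thread.
    """
    request = (
        f"GET {path} HTTP/1.0\r\n"
        f"Host: {host}:{port}\r\n"
        f"User-Agent: {user_agent}\r\n"
        "Accept: */*\r\n"
        "\r\n"
    ).encode("ascii")
    data = b""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(request)
            # Read until the first response line is complete or the peer closes.
            while b"\n" not in data:
                chunk = sock.recv(4096)
                if not chunk:
                    break
                data += chunk
    except ConnectionResetError:
        # Same condition Java reports as "SocketException: Connection reset".
        return "connection reset"
    except socket.timeout:
        return "timed out"
    except OSError as exc:
        # Covers refused connections, DNS failures, and similar errors.
        return f"error: {exc}"
    if not data:
        return "closed without response"
    return data.splitlines()[0].decode("latin-1").strip()
```

Calling `probe("v4", 10000, "/lib", user_agent=...)` once with a wget-style agent string and once with whatever agent string the crawler is configured to send would show whether the two requests are answered differently from the same machine. If both return a status line, the headers are probably not the cause and the firewall question raised above remains the more likely lead.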
