Hi Ann,

Regarding frames: they are not the problem here (with Nutch), as Nutch doesn't 
even seem to be able to connect to your server, so it never gets to see the 
HTML and the frames in it.  Perhaps there is something useful in the logs, not 
on the Nutch side but on that v4 server.
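
To check your network admin's HTTP 1.0 vs. 1.1 theory outside of Nutch, a raw
socket test along these lines might help. This is only a minimal sketch: the
host, port and path are taken from your JUnit test, while the class name
HttpVersionProbe and its two helper methods are made up for illustration and
are not part of Nutch.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not a Nutch class: sends the same GET request once as
// HTTP/1.0 and once as HTTP/1.1 and prints what the server answers.
public class HttpVersionProbe {

    // Builds a minimal GET request for the given HTTP version ("1.0" or "1.1").
    static String buildRequest(String version, String host, int port, String path) {
        StringBuilder sb = new StringBuilder();
        sb.append("GET ").append(path).append(" HTTP/").append(version).append("\r\n");
        sb.append("Host: ").append(host).append(':').append(port).append("\r\n");
        sb.append("Connection: close\r\n");
        sb.append("\r\n");
        return sb.toString();
    }

    // Sends the request and returns the status line, or the exception text on failure.
    static String probe(String version, String host, int port, String path) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 10_000); // 10 s connect timeout
            socket.setSoTimeout(10_000);                               // 10 s read timeout
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(version, host, port, path)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), StandardCharsets.ISO_8859_1));
            String statusLine = in.readLine();
            return statusLine == null ? "connection closed without a response" : statusLine;
        } catch (IOException e) {
            // A "Connection reset" would surface here as a SocketException.
            return "FAILED: " + e;
        }
    }

    public static void main(String[] args) {
        // Assumed values, copied from the JUnit test in the quoted message.
        String host = "v4";
        int port = 10000;
        String path = "/lib";
        for (String version : new String[] {"1.0", "1.1"}) {
            System.out.println("HTTP/" + version + " -> " + probe(version, host, port, path));
        }
    }
}
```

If only one of the two requests fails with a connection reset, that would
point at the server or something in front of it rather than at Nutch.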


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: "Del Rio, Ann" <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Thursday, June 19, 2008 6:54:15 PM
> Subject: RE: how does nutch connect to urls internally?
> 
> 
> Hello,
> 
> I tried this simple JUnit program before trying the Nutch classes for
> HTTP:
> 
>     import java.io.BufferedInputStream;
>     import java.io.StringWriter;
>     import java.net.URL;
>     import junit.framework.TestCase;
>
>     public class BinDoxTest extends TestCase {
>         // Fetches the start page over HTTP and prints it to stdout.
>         public void testHttp() {
>             try {
>                 URL url = new URL("http://v4:10000/lib");
>                 StringWriter writer = new StringWriter();
>                 BufferedInputStream in = new BufferedInputStream(url.openStream());
>                 // Copy the response byte by byte until end of stream.
>                 for (int c = in.read(); c != -1; c = in.read()) {
>                     writer.write(c);
>                 }
>                 in.close();
>                 System.out.println(writer);
>             } catch (Exception e) {
>                 // Report the failure instead of silently swallowing it.
>                 fail("HTTP request failed: " + e);
>             }
>         }
>     }
> 
> And got the following output which is the same as if I issued a wget in
> linux shell.
> 
> 
> "http://www.w3.org/TR/html4/loose.dtd">
> 
> 
> 
> Bindox Library
> 
> href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico"
> type="image/vnd.microsoft.icon">
> 
> 
> 
> 
> 
> border="4"  frameborder="1"   scrolling="no">
> 
>        
> marginwidth="0" marginheight="0" scrolling="no" frameborder="1"
> resize=yes>
> 
>        
> src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp'
> marginwidth="0" marginheight="0" scrolling="no" frameborder="0"
> resize=yes>
> 
> 
> 
> 
> 
> Can you please help provide enlightenment if there is something funky
> with this starting page of the website from where Nutch gives me a
> "SocketException: Connection Reset Error" when I run the nutch to start
> indexing from the page above? Can nutch index "frames"?
> 
> I will try http next as our network admin said it might be an issue with
> VM Ware freezing or timing-out for http 1.0 and not http 1.1
> 
> Thanks,
> Ann Del Rio
> 
> -----Original Message-----
> From: Susam Pal [mailto:[EMAIL PROTECTED] 
> Sent: Monday, June 16, 2008 9:48 AM
> To: [email protected]
> Subject: Re: how does nutch connect to urls internally?
> 
> Hi,
> 
> It depends on which protocol plugin is enabled in your
> 'conf/nutch-site.xml'. The property to look for is 'plugin.includes'
> in the XML file. If this is not present in 'conf/nutch-site.xml', it
> means you are using the default 'plugin.includes' of
> 'conf/nutch-default.xml'.
> 
> If protocol-http is enabled, then you have to go through the code in:-
> 
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
> src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
> 
> If protocol-httpclient is enabled, then you have to go through:-
> 
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
> src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
> 
> Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
> about the problems. The logs are written to 'logs/hadoop.log'.
> To enable the DEBUG logs for a particular package, say, the httpclient
> package, you can open 'conf/log4j.properties' and add the following
> line:
> 
> log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout
> 
> Regards,
> Susam Pal
> 
> On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann wrote:
> > Good morning,
> >
> > Can you please point me to a Nutch documentation where I can find how 
> > nutch connects to the webpages when it crawls? I think it is through 
> > HTTP but i would like to confirm and get more details so i can write a
> > very small test java program to connect to one of the webpages i am 
> > having trouble connecting / crawling. I bought Lucene in Action and am
> > half way thru the book and so far there is very little about Nutch.
> >
> > Thanks,
> > Ann Del Rio
