Hello,

I tried this simple JUnit program before trying the Nutch classes
for HTTP:

        import java.io.BufferedInputStream;
        import java.io.StringWriter;
        import java.net.URL;
        import junit.framework.TestCase;

        public class BinDoxTest extends TestCase {
                // Fetches the start page and prints it. Exceptions are no longer
                // swallowed, so a connection problem fails the test visibly.
                public void testHttp() throws Exception {
                        URL url = new URL("http://v4:10000/lib");
                        StringWriter writer = new StringWriter();
                        BufferedInputStream in = new BufferedInputStream(url.openStream());
                        try {
                                // Copy the response one byte at a time (adequate for ASCII output).
                                for (int c = in.read(); c != -1; c = in.read()) {
                                        writer.write(c);
                                }
                        } finally {
                                in.close();
                        }
                        System.out.println(writer);
                }
        }

It produced the following output, which is the same as what I get when
I run wget from a Linux shell.

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<HTML>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title>Bindox Library</title>
<link rel="icon"
href="/classpath/com/ebay/content/sharedcontent/images/favicon.ico"
type="image/vnd.microsoft.icon">
<script language="JavaScript">

function topicLoaded(href, title) {
        ContentFrame.ContentToolbarFrame.setTitle(title);
}

var maximizeListeners=new Object();
function registerMaximizeListener(name, listener){
        maximizeListeners[name]=listener;
}
function notifyMaximizeListeners(name, maximizedNotRestored){
        maximizeListeners[name](maximizedNotRestored);
}

var leftCols = "29.5%";
var rightCols = "70.5%";

// called from *Toolbar pages
function toggleFrame(title)
{
        var frameset = document.getElementById("BindoxFrameset"); 
        var navFrameSize = frameset.getAttribute("cols");
        var comma = navFrameSize.indexOf(',');
        var left = navFrameSize.substring(0,comma);
        var right = navFrameSize.substring(comma+1);

        if (left == "*" || right == "*") {
                // restore frames
                frameset.frameSpacing="3";
                frameset.setAttribute("border", "6");
                frameset.setAttribute("cols", leftCols+","+rightCols);
                notifyMaximizeListeners(title, false);
        } else {
                // the "cols" attribute is not always accurate, especially after resizing.
                // offsetWidth is also not accurate, so we do a combination of both and
                // should get a reasonable behavior

                var leftSize = NavFrame.document.body.offsetWidth;
                var rightSize = ContentFrame.document.body.offsetWidth;

                
                leftCols = leftSize * 100 / (leftSize + rightSize);
                rightCols = 100 - leftCols;

                // maximize the frame.
                //leftCols = left;
                //rightCols = right;
                if (title == "Contents") // this is the content toolbar
                        frameset.setAttribute("cols", "*,100%");
                else // this is the left side for left-to-right rendering
                        frameset.setAttribute("cols", "100%,*");
        
                frameset.frameSpacing="0";
                frameset.setAttribute("border", "1");
                notifyMaximizeListeners(title, true);
        }
}

</script>

</head>

<frameset id="BindoxFrameset" cols="29.5%,70.5%" framespacing="4"
border="4"  frameborder="1"   scrolling="no">

        <frame class="nav" name="NavFrame" title="Layout frame: NavFrame"
src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null'
marginwidth="0" marginheight="0" scrolling="no" frameborder="1"
resize=yes>

        <frame class="content" name="ContentFrame" title="Layout frame: ContentFrame"
src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp'
marginwidth="0" marginheight="0" scrolling="no" frameborder="0"
resize=yes>

</frameset>
</HTML>


Can you please help me understand whether there is something unusual
about this starting page of the website? Nutch gives me a
"SocketException: Connection Reset Error" when I run it to start
indexing from the page above. Can Nutch index "frames"?
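
Since the start page is only a frameset, one thing that could be checked
outside Nutch is whether the two frame targets are reachable on their
own. Below is a minimal sketch that pulls the src attributes out of the
page and resolves them against the start URL. It assumes a simple regex
is good enough for this particular page (it is not a full HTML parser),
the class name FrameSrcExtractor is made up for illustration, and it
says nothing about how Nutch itself treats frames.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch.
public class FrameSrcExtractor {
    // Matches the quoted src value of a <frame> tag (not <frameset>).
    // Deliberately simple: it assumes src is quoted and contains no spaces.
    private static final Pattern FRAME_SRC = Pattern.compile(
            "<frame\\b[^>]*?\\bsrc\\s*=\\s*(['\"])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the absolute URL of every frame found in the HTML.
    public static List<String> frameUrls(String baseUrl, String html) {
        List<String> urls = new ArrayList<>();
        URI base = URI.create(baseUrl);
        Matcher m = FRAME_SRC.matcher(html);
        while (m.find()) {
            urls.add(base.resolve(m.group(2)).toString());
        }
        return urls;
    }

    public static void main(String[] args) {
        // Shortened copy of the frameset from the output above.
        String html = "<frameset id=\"BindoxFrameset\" cols=\"29.5%,70.5%\">"
                + "<frame name=\"NavFrame\""
                + " src='/com/ebay/content/sharedcontent/toc/NavFrame.jsp?null'>"
                + "<frame name=\"ContentFrame\""
                + " src='/com/ebay/content/sharedcontent/topic/ContentFrame.jsp'>"
                + "</frameset>";
        for (String url : frameUrls("http://v4:10000/lib", html)) {
            System.out.println(url);
        }
    }
}
```

Each printed URL could then be fetched with the same JUnit test, or used
as a seed URL, to see whether the frame pages respond differently from
the frameset page.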

I will try http next, as our network admin said it might be an issue
with VMware freezing or timing out for HTTP 1.0 but not HTTP 1.1.
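
The HTTP 1.0 versus 1.1 theory could be tested without Nutch by sending
the same GET request over a raw socket with each version string and
comparing what comes back. The sketch below assumes the host, port and
path from my test above (v4, 10000, /lib) and a 10 second timeout chosen
arbitrarily. The class name HttpVersionProbe is made up, and I am not
claiming this is the request Nutch's plugins actually send.

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical probe, not part of Nutch.
public class HttpVersionProbe {
    // Builds a minimal GET request for the given HTTP version ("1.0" or "1.1").
    public static String buildRequest(String host, int port, String path, String version) {
        return "GET " + path + " HTTP/" + version + "\r\n"
                + "Host: " + host + ":" + port + "\r\n"
                + "Connection: close\r\n"
                + "\r\n";
    }

    // Sends the request and returns only the first line of the response.
    public static String statusLine(String host, int port, String path, String version)
            throws Exception {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), 10_000); // assumed timeout
            socket.setSoTimeout(10_000);
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, port, path, version)
                    .getBytes(StandardCharsets.US_ASCII));
            out.flush();
            InputStream in = socket.getInputStream();
            StringBuilder line = new StringBuilder();
            for (int c = in.read(); c != -1 && c != '\n'; c = in.read()) {
                if (c != '\r') {
                    line.append((char) c);
                }
            }
            return line.toString();
        }
    }

    public static void main(String[] args) {
        for (String version : new String[] {"1.0", "1.1"}) {
            try {
                System.out.println("HTTP/" + version + " -> "
                        + statusLine("v4", 10000, "/lib", version));
            } catch (Exception e) {
                // A reset or timeout for only one version would support the theory.
                System.out.println("HTTP/" + version + " -> " + e);
            }
        }
    }
}
```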

Thanks,
Ann Del Rio

-----Original Message-----
From: Susam Pal [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 16, 2008 9:48 AM
To: [email protected]
Subject: Re: how does nutch connect to urls internally?

Hi,

It depends on which protocol plugin is enabled in your
'conf/nutch-site.xml'. The property to look for is 'plugins.include'
in the XML file. If this is not present in 'conf/nutch-site.xml', it
means you are using the default 'plugins.include' of
'conf/nutch-default.xml'.
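
If it helps, the lookup described above can be scripted. This is only a
rough sketch using the standard Java XML parser, assuming it is run from
the Nutch installation directory and that both files use the usual
<property><name>...</name><value>...</value></property> layout. The
class name PluginsIncludeLookup is made up for illustration and is not
something Nutch ships.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch.
public class PluginsIncludeLookup {
    // Returns the <value> of the <property> with the given <name>, or null if absent.
    public static String findProperty(InputStream xml, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(xml);
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse configuration XML", e);
        }
        NodeList props = doc.getElementsByTagName("property");
        for (int i = 0; i < props.getLength(); i++) {
            Element prop = (Element) props.item(i);
            NodeList names = prop.getElementsByTagName("name");
            NodeList values = prop.getElementsByTagName("value");
            if (names.getLength() > 0 && values.getLength() > 0
                    && name.equals(names.item(0).getTextContent().trim())) {
                return values.item(0).getTextContent().trim();
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        // The site file is checked first; the default file is the fallback.
        String[] files = {"conf/nutch-site.xml", "conf/nutch-default.xml"};
        for (String file : files) {
            File f = new File(file);
            if (!f.exists()) {
                continue;
            }
            try (InputStream in = new FileInputStream(f)) {
                String value = findProperty(in, "plugins.include");
                if (value != null) {
                    System.out.println(file + ": " + value);
                    return;
                }
            }
        }
        System.out.println("plugins.include not found in either file");
    }
}
```

Whichever protocol plugin name appears in the printed value tells you
which of the source files below is the one to read.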

If protocol-http is enabled, then you have to go through the code in:-

src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

If protocol-httpclient is enabled, then you have to go through:-

src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/Http.java
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java

Enabling DEBUG logs in 'conf/log4j.properties' will also give you clues
about the problems. The logs are written to 'logs/hadoop.log'.
To enable the DEBUG logs for a particular package, say, the httpclient
package, you can open 'conf/log4j.properties' and add the following
line:

log4j.logger.org.apache.nutch.protocol.httpclient=DEBUG,cmdstdout

Regards,
Susam Pal

On Mon, Jun 16, 2008 at 9:52 PM, Del Rio, Ann <[EMAIL PROTECTED]> wrote:
> Good morning,
>
> Can you please point me to a Nutch documentation where I can find how 
> nutch connects to the webpages when it crawls? I think it is through 
> HTTP but i would like to confirm and get more details so i can write a
> very small test java program to connect to one of the webpages i am
> having trouble connecting / crawling. I bought Lucene in Action and am
> half way thru the book and so far there is very little about Nutch.
>
> Thanks,
> Ann Del Rio
