shouldProcessURL takes a URL and returns true/false to indicate whether
the handler should process that URL. You can put whatever logic you like
in your handler to decide which URLs you want to process. You'll notice
that the simple example in the codebase [1] just returns true, i.e., it
processes all URLs.
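
Shape-wise, a handler is just those two methods. Here's a minimal sketch
of what one can look like (I'm assuming processDriver takes the WebDriver
and returns a String of page data, and that the interface lives in the
same package as the DefaultHandler from [1]; double-check the signatures
against trunk):

import org.apache.nutch.protocol.interactiveselenium.handlers.InteractiveSeleniumHandler;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;

public class MyHandler implements InteractiveSeleniumHandler {

    // Interact with the already-loaded page and return whatever data you
    // want handed back with the fetched content (here, the rendered body
    // after any JavaScript has run).
    public String processDriver(WebDriver driver) {
        return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
    }

    // Called per URL by the protocol plugin; returning true means
    // "run processDriver on this URL". The DefaultHandler [1] always
    // returns true, i.e. it processes every URL.
    public boolean shouldProcessURL(String URL) {
        return true;
    }
}

The class name also has to be registered with the plugin's handler
property in your config so it gets picked up (check nutch-default.xml for
the exact property name).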

Have you gotten your handler to run on a simple example? Perhaps a single
page test would be a good place to start to make sure everything is going
as planned.
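
For that kind of single-page sanity check, one option (just a sketch,
with a made-up URL) is to have shouldProcessURL accept only the page you
are testing, so the Selenium interaction only fires where you expect it:

// In the handler sketch above, only accept one known test page.
// "http://example.com/test.html" is a placeholder; use your own seed URL.
public boolean shouldProcessURL(String URL) {
    return "http://example.com/test.html".equals(URL);
}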

[1]
https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java


-- Jimmy

On Sun, Sep 27, 2015 at 11:08 AM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> Hi Team 18,
>
> This would be a good question and discussion to move
> to the [email protected] list. So I’m moving it there.
> Mike Joyce and Kim Whitehall who are working on Nutch and
> Selenium can help there.
>
> Cheers,
> Chris
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Adjunct Associate Professor, Computer Science Department
> University of Southern California
> Los Angeles, CA 90089 USA
> Email: [email protected]
> WWW: http://sunset.usc.edu/
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
> -----Original Message-----
> From: Charan Shampur <[email protected]>
> Date: Saturday, September 26, 2015 at 7:19 PM
> To: jpluser <[email protected]>
> Subject: CSCI - 572: Team 18 : Questions
>
> >
> >
> >Hello Professor,
> >
> >
> >We started building the handler for the interactive-selenium plugin. We
> >figured out how to write the "processDriver()" method as part of the
> >InteractiveSeleniumHandler class. We are unable to figure out how to
> >pass the list of URLs to the "shouldProcessURL()" method of
> >InteractiveSeleniumHandler.
> >
> >
> >We made the necessary configuration changes in nutch-site.xml and the
> >other needed changes as mentioned in many online tutorials. After doing
> >a fresh "ant runtime" and starting to crawl the URLs, a Mozilla browser
> >opens up for some of the URLs, but the crawler displays
> >"java.net.SocketTimeoutException: Read Timed out" and continues with the
> >next set of URLs. We believe this message means the request is not being
> >made since there is no URL, so the next time the browser opened we
> >manually typed some random URL and then saw that the crawler continued
> >execution with the newly fetched data.
> >
> >
> >When the browser opens, the URL field is always empty. We are not able
> >to understand how to pass the URL to the browser once it opens up so
> >that the whole process is automated.
> >
> >
> >Thanks
> >Team 18
> >
> >
> >
> >
> >
> >On Fri, Sep 25, 2015 at 9:32 PM, Christian Alan Mattmann
> ><[email protected]> wrote:
> >
> >Hi Charan,
> >
> >You should get codes like DB_UNFETCHED, DB_GONE, etc. via nutchpy.
> >Roughly, you can map those to the corresponding HTTP status codes,
> >e.g. DB_GONE corresponds to 404.
> >
> >Does that help?
> >
> >Cheers,
> >Chris
> >
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Chris Mattmann, Ph.D.
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California
> >Los Angeles, CA 90089 USA
> >Email: [email protected]
> >WWW: http://sunset.usc.edu/
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >-----Original Message-----
> >From: Charan Shampur <[email protected]>
> >Date: Friday, September 25, 2015 at 9:22 PM
> >To: jpluser <[email protected]>
> >Subject: Question with assignment 1
> >
> >>Hello professor,
> >>
> >>
> >>I examined the three Nutch datasets using nutchpy. I was able to
> >>extract the different image MIME types that were encountered while
> >>fetching the image URLs. However, I was unable to find the HTTP
> >>response codes of the URLs that were being fetched.
> >>
> >>
> >>The hadoop.log files have a list of URLs which were not fetched due to
> >>response code 403 and other issues. Is this the place to find those
> >>100 URLs?
> >>
> >>
> >>Professor, kindly guide us in the right direction.
> >>
> >>
> >>Thanks,
> >>Charan
> >>
> >
>
>
