shouldProcessURL takes a single URL and returns true or false to indicate whether the handler should process that URL. What logic you use in your handler to make that decision is up to you. Note that the simple example in the codebase [1] just returns true, i.e. it processes all URLs.
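As a rough sketch of handler logic that does something other than return true, the class below accepts only URLs on one host. It is a stand-in, not a real plugin class: an actual handler would implement the InteractiveSeleniumHandler interface and also define processDriver, and the class name ExampleHostHandler and the host example.com are made up for illustration. The point is only that the method receives one URL string per call and decides based on that string.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;

/**
 * Stand-in sketch of the shouldProcessURL decision. A real handler would
 * implement Nutch's InteractiveSeleniumHandler; that interface is left out
 * here so the example compiles on its own.
 */
public class ExampleHostHandler {

    // Hypothetical host, used only for illustration.
    private static final String TARGET_HOST = "example.com";

    /** Returns true only for URLs whose host is TARGET_HOST or a subdomain of it. */
    public boolean shouldProcessURL(String url) {
        if (url == null) {
            return false;
        }
        try {
            String host = new URI(url).getHost();
            if (host == null) {
                return false;
            }
            host = host.toLowerCase(Locale.ROOT);
            return host.equals(TARGET_HOST) || host.endsWith("." + TARGET_HOST);
        } catch (URISyntaxException e) {
            // Unparseable input is skipped rather than handed to the browser.
            return false;
        }
    }

    public static void main(String[] args) {
        ExampleHostHandler handler = new ExampleHostHandler();
        System.out.println(handler.shouldProcessURL("http://example.com/gallery?page=2"));
        System.out.println(handler.shouldProcessURL("http://other.org/index.html"));
    }
}
```

Calling the method by hand like this on a few URLs is a cheap way to check the filter before doing a full crawl.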
Have you gotten your handler to run on a simple example? Perhaps a single page test would be a good place to start to make sure everything is going as planned.

[1] https://github.com/apache/nutch/blob/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java

-- Jimmy

On Sun, Sep 27, 2015 at 11:08 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

> Hi Team 18,
>
> This would be a good question and discussion to move
> to the [email protected] list. So I’m moving it there.
> Mike Joyce and Kim Whitehall who are working on Nutch and
> Selenium can help there.
>
> Cheers,
> Chris
>
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Adjunct Associate Professor, Computer Science Department
> University of Southern California
> Los Angeles, CA 90089 USA
> Email: [email protected]
> WWW: http://sunset.usc.edu/
> +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Charan Shampur <[email protected]>
> Date: Saturday, September 26, 2015 at 7:19 PM
> To: jpluser <[email protected]>
> Subject: CSCI - 572: Team 18 : Questions
>
> >Hello Professor,
> >
> >We started building the handler for interactive-selenium plugin. We
> >figured out writing "processDriver()" method as part of
> >InteractiveSeleniumHandler class. We are unable to figure out how to pass
> >list of urls to "shouldProcessURL()" method of
> >InteractiveSeleniumHandler.
> >
> >We made the necessary configuration changes in nutch-site.xml and other
> >needed changes as mentioned in many online tutorials. After doing a fresh
> >"ant runtime" and started crawling for the urls, mozilla browser opens up
> >for some of the urls but crawler
> >displays "java.net.SocketTimeoutException: Read Timed out" and it is
> >continuing with next set of urls. We believe this message means the
> >request is not being made since there is no url.
> >so next time when the
> >browser opened we manually typed some random url
> >and then we could see that crawler continued execution with the newly
> >fetched data.
> >
> >When the browser opens, the url field will always be empty. We are not
> >able to understand how to pass url to the browser once it opens up so
> >that the whole process is automated.
> >
> >Thanks
> >Team 18
> >
> >On Fri, Sep 25, 2015 at 9:32 PM, Christian Alan Mattmann
> ><[email protected]> wrote:
> >
> >Hi Charan,
> >
> >You should get codes like DB_UNFETCHED, DB_GONE, etc etc
> >via nutchpy. Roughly you can map those to various (HTTP)
> >codes like DB_GONE (which is 404), etc.
> >
> >Does that help?
> >
> >Cheers,
> >Chris
> >
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >Chris Mattmann, Ph.D.
> >Adjunct Associate Professor, Computer Science Department
> >University of Southern California
> >Los Angeles, CA 90089 USA
> >Email: [email protected]
> >WWW: http://sunset.usc.edu/
> >+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >-----Original Message-----
> >From: Charan Shampur <[email protected]>
> >Date: Friday, September 25, 2015 at 9:22 PM
> >To: jpluser <[email protected]>
> >Subject: Question with assignment 1
> >
> >>Hello professor,
> >>
> >>I examined the three nutch
> >>datasets using nutchpy, I was able to extract the different image
> >>mime-types that were encountered while fetching the image
> >>urls, However I was unable to find the http response codes of the
> >>urls that were being fetched.
> >>
> >>hadoop.log files have a list of
> >>urls which are not fetched due to response code 403, and other issues. Is
> >>this the place to find those 100
> >>urls.
> >>
> >>professor, Kindly guide us in the right direction.
> >>
> >>Thanks,
> >>Charan

