Hi Team 18,

This would be a good question and discussion to move to the [email protected] list, so I'm moving it there. Mike Joyce and Kim Whitehall, who are working on Nutch and Selenium, can help there.
Cheers,
Chris

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: [email protected]
WWW: http://sunset.usc.edu/
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Charan Shampur <[email protected]>
Date: Saturday, September 26, 2015 at 7:19 PM
To: jpluser <[email protected]>
Subject: CSCI - 572: Team 18 : Questions

>Hello Professor,
>
>We started building the handler for the interactive-selenium plugin. We
>figured out how to write the processDriver() method of the
>InteractiveSeleniumHandler class, but we are unable to figure out how to
>pass the list of URLs to its shouldProcessURL() method.
>
>We made the necessary configuration changes in nutch-site.xml, along with
>the other changes mentioned in many online tutorials. After a fresh
>"ant runtime" we started crawling the URLs. The Firefox browser opens for
>some of the URLs, but the crawler prints
>"java.net.SocketTimeoutException: Read timed out" and continues with the
>next set of URLs. We believe this means no request is being made, since
>the browser receives no URL. So the next time the browser opened, we
>manually typed a random URL, and we could then see the crawler continue
>execution with the newly fetched data.
>
>When the browser opens, its URL field is always empty. We do not
>understand how to pass the URL to the browser once it opens, so that the
>whole process is automated.
>
>Thanks,
>Team 18
>
>On Fri, Sep 25, 2015 at 9:32 PM, Christian Alan Mattmann
><[email protected]> wrote:
>
>Hi Charan,
>
>You should get codes like DB_UNFETCHED, DB_GONE, etc.
>via nutchpy. Roughly, you can map those to various HTTP
>codes, like DB_GONE (which is 404), etc.
>
>Does that help?
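[Editor's note: the status-to-HTTP mapping Chris suggests can be sketched in Python. The DB_* names follow Nutch's CrawlDb statuses; the HTTP analogues are rough correspondences for analysis, not values Nutch itself reports, and the (url, status) records here are made-up illustrations of what one might pull out of a crawldb with nutchpy.]

```python
from collections import Counter

# Rough sketch: map CrawlDb status names (as seen via nutchpy) to
# approximate HTTP status codes. These analogues are editorial
# assumptions, not values reported by Nutch.
STATUS_TO_HTTP = {
    "db_unfetched": None,   # never fetched, so no HTTP response yet
    "db_fetched": 200,      # fetched successfully
    "db_gone": 404,         # permanently failed (404/403-style errors)
    "db_redir_temp": 302,   # temporary redirect
    "db_redir_perm": 301,   # permanent redirect
    "db_notmodified": 304,  # unchanged since the last fetch
}

def approx_http_code(status_name):
    """Return a rough HTTP analogue for a CrawlDb status name (None if unknown)."""
    return STATUS_TO_HTTP.get(status_name.lower())

# Hypothetical (url, status) records, e.g. rows read from a crawldb
# via nutchpy; the data below is invented for illustration.
records = [
    ("http://example.com/a", "db_fetched"),
    ("http://example.com/b", "db_gone"),
    ("http://example.com/c", "db_unfetched"),
]
status_counts = Counter(status for _, status in records)
```

With a real crawldb, one would replace `records` with rows read via nutchpy and tally `status_counts` the same way.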
>Cheers,
>Chris
>
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>Chris Mattmann, Ph.D.
>Adjunct Associate Professor, Computer Science Department
>University of Southern California
>Los Angeles, CA 90089 USA
>Email: [email protected]
>WWW: http://sunset.usc.edu/
>+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>-----Original Message-----
>From: Charan Shampur <[email protected]>
>Date: Friday, September 25, 2015 at 9:22 PM
>To: jpluser <[email protected]>
>Subject: Question with assignment 1
>
>>Hello Professor,
>>
>>I examined the three Nutch datasets using nutchpy and was able to
>>extract the different image MIME types encountered while fetching the
>>image URLs. However, I was unable to find the HTTP response codes of
>>the URLs being fetched.
>>
>>The hadoop.log files have a list of URLs which were not fetched due to
>>response code 403 and other issues. Is this the place to find those 100
>>URLs?
>>
>>Professor, kindly guide us in the right direction.
>>
>>Thanks,
>>Charan
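[Editor's note: the hadoop.log approach Charan mentions can be sketched as a small log scan that collects URLs whose fetches failed. The "fetch of <url> failed with: ..." line format and the sample lines below are assumptions for illustration; real hadoop.log formats vary by Nutch version and logging configuration, so the regex would need adapting.]

```python
import re

# Assumed failure-line shape; adjust to the actual hadoop.log format.
FAILED_FETCH = re.compile(r"fetch of (\S+) failed with:\s*(.*)")

def failed_urls(log_lines):
    """Collect (url, reason) pairs for failed fetches found in the log lines."""
    results = []
    for line in log_lines:
        m = FAILED_FETCH.search(line)
        if m:
            results.append((m.group(1), m.group(2)))
    return results

# Hypothetical sample lines standing in for hadoop.log content.
sample = [
    "2015-09-25 21:00:01,123 INFO fetcher.Fetcher - fetching http://example.com/ok",
    "2015-09-25 21:00:02,456 INFO fetcher.Fetcher - fetch of http://example.com/forbidden failed with: Http code=403",
]
failures = failed_urls(sample)
```

In practice one would pass `open("hadoop.log")` instead of `sample`, though, per Chris's reply above, the crawldb statuses read via nutchpy are the more systematic source for per-URL fetch outcomes.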

