Hi Charan,

Thanks for your questions. Please copy your emails to [email protected] and subscribe there, as I believe you will find more help on the list.
Here are the answers:

-----Original Message-----
From: Charan Shampur <[email protected]>
Date: Sunday, September 20, 2015 at 3:55 PM
To: jpluser <[email protected]>
Subject: Questions regarding CS-572 assignment 1

>Hello professor,
>
>
>Sorry to interrupt you, I have few questions wandering in my mind from
>last 2 days.
>Here are those:
>
>
>1) I was unable to find any guidelines for using nutchpy to extract data
>from the crawldb. Can you provide me with Some pointers to resources that
>will help.
>

The README.md on nutchpy explains how to use it to read Sequence Files:

https://github.com/ContinuumIO/nutchpy/#running

Then, if you look up the Nutch Sequence File format:

http://wiki.apache.org/nutch/NutchFileFormats

You should be good.

>
>2) How do we read or understand the data extracted by nutch?.I was able
>to collect the list of urls that are crawled by running the readdb
>command.
>For others, how do we do it?

You read the data out of the Nutch DB using NutchPy. So, in fact, readDB
is a great tool (there are also tools to read the LinkDB), but you need to
write a program using NutchPy.

>
>
>3) Is there any API or command that interacts with nutch crawldb to get
>the Statistical data(Mime type, Http response, Un-fetched urls, etc) ?

Yep the data is stored in the Nutch Data file formats specified and linked
above.

>
>
>I have been reading through the nutch/wiki and was unable to figure it
>out.
>
>
>
>professor, Kindly help me in resolving these...
>
>
>Thanks,
>Charan

HTH.

Cheers,
Chris

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Adjunct Associate Professor, Computer Science Department
University of Southern California
Los Angeles, CA 90089 USA
Email: [email protected]
WWW: http://sunset.usc.edu/
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++
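As a rough illustration of the kind of program the answers to questions 2 and 3 point toward, here is a minimal sketch that tallies CrawlDb statuses (for example, un-fetched URLs). It deliberately does not call NutchPy itself. It assumes you have already read the CrawlDb records into (url, text) pairs, and that each text contains a line shaped like "Status: 2 (db_fetched)", as in the text dump written by readdb. Both the record shape and the sample data are assumptions made for illustration, and the name tally_statuses is hypothetical, so adapt the parsing to whatever your reader actually returns.

```python
import re
from collections import Counter

# Assumed line shape inside a CrawlDatum text dump, e.g. "Status: 2 (db_fetched)".
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)", re.MULTILINE)


def tally_statuses(records):
    """Count CrawlDb status names over an iterable of (url, datum_text) pairs.

    Records without a recognizable Status line are counted as "unknown".
    """
    counts = Counter()
    for _url, datum_text in records:
        match = STATUS_RE.search(datum_text)
        counts[match.group(1) if match else "unknown"] += 1
    return counts


# Hypothetical sample records, made up for illustration only.
SAMPLE_RECORDS = [
    ("http://example.com/", "Version: 7\nStatus: 2 (db_fetched)\nScore: 1.0\n"),
    ("http://example.com/a", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.5\n"),
    ("http://example.com/b", "Version: 7\nStatus: 1 (db_unfetched)\nScore: 0.5\n"),
]

if __name__ == "__main__":
    # Print each status with its count, most frequent first.
    for status, count in tally_statuses(SAMPLE_RECORDS).most_common():
        print(status, count)
```

The same pattern should extend to other per-record fields, such as a content type stored in the record metadata, by adding another regular expression, provided those fields are actually present in your crawl data.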

