Hi Andrzej,

Thanks a lot for pointing out the features to me. I greatly appreciate
the help. Things look a lot better now :)
Just one more thing: can you point me to any document/email/discussion
(internal or published) that would give me some insight into the
architecture of Nutch 0.8.x, and perhaps into the kind of data that goes
into each directory?

To check my own understanding in the meantime, I have appended a few
rough sketches below the quoted message of how I read the APIs you
mention - corrections welcome.

Thanks,
Gaurav

Andrzej Bialecki wrote:
>
> Gaurav Agarwal wrote:
>> Hi everyone,
>>
>> Definitely the advantage with 0.8.x is that it models almost every
>> operation as Map-Reduce jobs (which is amazing!), and is therefore
>> much more scalable; but in the absence of the APIs mentioned above it
>> does not help me much in building the web-link graph from the crawler
>> output.
>
> There is a similar API for reading from the DB, which is called
> CrawlDbReader. It is relatively simple compared to WebDBReader,
> because most of the support is already provided by Hadoop (i.e. the
> map-reduce framework).
>
> In 0.8 and later the information about pages and the information about
> links are split into two different DBs - crawldb and linkdb - but
> exactly the same information can be obtained from them as before.
>
>> I may be completely wrong here, and please correct me if I am, but it
>> looks like post-0.8.0 the thrust has been to develop the Nutch
>> project entirely as an indexing library/application, with the crawl
>> module itself losing its independence and decoupling. With 0.8.x, the
>> crawl output by itself does not give much useful information (or at
>> least I failed to locate such APIs).
>
> That's not the case - if anything, the amount of useful information
> you can retrieve has tremendously increased. Please see all the tools
> available through the bin/nutch script, prefixed with read* - and then
> look at their implementation for inspiration.
>
>> I'll rephrase my concerns as concrete questions:
>>
>> 1) Is there a way (APIs) in the 0.8.x/0.9 releases of Nutch to access
>> information about the crawled data, such as: get all pages (contents)
>> given a
>
> Fetched pages are stored in segments. Please see the SegmentReader
> tool, which allows you to retrieve the segment content.
>
>> URL/md5, get outgoing links from a URL, and get all incoming links
>> to a
>
> SegmentReader as above. For incoming links use the linkdb, and
> LinkDbReader.
>
>> URL (this last API is provided; I mention it for the sake of
>> completeness). Or an easy way for me to improvise these APIs.
>>
>> 2) If the answer to 1 is NO, are there any plans to add this
>> functionality back in the forthcoming releases?
>>
>> 3) If the answer to both 1 and 2 is NO, can someone point me to the
>> discussions that explain the rationale behind these interface
>> changes, which (in my opinion) leave the crawler module slightly
>> weakened? (I tried scanning the forum posts back to the 0.7.2 era but
>> failed to locate any such discussion.)
>
> Please see above. The answer is yes. ;)
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
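For the crawldb, my (untested) sketch of walking every entry with the
plain Hadoop readers looks like this. The path below assumes a single
reduce task (crawl/crawldb/current/part-00000) and a Text key class -
older 0.8 builds may use UTF8 keys instead, so please treat both as
assumptions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.CrawlDatum;
    import org.apache.nutch.util.NutchConfiguration;

    public class CrawlDbWalk {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // Each part-NNNNN is a MapFile directory; its "data" file is a
        // SequenceFile of (url, CrawlDatum) pairs. With more reduce
        // tasks there are more parts to walk.
        Path data = new Path("crawl/crawldb/current/part-00000/data");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, data, conf);
        Text url = new Text();            // assumption: UTF8 on older builds
        CrawlDatum datum = new CrawlDatum();
        while (reader.next(url, datum)) { // one (url, status/metadata) pair
          System.out.println(url + "\t" + datum);
        }
        reader.close();
      }
    }

From a quick look at CrawlDbReader, "bin/nutch readdb" wraps the same
idea (-stats, -dump and friends), so for most purposes the tool alone
should be enough.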
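For incoming links, my reading of LinkDbReader is the following - the
constructor and key class seem to have changed between releases, so the
signatures here are from my checkout and may need adjusting (the
crawl/linkdb path and example URL are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.crawl.Inlink;
    import org.apache.nutch.crawl.Inlinks;
    import org.apache.nutch.crawl.LinkDbReader;
    import org.apache.nutch.util.NutchConfiguration;
    import java.util.Iterator;

    public class WhoLinksHere {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        // Constructor arguments are my assumption for 0.8.x/0.9.
        LinkDbReader linkDb = new LinkDbReader(
            FileSystem.get(conf), new Path("crawl/linkdb"), conf);
        Inlinks inlinks = linkDb.getInlinks(new Text("http://example.com/"));
        if (inlinks != null) {
          for (Iterator it = inlinks.iterator(); it.hasNext();) {
            Inlink in = (Inlink) it.next();
            // Source URL and anchor text of each incoming link.
            System.out.println(in.getFromUrl() + "\t" + in.getAnchor());
          }
        }
      }
    }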
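And for getting page contents given a URL (question 1): besides dumping
a whole segment with SegmentReader via bin/nutch, a per-URL lookup seems
to reduce to a MapFile.get() against the right segment subdirectory. The
segment name and URL below are placeholders, and parse_text is just one
of the subdirectories - content and parse_data should work the same way,
with ParseData.getOutlinks() giving the outgoing links:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.io.MapFile;
    import org.apache.hadoop.io.Text;
    import org.apache.nutch.parse.ParseText;
    import org.apache.nutch.util.NutchConfiguration;

    public class PageText {
      public static void main(String[] args) throws Exception {
        Configuration conf = NutchConfiguration.create();
        FileSystem fs = FileSystem.get(conf);
        // parse_text/part-00000 is a MapFile of (url, ParseText) entries.
        MapFile.Reader reader = new MapFile.Reader(fs,
            "crawl/segments/20070401102030/parse_text/part-00000", conf);
        ParseText text = new ParseText();
        // get() fills in "text" and returns it, or returns null on a miss.
        if (reader.get(new Text("http://example.com/"), text) != null) {
          System.out.println(text.getText());
        }
        reader.close();
      }
    }

Does that match what SegmentReader does internally, or is there a better
entry point for single-URL lookups?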
