Hi, I'm still on version 0.7, but will soon be trying 0.8. I'm wondering about the following in 0.8:
- What's the best way to get to the anchor data after fetching is done? (Or is it better to collect anchor text during the parsing phase?) Is LinkDbReader the best way to get to this? (Rough sketch of what I have in mind below my signature.)
- Similarly, is LinkDbReader the best way to get to the link graph, or is there a better way?
- Is parallel fetching of different sets of links possible now that Nutch uses Hadoop and is distributed?
- If so, how does one merge the resulting fetch data? With CrawlDbMerger.main()?
- Also, is there a way to mark the data from each fetch, so one can see which fetch the data came from? Would this be the right way to do it?
  - put some "fetch id" in the CrawlDatum during the fetch
  - write a custom CrawlDbMerger that is "fetch id"-aware and writes that id into the final, merged CrawlDb

Thanks,
Otis
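
P.S. For the first question, here is a rough sketch of the kind of anchor
lookup I'm hoping LinkDbReader supports in 0.8. I haven't checked this
against the 0.8 source, so the constructor and method signatures used below
(getInlinks(), getFromUrl(), getAnchor()) are assumptions on my part, and the
class name AnchorDump is just made up for the example:

import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class AnchorDump {
  // Usage: AnchorDump <linkdb dir> <url>
  public static void main(String[] args) throws Exception {
    // Open the linkdb directory.
    LinkDbReader reader =
        new LinkDbReader(NutchConfiguration.create(), new Path(args[0]));
    // Look up every inlink (source URL plus anchor text) pointing at the URL.
    Inlinks inlinks = reader.getInlinks(new UTF8(args[1]));
    if (inlinks == null) {
      System.out.println("no inlinks for " + args[1]);
      return;
    }
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      Inlink in = (Inlink) it.next();
      // One line per inlink: where the link came from, and its anchor text.
      System.out.println(in.getFromUrl() + "\t" + in.getAnchor());
    }
  }
}

If something like getInlinks() is available, that answers the anchor question
for me; otherwise I'd fall back to collecting anchor text at parse time.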
