Hi, I'm still on version 0.7, but will soon be trying 0.8. I'm wondering about the following in 0.8:
- What's the best way to get to the anchor data after fetching is done? (Or is it better to collect anchor text during the parsing phase?) Is LinkDbReader the best way to get to this? (Rough sketch of what I have in mind below my signature.)
- Similarly, is LinkDbReader the best way to get to the link graph, or is there a better way?
- Is parallel fetching of different sets of links possible now that Nutch uses Hadoop and is distributed?
- If so, how does one merge the resulting fetch data? With CrawlDbMerger.main()?
- Also, is there a way to mark the data from each fetch, so one can see which fetch the data came from? Would this be the right way to do it?
  - put some "fetch id" in the CrawlDatum during the fetch
  - write a custom CrawlDbMerger that is "fetch id"-aware and writes that id into the final, merged CrawlDb

Thanks,
Otis
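
P.S. For the first question, here is a rough sketch of the kind of anchor
lookup I'm hoping LinkDbReader supports in 0.8. I haven't checked this
against the 0.8 source, so the constructor and method signatures used below
(getInlinks(), getFromUrl(), getAnchor()) are assumptions on my part, and the
class name AnchorDump is just made up for the example:

import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.UTF8;
import org.apache.nutch.crawl.Inlink;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.crawl.LinkDbReader;
import org.apache.nutch.util.NutchConfiguration;

public class AnchorDump {
  // Usage: AnchorDump <linkdb dir> <url>
  public static void main(String[] args) throws Exception {
    // Open the linkdb directory.
    LinkDbReader reader =
        new LinkDbReader(NutchConfiguration.create(), new Path(args[0]));
    // Look up every inlink (source URL plus anchor text) pointing at the URL.
    Inlinks inlinks = reader.getInlinks(new UTF8(args[1]));
    if (inlinks == null) {
      System.out.println("no inlinks for " + args[1]);
      return;
    }
    for (Iterator it = inlinks.iterator(); it.hasNext();) {
      Inlink in = (Inlink) it.next();
      // One line per inlink: where the link came from, and its anchor text.
      System.out.println(in.getFromUrl() + "\t" + in.getAnchor());
    }
  }
}

If something like getInlinks() is available, that answers the anchor question
for me; otherwise I'd fall back to collecting anchor text at parse time.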
