A while ago I posted this on the dev list but got no reply. I wonder whether this is the right approach and whether I should continue building this feature. Do you think this idea would help Nutch, or is it a dead end that you've already discussed?
regards
Uros

Andrzej Bialecki wrote:
> Uroš Gruber wrote:
>> ParseData.metadata sounds nice, but I think I'm lost again :)
>> If I understand code flow the best place would be in Fetcher [262]
>>
>> but i'm not sure that datum holds info of url being fetched
>
> On the input to the fetcher you get a URL and a CrawlDatum (originally
> coming from the crawldb). Check for example how the segment name is
> passed around in metadata, you can use the same method.
>

Hi,

I made a draft patch, but I still see some problems. I know the code needs to be cleaned up and tested. Right now I don't know what number to assign to external URLs. For internal links it works great.

The whole idea of these changes: injected URLs always get hop 0. During fetching/updating/generating, the hop value is incremented by 1 (I still have no idea what to do with external links). Then I can add a config value such as max_hop to keep the fetcher and generator from creating more URLs beyond that hop count. This way it's possible to limit crawling vertically.

Comments are welcome.

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
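
To make the hop idea above concrete, here is a rough stand-alone sketch in plain Java. It is not the draft patch and does not use Nutch's own classes: the class name HopCounter, the metadata key "hop", and all method names are made up for illustration, and a plain string map stands in for the metadata carried with each URL. It assumes every outlink is treated the same way, so the open question about external links is not answered here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone model of the hop-count idea; not Nutch code.
class HopCounter {
    // Assumed metadata key under which the hop count travels with a URL.
    static final String HOP_KEY = "hop";

    // Injected (seed) URLs always start at hop 0.
    static Map<String, String> seedMeta() {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, "0");
        return meta;
    }

    // Read the hop count from metadata; a missing value is treated as 0.
    static int hopOf(Map<String, String> meta) {
        String value = meta.get(HOP_KEY);
        return value == null ? 0 : Integer.parseInt(value);
    }

    // Metadata for a newly discovered outlink: the parent's hop plus 1.
    // Assumption: internal and external links are handled identically.
    static Map<String, String> outlinkMeta(Map<String, String> parentMeta) {
        Map<String, String> meta = new HashMap<>();
        meta.put(HOP_KEY, Integer.toString(hopOf(parentMeta) + 1));
        return meta;
    }

    // Check a generator or fetcher could apply: keep a URL only while
    // its hop count does not exceed the configured max_hop value.
    static boolean withinLimit(Map<String, String> meta, int maxHop) {
        return hopOf(meta) <= maxHop;
    }
}
```

With max_hop set to 1, for example, a seed (hop 0) and its direct outlinks (hop 1) would pass withinLimit, while links found on those outlink pages (hop 2) would be skipped.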
