On Friday 15 July 2011 11:01:09 Julien Nioche wrote:
> > To be honest, I am not. But when reasoning: why would we filter and
> > normalize everywhere when it's already done in parsing?
>
> e.g. to use different filters at different stages, or simply because the
> filtering of outlinks was added afterwards, who knows?
You're right. The test was actually completely invalid, as the injected URL
turned out to be a 302!

> Anyway, a quick search in Eclipse on
> org.apache.nutch.net.URLFilters.filter() shows that it is called in
> ParseOutputFormat when serializing the outlinks (line 227).
>
> > ... tested..
> >
> > I injected a .nl URL, generated and fetched. Then I modified the
> > urlfilter to deny everything, did a parse, and modified the filter
> > again to allow .nl pages. I updated the db and it worked. Now I have
> > two URLs.
>
> Not clear. Was there only one outlink in that seed? Did the filtering
> work or not?
>
> > More thoughts? :)
> >
> > On Thursday 14 July 2011 18:31:07 Julien Nioche wrote:
> > > Are you sure we don't already filter and normalize at the end of the
> > > parse? (not in front of code - sorry, can't check)
> > >
> > > On 14 July 2011 16:37, Markus Jelsma <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > If we filter and normalize hyperlinks in the parse job, we wouldn't
> > > > have to filter and normalize during all other jobs (perhaps except
> > > > the injector). This would spare a lot of CPU time when updating the
> > > > crawl and link db. It would also, I think, help the WebGraph, as it
> > > > operates on the segments' ParseData.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
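
PS: for anyone reading along without the source at hand, the call Julien
found boils down to a loop like the following. A minimal sketch, assuming
the 1.x API: the classes (URLFilters, URLNormalizers, Outlink) are the real
Nutch ones, but the helper itself is illustrative, not the actual code at
line 227.

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilters;
  import org.apache.nutch.net.URLNormalizers;
  import org.apache.nutch.parse.Outlink;

  /** Illustrative sketch: normalize and filter outlinks the way
   *  ParseOutputFormat does before serializing them. */
  public class OutlinkFilterSketch {

    public static Outlink[] filterOutlinks(Configuration conf,
        Outlink[] outlinks) {
      URLFilters filters = new URLFilters(conf);
      URLNormalizers normalizers =
          new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);

      List<Outlink> kept = new ArrayList<Outlink>();
      for (Outlink outlink : outlinks) {
        try {
          // normalize first, then filter; filter() returns null to reject
          String toUrl = normalizers.normalize(outlink.getToUrl(),
              URLNormalizers.SCOPE_OUTLINK);
          if (toUrl != null) {
            toUrl = filters.filter(toUrl);
          }
          if (toUrl != null) {
            kept.add(new Outlink(toUrl, outlink.getAnchor()));
          }
        } catch (Exception e) {
          // a malformed URL or a filter error simply drops the outlink
        }
      }
      return kept.toArray(new Outlink[kept.size()]);
    }
  }

If this is indeed done once at parse time, it is exactly what would let the
crawldb and linkdb update jobs skip their own filter/normalize passes, as
proposed above.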
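
PPS: the deny-everything step in the test is a one-liner with the default
regex-urlfilter plugin. Sketches only; the .nl pattern below is my
assumption, not the file that was actually used. Between fetch and parse:

  # regex-urlfilter.txt: reject every URL
  -.

and restored afterwards to something like:

  # accept .nl hosts, reject everything else
  +^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.nl
  -.

Rules are applied top to bottom and the first match wins, which is why the
catch-all -. has to come last.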

