On Friday 15 July 2011 11:01:09 Julien Nioche wrote:
> > To be honest, I am not. But when reasoning: why would we filter and
> > normalize everywhere when it's already done in parsing?
>
> e.g. to use different filters at different stages, or simply because the
> filtering of outlinks was added afterwards, who knows?
You're right. The test was actually completely invalid, as the injected URL
turned out to be a 302!

> Anyway, a quick search in Eclipse on
> org.apache.nutch.net.URLFilters.filter() shows that it is called in
> ParseOutputFormat when serializing the outlinks (line 227).
>
> > ... tested..
> >
> > I injected a .nl URL, generated and fetched. Then I modified the
> > urlfilter to deny everything, did a parse, and modified the filter
> > again to allow .nl pages. I updated the db and it worked. Now I have
> > two URLs.
>
> Not clear. Was there only one outlink in that seed? Did the filtering
> work or not?
>
> > More thoughts? :)
> >
> > On Thursday 14 July 2011 18:31:07 Julien Nioche wrote:
> > > Are you sure we don't already filter and normalize at the end of the
> > > parse? (not in front of code - sorry, can't check)
> > >
> > > On 14 July 2011 16:37, Markus Jelsma <[email protected]> wrote:
> > > > Hi,
> > > >
> > > > If we filter and normalize hyperlinks in the parse job, we wouldn't
> > > > have to filter and normalize during all other jobs (perhaps except
> > > > the injector). This would spare a lot of CPU time when updating the
> > > > crawl and link db. It would also, I think, help the WebGraph, as it
> > > > operates on the segments' ParseData.
> > > >
> > > > Thoughts?
> > > >
> > > > Thanks,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350
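
PS: for anyone reading along without the source at hand, the call Julien
found boils down to a loop like the following. A minimal sketch, assuming
the 1.x API: the classes (URLFilters, URLNormalizers, Outlink) are the real
Nutch ones, but the helper itself is illustrative, not the actual code at
line 227.

  import java.util.ArrayList;
  import java.util.List;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilters;
  import org.apache.nutch.net.URLNormalizers;
  import org.apache.nutch.parse.Outlink;

  /** Illustrative sketch: normalize and filter outlinks the way
   *  ParseOutputFormat does before serializing them. */
  public class OutlinkFilterSketch {

    public static Outlink[] filterOutlinks(Configuration conf,
        Outlink[] outlinks) {
      URLFilters filters = new URLFilters(conf);
      URLNormalizers normalizers =
          new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);

      List<Outlink> kept = new ArrayList<Outlink>();
      for (Outlink outlink : outlinks) {
        try {
          // normalize first, then filter; filter() returns null to reject
          String toUrl = normalizers.normalize(outlink.getToUrl(),
              URLNormalizers.SCOPE_OUTLINK);
          if (toUrl != null) {
            toUrl = filters.filter(toUrl);
          }
          if (toUrl != null) {
            kept.add(new Outlink(toUrl, outlink.getAnchor()));
          }
        } catch (Exception e) {
          // a malformed URL or a filter error simply drops the outlink
        }
      }
      return kept.toArray(new Outlink[kept.size()]);
    }
  }

If this is indeed done once at parse time, it is exactly what would let the
crawldb and linkdb update jobs skip their own filter/normalize passes, as
proposed above.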
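
PPS: the deny-everything step in the test is a one-liner with the default
regex-urlfilter plugin. Sketches only; the .nl pattern below is my
assumption, not the file that was actually used. Between fetch and parse:

  # regex-urlfilter.txt: reject every URL
  -.

and restored afterwards to something like:

  # accept .nl hosts, reject everything else
  +^https?://([a-z0-9-]+\.)*[a-z0-9-]+\.nl
  -.

Rules are applied top to bottom and the first match wins, which is why the
catch-all -. has to come last.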

