> To be honest, I am not. But reasoning about it: why would we filter and
> normalize everywhere when it's already done in parsing?
>

E.g. to use different filters at different stages, or simply because the
filtering of outlinks was added later on; who knows?

Anyway, a quick search in Eclipse for org.apache.nutch.net.URLFilters.filter()
shows that it is called in ParseOutputFormat when serializing the outlinks
(line 227).
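
The relevant part looks more or less like the sketch below (paraphrased, so
variable names and the exact error handling are approximate; note it also
normalizes, with the outlink scope):

    // for each outlink extracted by the parser (paraphrased sketch)
    String toUrl = links[i].getToUrl();
    try {
      toUrl = normalizers.normalize(toUrl, URLNormalizers.SCOPE_OUTLINK);
      toUrl = filters.filter(toUrl); // returns null when the URL is rejected
    } catch (Exception e) {
      toUrl = null;
    }
    if (toUrl == null) {
      continue; // rejected outlinks are dropped before serialization
    }

So the outlinks should already be filtered at parse time.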


> ... tested...
>
> I injected a .nl URL, generated and fetched. Then I modified the URL
> filter to deny everything, did a parse, and modified the filter again to
> allow .nl pages. I updated the db and it worked. Now I have two URLs.
>

That's not clear to me. Was there only one outlink in that seed? Did the
filtering work or not?
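
Just to make sure we mean the same thing: assuming you used the
RegexURLFilter, "deny everything" in regex-urlfilter.txt would be something
like

    # reject every URL (first matching rule wins)
    -.

and allowing .nl pages again something like (untested, adjust the regex to
your setup)

    # accept .nl hosts only, reject the rest
    +^http://([a-zA-Z0-9-]+\.)*[a-zA-Z0-9-]+\.nl/
    -.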



>
> More thoughts? :)
>
> On Thursday 14 July 2011 18:31:07 Julien Nioche wrote:
> > Are you sure we don't already filter and normalize at the end of the
> > parse? (not in front of the code - sorry, can't check)
> >
> > On 14 July 2011 16:37, Markus Jelsma <[email protected]> wrote:
> > > Hi,
> > >
> > > If we filter and normalize hyperlinks in the parse job, we wouldn't
> > > have to filter and normalize during all other jobs (except perhaps
> > > the injector). This would spare a lot of CPU time when updating the
> > > crawl and link dbs. It would also, I think, help the WebGraph as it
> > > operates on the segments' ParseData.
> > >
> > > Thoughts?
> > >
> > > Thanks,
>
> --
> Markus Jelsma - CTO - Openindex
> http://www.linkedin.com/in/markus17
> 050-8536620 / 06-50258350
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
