Are you sure we don't already filter and normalize at the end of the parse? (Not in front of the code, sorry, can't check.)
On 14 July 2011 16:37, Markus Jelsma <[email protected]> wrote:

> Hi,
>
> If we filter and normalize hyperlinks in the parse job, we wouldn't have to
> filter and normalize during all other jobs (perhaps except injector). This
> would spare a lot of CPU time for updating crawl and link db. It would
> also, I think, help the WebGraph as it operates on segments' ParseData.
>
> Thoughts?
>
> Thanks,

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

