There is a significant downside to filtering and normalizing in the parse job: you lose the original information. But for anyone to whom that's not important (i.e. you don't change filters or normalizers often) and who burns a lot of CPU cycles on it, this would, I guess, be a very nice optional feature.
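For illustration, a minimal sketch of what cleaning each outlink once at parse time could look like, using the stock URLNormalizers and URLFilters plugin chains. The class name and wiring here are hypothetical, not the proposed patch:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.net.URLFilters;
    import org.apache.nutch.net.URLNormalizers;
    import org.apache.nutch.util.NutchConfiguration;

    // Hypothetical helper: normalize then filter an outlink exactly once,
    // so later jobs (crawldb/linkdb updates, WebGraph) could skip both steps.
    public class OutlinkCleaner {
      private final URLNormalizers normalizers;
      private final URLFilters filters;

      public OutlinkCleaner(Configuration conf) {
        // SCOPE_OUTLINK selects the normalizer chain configured for outlinks
        this.normalizers = new URLNormalizers(conf, URLNormalizers.SCOPE_OUTLINK);
        this.filters = new URLFilters(conf);
      }

      // Returns the cleaned URL, or null if it was rejected or malformed.
      public String clean(String url) {
        try {
          String normalized = normalizers.normalize(url, URLNormalizers.SCOPE_OUTLINK);
          return normalized == null ? null : filters.filter(normalized);
        } catch (Exception e) {
          // malformed or rejected URLs are simply dropped
          return null;
        }
      }

      public static void main(String[] args) {
        OutlinkCleaner cleaner = new OutlinkCleaner(NutchConfiguration.create());
        System.out.println(cleaner.clean("HTTP://Example.COM/a/../b.html"));
      }
    }

The crawldb and linkdb update jobs could then skip their own filter/normalize passes for segments parsed this way, which is where the CPU savings would come from.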
On Thursday 14 July 2011 18:21:47 lewis john mcgibbney wrote:
> This is quite true Markus. This had actually occurred to me whilst I was
> updating the command line options. Initially I was questioning why it
> would be necessary to pass -normalize arguments when trying to merge the
> crawldb or segments. It would also provide more value when trying to
> create the linkdb, as it is an easy mistake to forget to pass the various
> arguments when doing it manually. Inevitably it would lead to the
> duplication of code over some classes.
>
> On Thu, Jul 14, 2011 at 4:37 PM, Markus Jelsma
> <[email protected]> wrote:
> > Hi,
> >
> > If we filter and normalize hyperlinks in the parse job, we wouldn't
> > have to filter and normalize during all other jobs (perhaps except the
> > injector). This would spare a lot of CPU time when updating the crawl
> > and link db. It would also, I think, help the WebGraph as it operates
> > on the segments' ParseData.
> >
> > Thoughts?
> >
> > Thanks,

--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350

