I'm wondering why we don't have the option to normalize URLs when we merge segments. It should work the same way it does for mergedb and mergelinkdb.
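For reference, this is roughly how the existing merge tools expose normalization on the command line (a minimal sketch; the crawldb/linkdb paths are made up for the example):

  # Merge two crawldbs, running the configured URL normalizers on every entry
  bin/nutch mergedb crawl/crawldb_merged crawl/crawldb1 crawl/crawldb2 -normalize

  # Same idea for the linkdb
  bin/nutch mergelinkdb crawl/linkdb_merged crawl/linkdb1 crawl/linkdb2 -normalize

The request is simply that the segment merger accept the same kind of -normalize switch.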
For instance, let's say I have two crawled URLs:

  http://auto.yahoo.com/index.php?auto=BMW&sort=desc
  http://auto.yahoo.com/index.php?auto=BMW

The page content is the same; only the display differs because of the sort parameter, so there is no need to index the page twice. I therefore normalize the URLs to remove the extra parameter (sort=) and reduce my duplicate content, i.e. http://auto.yahoo.com/index.php?auto=BMW&sort=desc becomes http://auto.yahoo.com/index.php?auto=BMW.

The non-normalized URL is already dropped when I merge my crawldb and my linkdb with normalization, so we should do the same on the segments. I don't see the point of keeping crawl_generate, parse_data, etc. entries for a URL which no longer exists in the crawldb. Maybe I am missing something in this case; please help me understand.
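For what it's worth, the normalization itself can be expressed with the urlnormalizer-regex plugin. A minimal sketch of a conf/regex-normalize.xml rule that strips the sort parameter (the pattern only covers the case where sort= is not the first query parameter, which is enough for the example above):

  <regex-normalize>
    <!-- drop a "&sort=..." query parameter, so that
         http://auto.yahoo.com/index.php?auto=BMW&sort=desc
         normalizes to http://auto.yahoo.com/index.php?auto=BMW -->
    <regex>
      <pattern>&amp;sort=[^&amp;]*</pattern>
      <substitution></substitution>
    </regex>
  </regex-normalize>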
