Ron van der Vegt created NUTCH-2219:
---------------------------------------

             Summary: Dedup script, allow users to change the order in which 
main documents are selected.
                 Key: NUTCH-2219
                 URL: https://issues.apache.org/jira/browse/NUTCH-2219
             Project: Nutch
          Issue Type: New Feature
          Components: crawldb
            Reporter: Ron van der Vegt


Current implementation:

"This command takes a path to a crawldb as parameter and finds duplicates based 
on the signature. If several entries share the same signature, the one with the 
highest score is kept. If the scores are the same, then the fetch time is used 
to determine which one to keep with the most recent one being kept. If their 
fetch times are the same we keep the one with the shortest URL."

The order in which these criteria are applied to select the main document is 
currently fixed. I would therefore like to propose an option that lets users 
change it:
-compareOrder <score>,<fetchTime>,<url>
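
To illustrate the idea, here is a minimal sketch of how such an ordering could 
be turned into a comparator. It is not the attached patch and not Nutch's 
actual code: the class CompareOrderSketch, the Entry type, and the methods 
parseOrder and pickKept are hypothetical names made up for this example, and 
the criterion names are assumed to be the literal tokens score, fetchTime and 
url taken from the option shown above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class CompareOrderSketch {

  /** Hypothetical stand-in for a crawldb entry; this is not Nutch's CrawlDatum. */
  static final class Entry {
    final String url;
    final float score;
    final long fetchTime;

    Entry(String url, float score, long fetchTime) {
      this.url = url;
      this.score = score;
      this.fetchTime = fetchTime;
    }
  }

  /**
   * Builds a comparator from a comma-separated list of criteria.
   * The preferred entry (the one to keep) sorts first.
   */
  static Comparator<Entry> parseOrder(String order) {
    Comparator<Entry> result = null;
    for (String key : order.split(",")) {
      Comparator<Entry> next;
      switch (key.trim()) {
        case "score": // higher score is preferred
          next = Comparator.comparingDouble((Entry e) -> e.score).reversed();
          break;
        case "fetchTime": // more recent fetch is preferred
          next = Comparator.comparingLong((Entry e) -> e.fetchTime).reversed();
          break;
        case "url": // shorter URL is preferred
          next = Comparator.comparingInt((Entry e) -> e.url.length());
          break;
        default:
          throw new IllegalArgumentException("Unknown criterion: " + key);
      }
      result = (result == null) ? next : result.thenComparing(next);
    }
    if (result == null) {
      throw new IllegalArgumentException("Empty compare order");
    }
    return result;
  }

  /** Returns the entry to keep among duplicates that share one signature. */
  static Entry pickKept(List<Entry> duplicates, String order) {
    return Collections.min(duplicates, parseOrder(order));
  }

  public static void main(String[] args) {
    List<Entry> duplicates = Arrays.asList(
        new Entry("http://example.com/a/long/path", 1.0f, 200L),
        new Entry("http://example.com/a", 1.0f, 100L),
        new Entry("http://example.com/b", 0.5f, 300L));

    // Same order as the current, fixed behaviour described above.
    System.out.println(pickKept(duplicates, "score,fetchTime,url").url);
    // URL length first, then score, then fetch time.
    System.out.println(pickKept(duplicates, "url,score,fetchTime").url);
  }
}
```

In this sketch, passing score,fetchTime,url reproduces the behaviour quoted 
above, while any other permutation only changes which criterion breaks ties 
first. The rule for each criterion (highest score, most recent fetch, shortest 
URL) stays the same.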

I have written a patch against trunk (rev 1730516). I'm looking forward to any 
peer review.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
