Just a try for: You have a billion URLs, where each has a huge page. How do you detect the duplicate documents?
1. Compute a simple hash for each document.
2. Group together the documents whose hashes match.
3. Within each group, use a different hash function and go back to step 1.

You might want to stop once the documents in a group have matched on enough hashes to be treated as duplicates.

On Thu, May 5, 2011 at 7:33 PM, sourabh jakhar <[email protected]> wrote:
> You have a billion urls, where each has a huge page. How to you detect the
> duplicate documents?
>
> --
> SOURABH JAKHAR,(CSE)(3 year)
> ROOM NO 167 ,
> TILAK,HOSTEL
> 'MNNIT ALLAHABAD
>
> The Law of Win says, "Let's not do it your way or my way; let's do it the
> best way."

--
regards,
chinna.

--
You received this message because you are subscribed to the Google Groups "Algorithm Geeks" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to [email protected].
For more options, visit this group at http://groups.google.com/group/algogeeks?hl=en.
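A rough sketch of the hash-and-refine steps from the reply, in Python. The function names (find_duplicates, crc32_hash, sha256_hash), the choice of CRC32 as the "simple" hash and SHA-256 as the "different" one, and the in-memory dict of pages are all assumptions made for illustration. A billion pages would have to be hashed in a streaming, distributed way rather than held in one dict.

```python
import hashlib
import zlib
from collections import defaultdict


def crc32_hash(data: bytes) -> str:
    """Cheap first-pass hash (assumed choice for the 'simple hash')."""
    return format(zlib.crc32(data), "08x")


def sha256_hash(data: bytes) -> str:
    """Stronger second-pass hash (assumed choice for the 'different hash')."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(pages, hash_functions=(crc32_hash, sha256_hash)):
    """Return groups of URLs whose pages agree under every hash function.

    pages maps URL -> page content as bytes. Each round re-hashes only the
    URLs that collided in the previous round, so the more expensive hashes
    run on fewer documents.
    """
    groups = [list(pages)]  # start with a single group holding every URL
    for hash_fn in hash_functions:
        refined = []
        for group in groups:
            buckets = defaultdict(list)
            for url in group:
                buckets[hash_fn(pages[url])].append(url)
            # Only buckets with two or more URLs are still duplicate candidates.
            refined.extend(b for b in buckets.values() if len(b) > 1)
        groups = refined
    return groups


if __name__ == "__main__":
    sample = {
        "http://a.example/1": b"same content",
        "http://b.example/2": b"same content",
        "http://c.example/3": b"different content",
    }
    print(find_duplicates(sample))
```

Note that this only finds byte-identical pages. The number of hash functions passed in plays the role of the stopping point: each extra round that a group survives makes an accidental collision less likely.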
