Just a try for :

You have a billion urls, where each has a huge page. How to you detect the
duplicate documents?

1. Compute a simple hash for each of the document.
2. And for those docs where the hash matches ,
3.  use a different hash function and go back to step 1..

you might want to stop when u after a certain similarity.

On Thu, May 5, 2011 at 7:33 PM, sourabh jakhar <[email protected]>wrote:

> You have a billion urls, where each has a huge page. How to you detect the
> duplicate documents?
>
> --
> SOURABH JAKHAR,(CSE)(3 year)
> ROOM NO 167 ,
> TILAK,HOSTEL
> 'MNNIT ALLAHABAD
>
>  The Law of Win says, "Let's not do it your way or my way; let's do it the
> best way."
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Algorithm Geeks" group.
> To post to this group, send email to [email protected].
> To unsubscribe from this group, send email to
> [email protected].
> For more options, visit this group at
> http://groups.google.com/group/algogeeks?hl=en.
>



-- 
regards,
chinna.

-- 
You received this message because you are subscribed to the Google Groups 
"Algorithm Geeks" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to 
[email protected].
For more options, visit this group at 
http://groups.google.com/group/algogeeks?hl=en.

Reply via email to