Hi Jagdeep,

On Wed, 2004-09-15 at 21:39, Sandhu, Jagdeep wrote:
> Greetings,
>
> Another issue that I see with the WebDB is the fact that Pages and Links
> are maintained by URLs and MD5 hashes. In my crawl of 64 million
> travel-related pages, I have not seen a single example of a page
> duplicate (copy). I am surprised by this fact. Most of the open source
> software that I download is mirrored at multiple sites. However, in the
> travel domain, it seems there are close copies but not exact copies.
> Therefore I have to use shingling to identify duplicates. The MD5 hashes
> would not do the trick. I was wondering if other people are experiencing
> the same issue with simple hash-code duplicate detection?
>
Yeah, this is something I've thought about from time to time. (Consider a
page where only the date changes; a whole-page MD5 won't catch that.) One
idea is to divide the page into, say, 32 regions and compute an MD5 for
each. If 31 of the 32 match, you declare it a duplicate. There's actually
been some published research in this area, but I'm not very familiar with
it. (There's a rough sketch of what I mean at the bottom of this mail.)

> Maintaining the WebDB sorted by URLs and MD5 hashes makes updates to the
> WebDB complex. Mike and Doug have done an excellent job in implementing
> this complex update mechanism - thank you! As the original Google paper
> states, it is absolutely essential to avoid disk seeks and perform batch
> updates to the repository. Thus, for substantial crawls, a relational
> database would not work - right? Is there any update on the relational
> or object databases that Byron and others were experimenting with? The
> current design is a good one for large installations. Therefore I am
> thinking about simplifying this design by maintaining the WebDB just by
> URLs. But it is also important to identify exact duplicates. Hence, on
> fetching a new page, I would compute its MD5 hash. Then I would use a
> big multilevel hash table distributed over 4 (or more) nodes to see if
> the page is a duplicate. <I need to think some more about this problem -
> perhaps something like MPI parallelization might work.> For the purpose
> of computing the authority of the page, I would convert all links to
> point to the URL of the page existing in the WebDB.
>
You are right re: relational dbs. I'm not optimistic that such a scheme
could work for any but the smallest datasets. A large amount of the
complication in computing the webdb comes from the assumption that we have
far more items than can fit in memory. That's why we sort all items on
disk; that way, we can read two streams of data off the disk in
start-to-end order and merge or intersect them easily (tiny illustration
below). Perhaps if you have a titanic hash table, you could modify it in
place and get some real speedups, but we've always assumed one would not
have this. Getting rid of MD5s in the WebDB computation will be
challenging. Also keep in mind that MD5s are the "source" of URLs in the
link db.

> In principle I agree with Stephan's recommendation of a plug-in
> architecture. The question is how much of a performance penalty we can
> afford in centralized WebDB updates if we go with a plug-in
> architecture. Plugins for content fetching and parsing are a great idea,
> since these components are already distributed.
>
The worrisome word here is "centralized", not necessarily "plug-in". We
really need to make the WebDB computation distributed. There's code to do
this, but it has not been tested. (OK, it's my project, so I shouldn't be
using the passive voice there ;)

> I owe Mike and Doug feedback on Nutch Distributed File System testing.
> So far I have just purchased 4 x Maxtor 250 GB Ultra ATA 133 drives from
> Costco and a Promise RAID 0 PCI card from Fry's. For less than $800 I
> have 1 TB of disk storage - is there a way to do better? Please advise ;)
> The current NDFS$DataNode class declares a capacity of 60 GB. Should I
> submit a trivial patch to read the capacity as a NutchConf property over
> the weekend?
>
I've thought for the last 6-9 months that $1 a gig is a good price. Of
course, if I've thought it for 9 months, then I'm probably out of date.
Other people will know better than I will, but it sounds like you got a
pretty good price.
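A few quick, throwaway sketches before I sign off, since this stuff is
easier to discuss as code than as prose.

First, the region-hash idea from the top of this mail. This is untested
and the class and method names are made up -- nothing like this exists in
the tree -- but it's the shape of what I mean:

  import java.security.MessageDigest;
  import java.security.NoSuchAlgorithmException;
  import java.util.Arrays;

  // Hypothetical helper: per-region MD5s for near-duplicate detection.
  public class SegmentedMD5 {

    public static final int NUM_REGIONS = 32;

    // Compute one MD5 per fixed-size slice of the page content.
    public static byte[][] regionHashes(byte[] content)
      throws NoSuchAlgorithmException {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      byte[][] hashes = new byte[NUM_REGIONS][];
      int regionLen = Math.max(1, content.length / NUM_REGIONS);
      for (int i = 0; i < NUM_REGIONS; i++) {
        int start = i * regionLen;
        int end = (i == NUM_REGIONS - 1)
          ? content.length
          : Math.min(content.length, start + regionLen);
        md5.reset();
        if (start < content.length) {
          md5.update(content, start, end - start);
        }
        hashes[i] = md5.digest();
      }
      return hashes;
    }

    // Declare two pages near-duplicates if enough regions agree.
    public static boolean nearDuplicate(byte[][] a, byte[][] b,
                                        int minMatches) {
      int matches = 0;
      for (int i = 0; i < NUM_REGIONS; i++) {
        if (Arrays.equals(a[i], b[i])) matches++;
      }
      return matches >= minMatches;   // e.g. 31 of 32
    }
  }

The obvious weakness of fixed-size regions is that a small insertion near
the top of a page shifts every region after it, so all the later hashes
change at once. Shingling, which you mention, is more robust to that, at
the cost of more work per page.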
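Second, the sorted-streams point. The reason everything in the webdb is
kept in sorted order is that two sorted inputs can be merged or
intersected with nothing but sequential reads. This isn't the actual webdb
code -- just the shape of the two-pointer loop, over toy one-key-per-line
text files:

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;

  // Intersect two key-sorted files using purely sequential reads.
  public class SortedJoin {

    public static void intersect(String fileA, String fileB)
      throws IOException {
      BufferedReader a = new BufferedReader(new FileReader(fileA));
      BufferedReader b = new BufferedReader(new FileReader(fileB));
      try {
        String ka = a.readLine();
        String kb = b.readLine();
        while (ka != null && kb != null) {
          int cmp = ka.compareTo(kb);
          if (cmp < 0) {
            ka = a.readLine();           // advance the smaller side
          } else if (cmp > 0) {
            kb = b.readLine();
          } else {
            System.out.println(ka);      // key present in both streams
            ka = a.readLine();
            kb = b.readLine();
          }
        }
      } finally {
        a.close();
        b.close();
      }
    }
  }

The same loop does a merge rather than an intersect if you also emit the
non-matching keys; either way the disk never has to seek, which is the
whole point of paying for the sort up front.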
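Finally, the DataNode capacity. Yes, it should come from NutchConf rather
than a hard-coded constant. Something like the following is the rough
shape -- the property name here is just made up for illustration, and I'm
assuming the usual getInt(name, default) accessor, so don't treat this as
what will actually land:

  import net.nutch.util.NutchConf;

  // Sketch only: read the DataNode capacity from configuration.
  public class DataNodeCapacity {

    // Expressed in megabytes so it fits in an int property; the default
    // matches the current hard-coded 60 GB.
    public static long getCapacityBytes() {
      int megs = NutchConf.get().getInt("ndfs.datanode.capacity.mb",
                                        60 * 1024);
      return ((long) megs) * 1024L * 1024L;
    }
  }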
I believe I have converted the capacity to a NutchConf value in an update
I'll submit soon.

Thanks,
--Mike

> Thanks in advance for all the great feedback. Cheers,
>
> --Jagdeep
