Re: Largest crawl

Mark Lugert Wed, 12 Dec 2012 08:55:58 -0800

Thanks Karl,
 
I'll be crawling a content management system with 100m+ files in it.  I'll use 
the Alfresco connector for xpath support so I can crawl individual folders 
instead of the full 100m at a time. 
 
As I've done with Alfresco, I'll test and document what I find and what I need 
to do in order to get this to work.
 
thanks,
mark

From: Karl Wright <[email protected]>
To: [email protected]; Mark Lugert <[email protected]> 
Sent: Wednesday, December 12, 2012 4:42 AM
Subject: Re: Largest crawl

ManifoldCF scales based on how well the underlying database handles
two kinds of queries - direct access to a row via an index, and
reading from an index in ordered fashion.  The cost of both grows as
log(n), where n is the number of rows, assuming b-tree indexes.
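
To make the log(n) point concrete, here is a rough sketch that counts
how many b-tree levels are needed to cover a given number of rows. The
fanout of 200 keys per page and the function name are assumptions for
illustration only; real PostgreSQL fanout depends on key size and page
fill.

```python
def btree_depth(rows: int, fanout: int) -> int:
    """Return the number of b-tree levels needed to address `rows` keys.

    Idealized model: every page holds exactly `fanout` keys (an assumption).
    """
    if rows < 1 or fanout < 2:
        raise ValueError("rows must be >= 1 and fanout must be >= 2")
    depth = 0
    capacity = 1
    # Each extra level multiplies the number of addressable keys by the fanout.
    while capacity < rows:
        capacity *= fanout
        depth += 1
    return max(depth, 1)


if __name__ == "__main__":
    ASSUMED_FANOUT = 200  # hypothetical keys per index page
    for rows in (5_000_000, 100_000_000):
        print(rows, "rows ->", btree_depth(rows, ASSUMED_FANOUT), "levels")
```

Under that assumed fanout, 5 million rows need 3 levels and 100 million
rows need 4, so a 20x larger table costs roughly one more page read per
index lookup.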

I have personally done web crawls on the order of 5 million actual
content pages, using a much older version of PostgreSQL (8.3) than is
currently available, which is of course not comparable to the numbers
you are throwing about.  I don't see any reason that you shouldn't
attempt a larger crawl of, say, 100M, however, if you have the
underlying database and sufficient disk storage for it.  But bear in
mind a couple of points.

(1) At 80 pages per second it will take you a long time to get there
(2) You really don't want to be wasting time on excess calculations or
refetches, so you want to use an expiration model for your documents,
not a rescan, and you want to turn off hop-count filtering if at all
possible
(3) Try some smaller crawls first in order to get all of your
exclusion and inclusion parameters right; you can't afford to go into
bad domains or bad content types and then change your mind later
(4) Use PostgreSQL; we're currently having trouble with MySQL and
HSQLDB in this regard
(5) Pay attention to the PostgreSQL tuning parameters in the
"how-to-build-and-deploy" section
(6) Plan for periodic "VACUUM FULL" operations in order to clean up
PostgreSQL dead tuples and restore the database to full speed
(7) Use SSDs if possible
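
As a back-of-the-envelope check on point (1), the sketch below estimates
wall-clock crawl time from a document count and a fetch rate. It assumes
the rate is sustained without pauses, restarts, or refetches, which a
real crawl will not achieve; the function name is made up for this
example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def crawl_days(documents: int, docs_per_second: float) -> float:
    """Estimate crawl duration in days at a constant fetch rate.

    Assumes a steady rate with no downtime, so treat it as a lower bound.
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return documents / docs_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # 100M documents at the 80 pages per second mentioned in point (1).
    print(f"{crawl_days(100_000_000, 80):.1f} days")
```

At 80 pages per second, 100M documents take about 14.5 days of
uninterrupted crawling, before any time lost to VACUUM FULL runs or
mistakes in the inclusion and exclusion rules.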

Thanks,
Karl

On Wed, Dec 12, 2012 at 1:53 AM, Mark Lugert <[email protected]> wrote:
> Anyone know what the largest crawl has been for manifold?  100 million, 
> billion?
>
> thanks,
> mark
