Thanks Karl,

I'll be crawling a content management system with 100m+ files in it. I'll use the Alfresco connector for its xpath support, so I can crawl individual folders instead of all 100m files at once. As I've done with Alfresco, I'll test and document what I find and what I need to do to get this to work.

thanks,
mark

From: Karl Wright <[email protected]>
To: [email protected]; Mark Lugert <[email protected]>
Sent: Wednesday, December 12, 2012 4:42 AM
Subject: Re: Largest crawl

ManifoldCF scales based on how well the underlying database handles two kinds of queries - direct access to a row via an index, and reading from an index in ordered fashion. Both of these go up as log(n) assuming b-trees.

I have personally done web crawls on the order of 5 million actual content pages, using a much older version of PostgreSQL (8.3) than is currently available, which is of course not comparable to the numbers you are throwing about. I don't see any reason that you shouldn't attempt a larger crawl of, say, 100M, however, if you have the underlying database and sufficient disk storage for it. But bear in mind a couple of points.

(1) At 80 pages per second it will take you a long time to get there.
(2) You really don't want to be wasting time on excess calculations or refetches, so you want to use an expiration model for your documents, not a rescan, and you want to turn off hop-count filtering if at all possible.
(3) Try some smaller crawls first in order to get all of your exclusion and inclusion parameters right; you can't afford to go into bad domains or bad content types and then change your mind later.
(4) Use PostgreSQL; we're currently having trouble with MySQL and HSQLDB in this regard.
(5) Pay attention to the PostgreSQL tuning parameters in the "how-to-build-and-deploy" section.
(6) Plan for periodic "VACUUM FULL" operations in order to clean up PostgreSQL dead tuples and restore the database to full speed.
(7) Use SSDs if possible.

Thanks,
Karl

On Wed, Dec 12, 2012 at 1:53 AM, Mark Lugert <[email protected]> wrote:
> Anyone know what the largest crawl has been for manifold? 100 million,
> billion?
>
> thanks,
> mark
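
To put a rough number on point (1) in Karl's reply, the arithmetic can be sketched in a few lines of Python. The 80 pages per second rate and the 5M, 100M and 1 billion corpus sizes come from the thread; the function name is purely illustrative, and the sketch assumes a constant fetch rate with no refetches, throttling or downtime, so real crawls would likely take longer.

```python
SECONDS_PER_DAY = 86_400


def crawl_duration_days(total_docs: int, docs_per_second: float) -> float:
    """Days needed to fetch total_docs at a steady docs_per_second.

    Hypothetical helper: assumes a constant rate and a single pass.
    """
    if docs_per_second <= 0:
        raise ValueError("docs_per_second must be positive")
    return total_docs / docs_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    # Corpus sizes mentioned in the thread: 5M, 100M, 1 billion.
    for total in (5_000_000, 100_000_000, 1_000_000_000):
        days = crawl_duration_days(total, 80)
        print(f"{total:>13,} docs at 80/s: {days:7.1f} days")
```

Under these assumptions, 100M documents at 80 per second works out to about two weeks of continuous crawling, and a billion to roughly 145 days.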
