Here are a few random notes cut from private correspondence with Andrew Purtell on crawling. With Andrew's permission I'm forwarding them to the list. They might be of use to others trying to crawl into HBase:
"On a 4 node cluster ~190 [Hertitrix] TOE threads pull down ~20MB/sec while simultaneously a full major compaction on all tables (triggered from shell) is running. That's about 4.4M URLs/day over 3 Heritrix nodes even under high CPU, heap, and DFS loading. Cluster handles it smoothly. New experiment, better tuning. Very low vm.swappiness to tell Linux to sacrifice buffer pages instead of e.g. JVM heap pages. Jobs are scheduled and packed into engine instances based on heap use history over the past several hours, so the engine won't blow up. This config handles insert storms by allowing up to 20 flushes before the memstore flusher tries to slow things down. LZO compresson for speed. 1GB region split threshold to hold the number of regions down for a large write app. "URL/day capacity seems to scale roughly linear with number of crawler nodes. The network becomes the bottleneck on this testbed, not HBase. "Some detail to add is: 100 regionserver IPC handers and similar upping of ZK connection limits. Also, Heritrix job packing is constrained such that there's always < 100 TOE threads per host to stay under that limit. Jobs are for now always built with a fixed thread pool of 10. [Heritrix] Engines are running with -Xmx512 so usually can't handle more than 3 jobs concurrently anyway. "Also, I created a custom HBase writer based on [hbase-writer] plus the patch I contributed for dropping overly large documents in shouldWrite. This is a requirement because with chunked transfer encoding Heritrix's document length limit filter doesn't work. I still find that values > ~50MB create trouble. ~40-50MB is about the line on my testbed. It used to be ~20MB. I run with 10MB. For my use case, larger files than that are not interesting because the probability they are [not] malicious is very low. (Unfortunately this means Heritrix still downloads very large files only for the writer to throw them away...) Write path is ok for very large values -- I tested up to 150MB with 0.20-dev. However, for me the RS would blow up later if I tried to count the table from the shell. I just could not give them enough heap on my testbed. " St.Ack
