Here are a few random notes cut from private correspondence with Andrew Purtell on crawling. With Andrew's permission I'm forwarding them to the list. They might be of use to others trying to crawl into HBase:
"On a 4 node cluster ~190 [Hertitrix] TOE threads pull down ~20MB/sec while simultaneously a full major compaction on all tables (triggered from shell) is running. That's about 4.4M URLs/day over 3 Heritrix nodes even under high CPU, heap, and DFS loading. Cluster handles it smoothly. New experiment, better tuning. Very low vm.swappiness to tell Linux to sacrifice buffer pages instead of e.g. JVM heap pages. Jobs are scheduled and packed into engine instances based on heap use history over the past several hours, so the engine won't blow up. This config handles insert storms by allowing up to 20 flushes before the memstore flusher tries to slow things down. LZO compresson for speed. 1GB region split threshold to hold the number of regions down for a large write app. "URL/day capacity seems to scale roughly linear with number of crawler nodes. The network becomes the bottleneck on this testbed, not HBase. "Some detail to add is: 100 regionserver IPC handers and similar upping of ZK connection limits. Also, Heritrix job packing is constrained such that there's always < 100 TOE threads per host to stay under that limit. Jobs are for now always built with a fixed thread pool of 10. [Heritrix] Engines are running with -Xmx512 so usually can't handle more than 3 jobs concurrently anyway. "Also, I created a custom HBase writer based on [hbase-writer] plus the patch I contributed for dropping overly large documents in shouldWrite. This is a requirement because with chunked transfer encoding Heritrix's document length limit filter doesn't work. I still find that values > ~50MB create trouble. ~40-50MB is about the line on my testbed. It used to be ~20MB. I run with 10MB. For my use case, larger files than that are not interesting because the probability they are [not] malicious is very low. (Unfortunately this means Heritrix still downloads very large files only for the writer to throw them away...) Write path is ok for very large values -- I tested up to 150MB with 0.20-dev. However, for me the RS would blow up later if I tried to count the table from the shell. I just could not give them enough heap on my testbed. " St.Ack
