Dylan Hutchison wrote:
> 2. What is the most effective way to ingest data, if we're receiving data
>> with the size of>1 TB on a daily basis?
>>
>
> If latency is not a primary concern, creating Accumulo RFiles and
> performing a bulk ingest (bulk loading) is by far the most efficient way
> to get data into Accumulo. This is often done with a MapReduce job that
> processes your incoming data and creates Accumulo RFiles, after which
> those files are bulk loaded into Accumulo. If you need low latency for
> getting data into Accumulo, waiting for a MapReduce job to complete may
> take too long to meet your required latencies.
>
>
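To make the bulk-load step concrete, here is a minimal sketch using the 1.x Java client API. It assumes a MapReduce job (typically configured with AccumuloFileOutputFormat) has already written sorted RFiles to an HDFS directory; the instance name, ZooKeeper hosts, credentials, table name, and paths are all placeholders, and running it requires a live Accumulo cluster.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkLoadExample {
    public static void main(String[] args) throws Exception {
        // Placeholder instance/ZooKeeper names and credentials.
        Connector conn = new ZooKeeperInstance("myinstance", "zk1:2181")
                .getConnector("user", new PasswordToken("pass"));

        // A prior MapReduce job (e.g. using AccumuloFileOutputFormat) is
        // assumed to have written sorted RFiles into /data/bulk/files.
        conn.tableOperations().importDirectory(
                "mytable",              // destination table (placeholder)
                "/data/bulk/files",     // RFiles produced by the job
                "/data/bulk/failures",  // must exist and be empty beforehand
                false);                 // setTime=false: tservers assign timestamps
    }
}
```

The import itself just assigns the files to tablets; no data is rewritten at load time, which is why this path is so efficient.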
If you need a lower latency, you still have the option of parallel ingest
via normal BatchWriters. Assuming good load balancing and the same number
of ingestors as tablet servers, you should easily obtain ingest rates of
100k entries/sec/node. With significant effort, some have pushed this to
400k entries/sec/node.
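For reference, the BatchWriter path looks roughly like the sketch below (1.x client API; instance, credentials, and table names are placeholders, and it needs a running cluster). The buffer size and write-thread count shown are illustrative knobs that matter for throughput; you would run one such ingestor process per tablet server to parallelize.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class ParallelIngestExample {
    public static void main(String[] args) throws Exception {
        // Placeholder instance/ZooKeeper names and credentials.
        Connector conn = new ZooKeeperInstance("myinstance", "zk1:2181")
                .getConnector("user", new PasswordToken("pass"));

        // Larger buffers and more write threads generally improve throughput.
        BatchWriterConfig cfg = new BatchWriterConfig()
                .setMaxMemory(64 * 1024 * 1024)
                .setMaxWriteThreads(8);

        BatchWriter bw = conn.createBatchWriter("mytable", cfg);
        try {
            for (int i = 0; i < 1_000_000; i++) {
                Mutation m = new Mutation(String.format("row%08d", i));
                m.put("cf", "cq", new Value(("value" + i).getBytes()));
                bw.addMutation(m);
            }
        } finally {
            bw.close(); // flushes any remaining buffered mutations
        }
    }
}
```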
Josh, do we have numbers on bulk ingest rates? I'm curious what the best
rates ever achieved are.
Hrm. Not that I'm aware of. Generally, a bulk import is some ZooKeeper
operations (via FATE) and a few metadata updates per file (~3? I'm not
actually sure). Maybe I'm missing something?
My hunch is that you'd run into HDFS issues in generating the data to
import before you'd run into Accumulo limits. Eventually, compactions
might bog you down too (depending on how you generated the data). I'm
not sure if we even have a bulk-import benchmark (akin to continuous
ingest).
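For context on the original question's scale, a quick back-of-envelope check (assuming, purely for illustration, an average entry size of about 1 KB) suggests 1 TB/day is a modest aggregate rate even for the BatchWriter path:

```java
public class IngestEstimate {
    public static void main(String[] args) {
        long bytesPerDay = 1_000_000_000_000L; // 1 TB/day, from the question
        long avgEntryBytes = 1_000;            // ASSUMPTION: ~1 KB per entry
        long secondsPerDay = 86_400;

        // Aggregate entries/sec needed to keep up with the daily volume.
        double entriesPerSec = (double) bytesPerDay / avgEntryBytes / secondsPerDay;
        System.out.printf("Required aggregate rate: %.0f entries/sec%n", entriesPerSec);

        // Against the conservative 100k entries/sec/node figure above.
        long perNodeRate = 100_000;
        System.out.printf("Nodes needed at that rate: %.2f%n",
                entriesPerSec / perNodeRate);
    }
}
```

At ~11.6k entries/sec aggregate, even a single tablet server could keep up in steady state; the real constraints end up being burstiness, entry size, and (as noted above) compactions and HDFS throughput.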