Dylan Hutchison wrote:
> 2. What is the most effective way to ingest data, if we're receiving data
>> with the size of>1 TB on a daily basis?
>>
>
> If latency is not a primary concern, creating Accumulo RFiles and
> performing a bulk ingest (bulk loading) is by far the most efficient way
> to get data into Accumulo. This is often done with a MapReduce job that
> processes your incoming data and creates Accumulo RFiles, after which
> those files are bulk loaded into Accumulo. If you need low latency for
> getting data into Accumulo, waiting for a MapReduce job to complete may
> take too long to meet your required latencies.
>
>
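To make the bulk-load step concrete, here is a minimal sketch using the 1.x Java client API. It assumes a MapReduce job (typically configured with AccumuloFileOutputFormat) has already written sorted RFiles to an HDFS directory; the instance name, ZooKeeper hosts, credentials, table name, and paths are all placeholders, and running it requires a live Accumulo cluster.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class BulkLoadExample {
    public static void main(String[] args) throws Exception {
        // Placeholder instance/ZooKeeper names and credentials.
        Connector conn = new ZooKeeperInstance("myinstance", "zk1:2181")
                .getConnector("user", new PasswordToken("pass"));

        // A prior MapReduce job (e.g. using AccumuloFileOutputFormat) is
        // assumed to have written sorted RFiles into /data/bulk/files.
        conn.tableOperations().importDirectory(
                "mytable",              // destination table (placeholder)
                "/data/bulk/files",     // RFiles produced by the job
                "/data/bulk/failures",  // must exist and be empty beforehand
                false);                 // setTime=false: tservers assign timestamps
    }
}
```

The import itself just assigns the files to tablets; no data is rewritten at load time, which is why this path is so efficient.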
If you need a lower latency, you still have the option of parallel ingest
via normal BatchWriters. Assuming good load balancing and the same number
of ingestors as tablet servers, you should easily obtain ingest rates of
100k entries/sec/node. With significant effort, some have pushed this to
400k entries/sec/node.
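For reference, the BatchWriter path looks roughly like the sketch below (1.x client API; instance, credentials, and table names are placeholders, and it needs a running cluster). The buffer size and write-thread count shown are illustrative knobs that matter for throughput; you would run one such ingestor process per tablet server to parallelize.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

public class ParallelIngestExample {
    public static void main(String[] args) throws Exception {
        // Placeholder instance/ZooKeeper names and credentials.
        Connector conn = new ZooKeeperInstance("myinstance", "zk1:2181")
                .getConnector("user", new PasswordToken("pass"));

        // Larger buffers and more write threads generally improve throughput.
        BatchWriterConfig cfg = new BatchWriterConfig()
                .setMaxMemory(64 * 1024 * 1024)
                .setMaxWriteThreads(8);

        BatchWriter bw = conn.createBatchWriter("mytable", cfg);
        try {
            for (int i = 0; i < 1_000_000; i++) {
                Mutation m = new Mutation(String.format("row%08d", i));
                m.put("cf", "cq", new Value(("value" + i).getBytes()));
                bw.addMutation(m);
            }
        } finally {
            bw.close(); // flushes any remaining buffered mutations
        }
    }
}
```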
Josh, do we have numbers on bulk ingest rates? I'm curious what the best
rates ever achieved are.
Hrm. Not that I'm aware of. Generally, a bulk import is some ZooKeeper
operations (via FATE) and a few metadata updates per file (~3? I'm not
actually sure). Maybe I'm missing something?
My hunch is that you'd run into HDFS issues in generating the data to
import before you'd run into Accumulo limits. Eventually, compactions
might bog you down too (depending on how you generated the data). I'm
not sure if we even have a bulk-import benchmark (akin to continuous
ingest).
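For context on the original question's scale, a quick back-of-envelope check (assuming, purely for illustration, an average entry size of about 1 KB) suggests 1 TB/day is a modest aggregate rate even for the BatchWriter path:

```java
public class IngestEstimate {
    public static void main(String[] args) {
        long bytesPerDay = 1_000_000_000_000L; // 1 TB/day, from the question
        long avgEntryBytes = 1_000;            // ASSUMPTION: ~1 KB per entry
        long secondsPerDay = 86_400;

        // Aggregate entries/sec needed to keep up with the daily volume.
        double entriesPerSec = (double) bytesPerDay / avgEntryBytes / secondsPerDay;
        System.out.printf("Required aggregate rate: %.0f entries/sec%n", entriesPerSec);

        // Against the conservative 100k entries/sec/node figure above.
        long perNodeRate = 100_000;
        System.out.printf("Nodes needed at that rate: %.2f%n",
                entriesPerSec / perNodeRate);
    }
}
```

At ~11.6k entries/sec aggregate, even a single tablet server could keep up in steady state; the real constraints end up being burstiness, entry size, and (as noted above) compactions and HDFS throughput.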