To see the impact on different kinds of workloads, it would be good to add a bulk load option to something like YCSB, then run the normal workloads.
-- Sean

On Oct 7, 2015 11:58 AM, "Josh Elser" <[email protected]> wrote:

> Jeff Kubina wrote:
>
>> Per my thread "How does Accumulo process r-files for bulk ingesting?"
>> on the user@ list, I would like to test/measure how a lack of
>> data locality in bulk-ingested files affects query performance. I seek
>> comments/suggestions on the outline of the design for the test:
>>
>> Outline:
>> 1. Create a table and pre-split it to have m tablets, where m = "total
>> tservers".
>> 2. Create 1 r-file containing m*n records that distribute evenly
>> across the m tablets.
>> 3. Bulk ingest the r-file.
>> 4. Query each of the split ranges in the table and log the times.
>> 5. Compact the table and wait for the compaction to complete.
>> 6. Query each of the split ranges in the table and log the times.
>> 7. Compute the ratio of the median times from steps 4 and 6.
>>
>> Questions:
>> 1. Instead of compacting the table, should I create a new table by
>> generating m r-files whose ranges each intersect only one of the
>> tablets and bulk ingesting them?
>
> If you can be tricky in your non-data-local case and evenly balance the
> data, you could just do one table import followed by a compaction and
> rerun on the same table.
>
> You'd just want to make sure you have a decent distribution of the data
> across all servers in both the data-local and non-data-local cases.
>
>> 2. What is a good size for n, the number of records per tablet server?
>
> I'm wondering if it depends on the type of workload that you're looking
> to run. Does it make a difference if you're just running randomized point
> queries? Or doing a scan over the entire table?
>
> Assuming you're just doing one tablet per server for your table (it's
> not apparent to me if there's a reason that would result in a lesser
> test), I'd guess a couple hundred MB worth of records per tablet would
> be good. Enough to get a few HDFS blocks per RFile, but not enough that
> Accumulo would automatically split it from underneath you. You could
> also try increasing the split threshold and putting more data per file.
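For concreteness, here is a minimal, cluster-free sketch of the bookkeeping in steps 1 and 7 of the outline: generating evenly spaced split points for m tablets, and computing the ratio of median query times before and after compaction. The class and method names (`LocalityTestPlan`, `splitPoints`, `medianRatio`) and the fixed-width hex row space are illustrative assumptions, not anything from Accumulo's API; the actual split/compact/scan calls would go through `TableOperations` on a live cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class LocalityTestPlan {

    // Step 1 (sketch): m-1 split points dividing an assumed row space of
    // 8-char zero-padded hex strings into m roughly equal tablets,
    // one tablet per tserver.
    static SortedSet<String> splitPoints(int numTablets) {
        SortedSet<String> splits = new TreeSet<>();
        long max = 1L << 32; // rows drawn uniformly from [0, 2^32)
        for (int i = 1; i < numTablets; i++) {
            splits.add(String.format("%08x", max / numTablets * i));
        }
        return splits;
    }

    // Step 7: ratio of the median pre-compaction query time to the
    // median post-compaction query time (>1 means compaction helped).
    static double medianRatio(List<Long> preMillis, List<Long> postMillis) {
        return median(preMillis) / median(postMillis);
    }

    static double median(List<Long> values) {
        List<Long> sorted = new ArrayList<>(values);
        sorted.sort(null); // natural ordering
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        // 4 tablets -> 3 evenly spaced hex split points
        System.out.println(splitPoints(4));
        // per-range times before vs. after compaction
        System.out.println(medianRatio(
                Arrays.asList(120L, 150L, 130L),
                Arrays.asList(100L, 110L, 90L)));
    }
}
```

On a real cluster the `splitPoints` output would be fed to `TableOperations.addSplits`, and the two timing lists would come from timed scans over each split range in steps 4 and 6.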
