To see the impact on different kinds of workloads, it would be good to add a bulk load option to something like YCSB, then run the normal workloads.
-- Sean

On Oct 7, 2015 11:58 AM, "Josh Elser" <[email protected]> wrote:

> Jeff Kubina wrote:
>
>> Per my thread "How does Accumulo process r-files for bulk ingesting?"
>> on the user@ list, I would like to test/measure how a lack of
>> data locality in bulk-ingested files affects query performance. I seek
>> comments/suggestions on the outline of the design for the test:
>>
>> Outline:
>> 1. Create a table and pre-split it to have m tablets, where m = "total
>> tservers".
>> 2. Create 1 r-file containing m*n records that distribute evenly
>> across the m tablets.
>> 3. Bulk ingest the r-file.
>> 4. Query each of the split ranges in the table and log the times.
>> 5. Compact the table and wait for the compaction to complete.
>> 6. Query each of the split ranges in the table and log the times.
>> 7. Compute the ratio of the median times from steps 4 and 6.
>>
>> Questions:
>> 1. Instead of compacting the table, should I create a new table by
>> generating m r-files whose ranges each intersect only one of the
>> tablets and bulk ingesting them?
>
> If you can be tricky in your non-data-local case and evenly balance the
> data, you could just do one table import followed by a compaction and
> rerun on the same table.
>
> You'd just want to make sure you have a decent distribution of the data
> across all servers in both the data-local and non-data-local cases.
>
>> 2. What is a good size for n, the number of records per tablet server?
>
> I'm wondering if it depends on the type of workload that you're looking
> to run. Does it make a difference if you're just running randomized point
> queries? Or doing a scan over the entire table?
>
> Assuming you're just doing one tablet per server for your table (it's
> not apparent to me if there's a reason that would result in a lesser
> test), I'd guess a couple hundred MB worth of records per tablet would
> be good. Enough to get a few HDFS blocks per RFile, but not enough that
> Accumulo would automatically split it from underneath you. You could
> also try increasing the split threshold and putting more data per file.
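For concreteness, here is a minimal, cluster-free sketch of the bookkeeping in steps 1 and 7 of the outline: generating evenly spaced split points for m tablets, and computing the ratio of median query times before and after compaction. The class and method names (`LocalityTestPlan`, `splitPoints`, `medianRatio`) and the fixed-width hex row space are illustrative assumptions, not anything from Accumulo's API; the actual split/compact/scan calls would go through `TableOperations` on a live cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class LocalityTestPlan {

    // Step 1 (sketch): m-1 split points dividing an assumed row space of
    // 8-char zero-padded hex strings into m roughly equal tablets,
    // one tablet per tserver.
    static SortedSet<String> splitPoints(int numTablets) {
        SortedSet<String> splits = new TreeSet<>();
        long max = 1L << 32; // rows drawn uniformly from [0, 2^32)
        for (int i = 1; i < numTablets; i++) {
            splits.add(String.format("%08x", max / numTablets * i));
        }
        return splits;
    }

    // Step 7: ratio of the median pre-compaction query time to the
    // median post-compaction query time (>1 means compaction helped).
    static double medianRatio(List<Long> preMillis, List<Long> postMillis) {
        return median(preMillis) / median(postMillis);
    }

    static double median(List<Long> values) {
        List<Long> sorted = new ArrayList<>(values);
        sorted.sort(null); // natural ordering
        int n = sorted.size();
        return n % 2 == 1
                ? sorted.get(n / 2)
                : (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0;
    }

    public static void main(String[] args) {
        // 4 tablets -> 3 evenly spaced hex split points
        System.out.println(splitPoints(4));
        // per-range times before vs. after compaction
        System.out.println(medianRatio(
                Arrays.asList(120L, 150L, 130L),
                Arrays.asList(100L, 110L, 90L)));
    }
}
```

On a real cluster the `splitPoints` output would be fed to `TableOperations.addSplits`, and the two timing lists would come from timed scans over each split range in steps 4 and 6.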
