FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with 64G total/12G HBase heap.
*Data Load*
* 5.5 hours for 600M rows
* Method: Direct CSV load using psql.py script
* # client machines: 1
* Batch size 1K
* Key order: *Sequential*
* 5 parallel psql clients
* No missing rows due to data load

*Schema*
CREATE TABLE LINEITEM_MULTICF (
    L_ORDERKEY INTEGER not null,
    A.L_PARTKEY INTEGER,
    B.L_SUPPKEY INTEGER,
    L_LINENUMBER INTEGER not null,
    B.L_QUANTITY DECIMAL(15,2),
    B.L_EXTENDEDPRICE DECIMAL(15,2),
    C.L_DISCOUNT DECIMAL(15,2),
    C.L_TAX DECIMAL(15,2),
    C.L_RETURNFLAG CHAR(1),
    C.L_LINESTATUS CHAR(1),
    C.L_SHIPDATE DATE,
    C.L_COMMITDATE DATE,
    C.L_RECEIPTDATE DATE,
    C.L_SHIPINSTRUCT CHAR(25),
    C.L_SHIPMODE CHAR(10),
    C.L_COMMENT VARCHAR(44)
    constraint pk primary key(l_orderkey, l_linenumber));
* 100M guidepost width
* ~42GB data
* Uncompressed/default fast-diff encoded

*Performance with default 128 threadpool*
select count(*) from lineitem_multicf;
+------------+
|  COUNT(1)  |
+------------+
| 600037902  |
+------------+
1 row selected (*24.1 seconds*)

select l_returnflag, l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
from lineitem
where l_shipdate <= current_date() - 90
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
4 rows selected (*185.2 seconds*)

*Data*
do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done

On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com> wrote:

> It looks like you guys already have most of the TPCH queries running based
> on Enis’s talk in Ireland this year. Very cool.
>
> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>
> Regards,
> John Leach
>
>
> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >
> > It's TPC-DS, not -H, but this is what I was using way back when to run perf
> > tests over Phoenix and the query server while I was developing on it. The
> > first project generates, loads the data via mapreduce and the second tool
> > wraps up use of jmeter to run queries in parallel.
> >
> > https://github.com/ndimiduk/tpcds-gen
> > https://github.com/ndimiduk/phoenix-performance
> >
> > Probably there's dust and bit-rot to brush off of both projects, but maybe
> > it'll help someone looking for a starting point?
> >
> > Apologies, but I haven't had time to see what the speakers have shared
> > about their setup.
> >
> > -n
> >
> > On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
> >
> >>> Maybe there's such a test harness that already exists for TPC?
> >>
> >> TPC provides tooling but it's all proprietary. The generated data can be
> >> kept separately (Druid does it at least -
> >> http://druid.io/blog/2014/03/17/benchmarking-druid.html
> >> ).
> >>
> >> I'd say there would be one time setup: generation of data sets of various
> >> sizes, conversion to compressed CSV, and upload to somewhere public (S3?).
> >> Not strictly necessary, but it would save everyone a lot of time and hassle
> >> to not have to download the TPC data generators and munge the output every
> >> time. For this one could use the TPC tools.
> >>
> >> Then, the most sensible avenue I think would be implementation of new
> >> Phoenix integration tests that consume that data and run uniquely tweaked
> >> queries (yeah - every datastore vendor must do that with TPC).
> >> Phoenix can use hbase-it and get the cluster and chaos tooling such as it
> >> is for free, but the upsert/initialization/bulk load and query tooling
> >> would be all Phoenix based: the CSV loader, the JDBC driver.
> >>
> >> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org> wrote:
> >>
> >>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>
> >>>>> I have a long interest in 'canned' loadings. Interesting ones are hard
> >>>>> to come by. If Phoenix ran any or a subset of TPCs, I'd like to try it.
> >>>>
> >>>> Likewise
> >>>>
> >>>>> But I don't want to be the first to try it. I am not a Phoenix expert.
> >>>>
> >>>> Same here, I'd just email dev@phoenix with a report that TPC query XYZ
> >>>> didn't work and that would be as far as I could get.
> >>>
> >>> I don't think the first phase would require Phoenix experience. It's more
> >>> around the automation for running each TPC benchmark so the process is
> >>> repeatable:
> >>> - pulling in the data
> >>> - scripting the jobs
> >>> - having a test harness they run inside
> >>> - identifying the queries that don't work (ideally you wouldn't stop at
> >>>   the first error)
> >>> - filing JIRAs for these
> >>>
> >>> The entire framework could be built and tested using standard JDBC APIs,
> >>> and then initially run using MySQL or some other RDBMS before trying it
> >>> with Phoenix. Maybe there's such a test harness that already exists for
> >>> TPC?
> >>>
> >>> Then I think the next phase would require more Phoenix & HBase experience:
> >>> - tweaking queries where possible given any limitations in Phoenix
> >>> - adding missing syntax (or potentially using the calcite branch which
> >>>   supports more)
> >>> - tweaking Phoenix schema declarations to optimize
> >>> - tweaking Phoenix & HBase configs to optimize
> >>> - determining which secondary indexes to add (though I think there's an
> >>>   academic paper on this, I can't seem to find it)
> >>>
> >>> Both phases would require a significant amount of time and effort. Each
> >>> benchmark would likely require unique tweaks.
> >>>
> >>> Thanks,
> >>> James
> >>
> >>
> >> --
> >> Best regards,
> >>
> >> - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
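
For the first-phase harness James describes above, a bare-bones starting point could be plain JDBC: run each query, time it, and keep going past failures so the broken ones can be collected and filed as JIRAs. This is only a sketch, not anything that exists today; the class name, query list, and ZooKeeper quorum (localhost:2181) are placeholders, and the same code can be pointed at MySQL first simply by swapping the JDBC URL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class TpcQueryRunner {
    public static void main(String[] args) throws Exception {
        // Placeholder queries; a real run would read the full TPC query set from files.
        List<String> queries = Arrays.asList(
            "SELECT COUNT(*) FROM LINEITEM_MULTICF",
            "SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) FROM LINEITEM_MULTICF"
                + " GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS");

        // Standard JDBC: point the URL at MySQL or another RDBMS to shake out the
        // harness first, then switch to the Phoenix thick driver (jdbc:phoenix:<zk quorum>).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            for (String sql : queries) {
                long start = System.currentTimeMillis();
                try (Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(sql)) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++;
                    }
                    double secs = (System.currentTimeMillis() - start) / 1000.0;
                    System.out.printf("OK   %d rows in %.1fs: %s%n", rows, secs, sql);
                } catch (Exception e) {
                    // Don't stop at the first error: record the failure and move on,
                    // so unsupported queries can be filed as JIRAs in one pass.
                    System.out.printf("FAIL %s: %s%n", sql, e);
                }
            }
        }
    }
}

Nick's phoenix-performance project already wraps JMeter for running queries in parallel, so a sketch like this would mainly be useful for single-client sanity runs before wiring up the heavier tooling.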