FYI, re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with 64G total/12G HBase heap.
*Data Load*
* 5.5 hours for 600M rows
* Method: Direct CSV load using psql.py script
* # client machines: 1
* Batch size 1K
* Key order: *Sequential*
* 5 parallel psql clients
* No missing rows due to data load

*Schema*
CREATE TABLE LINEITEM_MULTICF (
    L_ORDERKEY INTEGER not null,
    A.L_PARTKEY INTEGER,
    B.L_SUPPKEY INTEGER,
    L_LINENUMBER INTEGER not null,
    B.L_QUANTITY DECIMAL(15,2),
    B.L_EXTENDEDPRICE DECIMAL(15,2),
    C.L_DISCOUNT DECIMAL(15,2),
    C.L_TAX DECIMAL(15,2),
    C.L_RETURNFLAG CHAR(1),
    C.L_LINESTATUS CHAR(1),
    C.L_SHIPDATE DATE,
    C.L_COMMITDATE DATE,
    C.L_RECEIPTDATE DATE,
    C.L_SHIPINSTRUCT CHAR(25),
    C.L_SHIPMODE CHAR(10),
    C.L_COMMENT VARCHAR(44)
    constraint pk primary key(l_orderkey, l_linenumber));
* 100M guidepost width
* ~42GB data
* Uncompressed/default fast-diff encoded

*Performance with default 128 threadpool*
select count(*) from lineitem_multicf;
+------------+
|  COUNT(1)  |
+------------+
| 600037902  |
+------------+
1 row selected (*24.1 seconds*)

select l_returnflag, l_linestatus,
       sum(l_quantity) as sum_qty,
       sum(l_extendedprice) as sum_base_price,
       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
       avg(l_quantity) as avg_qty,
       avg(l_extendedprice) as avg_price,
       avg(l_discount) as avg_disc,
       count(*) as count_order
from lineitem
where l_shipdate <= current_date() - 90
group by l_returnflag, l_linestatus
order by l_returnflag, l_linestatus;
4 rows selected (*185.2 seconds*)

*Data*
do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done

On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com> wrote:

> It looks like you guys already have most of the TPCH queries running based
> on Enis’s talk in Ireland this year. Very cool.
>
> (Slide 20: Phoenix can execute most of the TPC-H queries!)
>
> Regards,
> John Leach
>
>
> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >
> > It's TPC-DS, not -H, but this is what I was using way back when to run perf
> > tests over Phoenix and the query server while I was developing on it. The
> > first project generates, loads the data via mapreduce and the second tool
> > wraps up use of jmeter to run queries in parallel.
> >
> > https://github.com/ndimiduk/tpcds-gen
> > https://github.com/ndimiduk/phoenix-performance
> >
> > Probably there's dust and bit-rot to brush off of both projects, but maybe
> > it'll help someone looking for a starting point?
> >
> > Apologies, but I haven't had time to see what the speakers have shared
> > about their setup.
> >
> > -n
> >
> > On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org> wrote:
> >
> >>> Maybe there's such a test harness that already exists for TPC?
> >>
> >> TPC provides tooling but it's all proprietary. The generated data can be
> >> kept separately (Druid does it at least -
> >> http://druid.io/blog/2014/03/17/benchmarking-druid.html
> >> ).
> >>
> >> I'd say there would be one time setup: generation of data sets of various
> >> sizes, conversion to compressed CSV, and upload to somewhere public (S3?).
> >> Not strictly necessary, but it would save everyone a lot of time and hassle
> >> to not have to download the TPC data generators and munge the output every
> >> time. For this one could use the TPC tools.
> >>
> >> Then, the most sensible avenue I think would be implementation of new
> >> Phoenix integration tests that consume that data and run uniquely tweaked
> >> queries (yeah - every datastore vendor must do that with TPC).
> >> Phoenix can use hbase-it and get the cluster and chaos tooling such as it
> >> is for free, but the upsert/initialization/bulk load and query tooling
> >> would be all Phoenix based: the CSV loader, the JDBC driver.
> >>
> >> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org> wrote:
> >>
> >>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org> wrote:
> >>>
> >>>>> I have a long interest in 'canned' loadings. Interesting ones are hard
> >>>>> to come by. If Phoenix ran any or a subset of TPCs, I'd like to try it.
> >>>>
> >>>> Likewise
> >>>>
> >>>>> But I don't want to be the first to try it. I am not a Phoenix expert.
> >>>>
> >>>> Same here, I'd just email dev@phoenix with a report that TPC query XYZ
> >>>> didn't work and that would be as far as I could get.
> >>>
> >>> I don't think the first phase would require Phoenix experience. It's more
> >>> around the automation for running each TPC benchmark so the process is
> >>> repeatable:
> >>> - pulling in the data
> >>> - scripting the jobs
> >>> - having a test harness they run inside
> >>> - identifying the queries that don't work (ideally you wouldn't stop at
> >>>   the first error)
> >>> - filing JIRAs for these
> >>>
> >>> The entire framework could be built and tested using standard JDBC APIs,
> >>> and then initially run using MySQL or some other RDBMS before trying it
> >>> with Phoenix. Maybe there's such a test harness that already exists for
> >>> TPC?
> >>>
> >>> Then I think the next phase would require more Phoenix & HBase experience:
> >>> - tweaking queries where possible given any limitations in Phoenix
> >>> - adding missing syntax (or potentially using the calcite branch which
> >>>   supports more)
> >>> - tweaking Phoenix schema declarations to optimize
> >>> - tweaking Phoenix & HBase configs to optimize
> >>> - determining which secondary indexes to add (though I think there's an
> >>>   academic paper on this, I can't seem to find it)
> >>>
> >>> Both phases would require a significant amount of time and effort. Each
> >>> benchmark would likely require unique tweaks.
> >>>
> >>> Thanks,
> >>> James
> >>
> >>
> >> --
> >> Best regards,
> >>
> >> - Andy
> >>
> >> Problems worthy of attack prove their worth by hitting back. - Piet Hein
> >> (via Tom White)
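
For the first-phase harness James describes above, a bare-bones starting point could be plain JDBC: run each query, time it, and keep going past failures so the broken ones can be collected and filed as JIRAs. This is only a sketch, not anything that exists today; the class name, query list, and ZooKeeper quorum (localhost:2181) are placeholders, and the same code can be pointed at MySQL first simply by swapping the JDBC URL.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.Arrays;
import java.util.List;

public class TpcQueryRunner {
    public static void main(String[] args) throws Exception {
        // Placeholder queries; a real run would read the full TPC query set from files.
        List<String> queries = Arrays.asList(
            "SELECT COUNT(*) FROM LINEITEM_MULTICF",
            "SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_QUANTITY) FROM LINEITEM_MULTICF"
                + " GROUP BY L_RETURNFLAG, L_LINESTATUS ORDER BY L_RETURNFLAG, L_LINESTATUS");

        // Standard JDBC: point the URL at MySQL or another RDBMS to shake out the
        // harness first, then switch to the Phoenix thick driver (jdbc:phoenix:<zk quorum>).
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181")) {
            for (String sql : queries) {
                long start = System.currentTimeMillis();
                try (Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(sql)) {
                    long rows = 0;
                    while (rs.next()) {
                        rows++;
                    }
                    double secs = (System.currentTimeMillis() - start) / 1000.0;
                    System.out.printf("OK   %d rows in %.1fs: %s%n", rows, secs, sql);
                } catch (Exception e) {
                    // Don't stop at the first error: record the failure and move on,
                    // so unsupported queries can be filed as JIRAs in one pass.
                    System.out.printf("FAIL %s: %s%n", sql, e);
                }
            }
        }
    }
}

Nick's phoenix-performance project already wraps JMeter for running queries in parallel, so a sketch like this would mainly be useful for single-client sanity runs before wiring up the heavier tooling.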