The 600M rows are split across 100 files. Five separate psql scripts ran in
parallel on a single machine, loading files 1-20, 21-40, 41-60, 61-80, and
81-100 respectively. Performance is affected because the keys in these files
are in sequential order, which leads to hot-spotting of region servers. To
address this we should also try a Salted
<https://phoenix.apache.org/salted.html> Phoenix table to get the best write
performance. A higher batch size might offer better performance too.
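
For illustration, here is a rough sketch of what a salted variant plus a
bigger commit batch could look like through the Phoenix JDBC driver. The
table is a cut-down stand-in (three columns only), and the ZooKeeper quorum,
bucket count, and 10K batch size are made-up placeholders rather than
anything that was benchmarked:

    import java.math.BigDecimal;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.Statement;

    public class SaltedLoadSketch {
        public static void main(String[] args) throws Exception {
            // "localhost" stands in for the real ZooKeeper quorum.
            try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost")) {
                conn.setAutoCommit(false);

                // SALT_BUCKETS pre-splits the table so sequential keys are spread
                // across region servers instead of hammering a single one.
                try (Statement ddl = conn.createStatement()) {
                    ddl.execute(
                        "CREATE TABLE IF NOT EXISTS LINEITEM_SALTED ("
                      + " L_ORDERKEY INTEGER NOT NULL,"
                      + " L_LINENUMBER INTEGER NOT NULL,"
                      + " L_QUANTITY DECIMAL(15,2)"
                      + " CONSTRAINT pk PRIMARY KEY (L_ORDERKEY, L_LINENUMBER)"
                      + ") SALT_BUCKETS = 16");
                }

                // Commit in batches; 10K is a guess at a "higher batch size" than
                // the 1K used in the run described above.
                int batchSize = 10_000;
                try (PreparedStatement upsert = conn.prepareStatement(
                        "UPSERT INTO LINEITEM_SALTED VALUES (?, ?, ?)")) {
                    for (int row = 1; row <= 100_000; row++) {
                        upsert.setInt(1, row);                           // L_ORDERKEY
                        upsert.setInt(2, 1);                             // L_LINENUMBER
                        upsert.setBigDecimal(3, new BigDecimal("1.00")); // L_QUANTITY
                        upsert.executeUpdate();
                        if (row % batchSize == 0) {
                            conn.commit();
                        }
                    }
                    conn.commit();
                }
            }
        }
    }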

I've also just started a data load from a single psql client; since loading
from a single machine can get bottlenecked on network I/O, I don't think its
time will be too far off from loading with multiple clients. I'll update with
the results tomorrow.


On Tue, Aug 23, 2016 at 12:49 PM, John Leach <jle...@splicemachine.com>
wrote:

> Mujtaba,
>
> Not following the import process.
>
> The 5 parallel psql clients means that you manually split the data into 5
> buckets/files/directories and then run 5 import scripts simultaneously?
>
> If we wanted to benchmark import performance, what would be the right
> model for that?
>
> Thanks this is very helpful...
>
> Regards,
> John
>
>
>
> > On Aug 23, 2016, at 2:28 PM, Mujtaba Chohan <mujt...@apache.org> wrote:
> >
> > FYI re-loaded with Phoenix 4.8/HBase 0.98.20 on an 8-node cluster with 64G
> > total/12G HBase heap.
> >
> > *Data Load*
> > * 5.5 hours for 600M rows
> > * Method: Direct CSV load using psql.py script
> > * # client machines: 1
> > * Batch size 1K
> > * Key order: *Sequential*
> > * 5 parallel psql clients
> > * No missing rows due to data load
> >
> > *Schema*
> > CREATE TABLE LINEITEM_MULTICF (
> >     L_ORDERKEY INTEGER NOT NULL,
> >     A.L_PARTKEY INTEGER,
> >     B.L_SUPPKEY INTEGER,
> >     L_LINENUMBER INTEGER NOT NULL,
> >     B.L_QUANTITY DECIMAL(15,2),
> >     B.L_EXTENDEDPRICE DECIMAL(15,2),
> >     C.L_DISCOUNT DECIMAL(15,2),
> >     C.L_TAX DECIMAL(15,2),
> >     C.L_RETURNFLAG CHAR(1),
> >     C.L_LINESTATUS CHAR(1),
> >     C.L_SHIPDATE DATE,
> >     C.L_COMMITDATE DATE,
> >     C.L_RECEIPTDATE DATE,
> >     C.L_SHIPINSTRUCT CHAR(25),
> >     C.L_SHIPMODE CHAR(10),
> >     C.L_COMMENT VARCHAR(44)
> >     CONSTRAINT pk PRIMARY KEY (l_orderkey, l_linenumber));
> >
> > * 100M guidepost width
> > * ~42GB data
> > * Uncompressed/default fast-diff encoded
> >
> > *Performance with default 128 threadpool*
> > select count(*) from lineitem_multicf;
> > +------------+
> > |  COUNT(1)  |
> > +------------+
> > | 600037902  |
> > +------------+
> > 1 row selected (*24.1 seconds*)
> >
> > select l_returnflag, l_linestatus,
> >        sum(l_quantity) as sum_qty,
> >        sum(l_extendedprice) as sum_base_price,
> >        sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
> >        sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
> >        avg(l_quantity) as avg_qty,
> >        avg(l_extendedprice) as avg_price,
> >        avg(l_discount) as avg_disc,
> >        count(*) as count_order
> > from lineitem
> > where l_shipdate <= current_date() - 90
> > group by l_returnflag, l_linestatus
> > order by l_returnflag, l_linestatus;
> > 4 rows selected (*185.2 seconds*)
> >
> > *Data*
> > do curl -O http://static.druid.io/data/benchmarks/tpch/100/lineitem.tbl.$i.gz ; done
> >
> >
> >
> > On Mon, Aug 22, 2016 at 7:48 AM, John Leach <jle...@splicemachine.com>
> > wrote:
> >
> >> It looks like you guys already have most of the TPCH queries running based
> >> on Enis’s talk in Ireland this year.  Very cool.
> >>
> >> (Slide 20:  Phoenix can execute most of the TPC-H queries!)
> >>
> >> Regards,
> >> John Leach
> >>
> >>
> >>> On Aug 19, 2016, at 8:28 PM, Nick Dimiduk <ndimi...@gmail.com> wrote:
> >>>
> >>> It's TPC-DS, not -H, but this is what I was using way back when to run
> >> perf
> >>> tests over Phoenix and the query server while I was developing on it.
> The
> >>> first project generates, loads the data via mapreduce and the second
> tool
> >>> wraps up use of jmeter to run queries in parallel.
> >>>
> >>> https://github.com/ndimiduk/tpcds-gen
> >>> https://github.com/ndimiduk/phoenix-performance
> >>>
> >>> Probably there's dust and bit-rot to brush off of both projects, but
> >> maybe
> >>> it'll help someone looking for a starting point?
> >>>
> >>> Apologies, but I haven't had time to see what the speakers have shared
> >>> about their setup.
> >>>
> >>> -n
> >>>
> >>> On Friday, August 19, 2016, Andrew Purtell <apurt...@apache.org>
> wrote:
> >>>
> >>>>> Maybe there's such a test harness that already exists for TPC?
> >>>>
> >>>> TPC provides tooling but it's all proprietary. The generated data can
> be
> >>>> kept separately (Druid does it at least -
> >>>> http://druid.io/blog/2014/03/17/benchmarking-druid.html).
> >>>>
> >>>> I'd say there would be one time setup: generation of data sets of
> >> various
> >>>> sizes, conversion to compressed CSV, and upload to somewhere public
> >> (S3?).
> >>>> Not strictly necessary, but it would save everyone a lot of time and
> >> hassle
> >>>> to not have to download the TPC data generators and munge the output
> >> every
> >>>> time. For this one could use the TPC tools.
> >>>>
> >>>> Then, the most sensible avenue I think would be implementation of new
> >>>> Phoenix integration tests that consume that data and run uniquely
> >> tweaked
> >>>> queries (yeah - every datastore vendor must do that with TPC). Phoenix
> >> can
> >>>> use hbase-it and get the cluster and chaos tooling such as it is for
> >> free,
> >>>> but the upsert/initialization/bulk load and query tooling would be all
> >>>> Phoenix based: the CSV loader, the JDBC driver.
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Aug 19, 2016 at 5:31 PM, James Taylor <jamestay...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> On Fri, Aug 19, 2016 at 3:01 PM, Andrew Purtell <apurt...@apache.org>
> >>>>> wrote:
> >>>>>
> >>>>>>> I have a long interest in 'canned' loadings. Interesting ones are
> >>>> hard
> >>>>> to
> >>>>>>> come by. If Phoenix ran any or a subset of TPCs, I'd like to try
> it.
> >>>>>>
> >>>>>> Likewise
> >>>>>>
> >>>>>>> But I don't want to be the first to try it. I am not a Phoenix
> >>>> expert.
> >>>>>>
> >>>>>> Same here, I'd just email dev@phoenix with a report that TPC query
> >> XYZ
> >>>>>> didn't work and that would be as far as I could get.
> >>>>>>
> >>>>>> I don't think the first phase would require Phoenix experience. It's
> >>>> more
> >>>>> around the automation for running each TPC benchmark so the process
> is
> >>>>> repeatable:
> >>>>> - pulling in the data
> >>>>> - scripting the jobs
> >>>>> - having a test harness they run inside
> >>>>> - identifying the queries that don't work (ideally you wouldn't stop
> at
> >>>> the
> >>>>> first error)
> >>>>> - filing JIRAs for these
> >>>>>
> >>>>> The entire framework could be built and tested using standard JDBC
> >> APIs,
> >>>>> and then initially run using MySQL or some other RDBMS before trying
> it
> >>>>> with Phoenix. Maybe there's such a test harness that already exists
> for
> >>>>> TPC?
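
A bare-bones sketch of that kind of JDBC-only harness could be as simple as
the following (the one-.sql-file-per-query layout, the class name, and the
reporting are placeholders I'm assuming, not an existing tool):

    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import java.util.List;
    import java.util.stream.Stream;

    public class TpcQueryHarness {
        public static void main(String[] args) throws Exception {
            String jdbcUrl = args[0];          // e.g. a MySQL URL first, jdbc:phoenix:<zk> later
            Path queryDir = Path.of(args[1]);  // assumes one .sql file per TPC query

            List<Path> queries;
            try (Stream<Path> files = Files.list(queryDir)) {
                queries = files.sorted().toList();
            }

            try (Connection conn = DriverManager.getConnection(jdbcUrl)) {
                for (Path q : queries) {
                    String sql = Files.readString(q);
                    long start = System.currentTimeMillis();
                    try (Statement stmt = conn.createStatement();
                         ResultSet rs = stmt.executeQuery(sql)) {
                        long rows = 0;
                        while (rs.next()) {
                            rows++;
                        }
                        System.out.printf("%s: %d rows in %d ms%n",
                                q.getFileName(), rows, System.currentTimeMillis() - start);
                    } catch (SQLException e) {
                        // Don't stop at the first failure: log it and keep going.
                        System.err.printf("%s: FAILED - %s%n", q.getFileName(), e.getMessage());
                    }
                }
            }
        }
    }

Swapping args[0] from the MySQL URL to jdbc:phoenix:<zookeeper quorum> would
be the only change needed to move between the two, and whatever lands on
stderr becomes the list of queries to file JIRAs for.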
> >>>>>
> >>>>> Then I think the next phase would require more Phoenix & HBase
> >>>> experience:
> >>>>> - tweaking queries where possible given any limitations in Phoenix
> >>>>> - adding missing syntax (or potentially using the calcite branch
> which
> >>>>> supports more)
> >>>>> - tweaking Phoenix schema declarations to optimize
> >>>>> - tweaking Phoenix & HBase configs to optimize
> >>>>> - determining which secondary indexes to add (though I think there's
> an
> >>>>> academic paper on this, I can't seem to find it)
> >>>>>
> >>>>> Both phases would require a significant amount of time and effort.
> Each
> >>>>> benchmark would likely require unique tweaks.
> >>>>>
> >>>>> Thanks,
> >>>>> James
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Best regards,
> >>>>
> >>>>  - Andy
> >>>>
> >>>> Problems worthy of attack prove their worth by hitting back. - Piet
> Hein
> >>>> (via Tom White)
> >>>>
> >>
> >>
>
>
