On 04/02/2015 07:38 AM, Eugen Cepoi wrote:
> Hi there,
> I was testing Parquet with Thrift to see if there would be an
> interesting performance gain compared to using just Thrift. But in my
> test I found that plain Thrift with LZO compression was faster.
This doesn't surprise me too much because of how the Thrift object model
works. (At least, assuming I understand it right. Feel free to correct me.)
Thrift reads and writes through the TProtocol interface, which provides
a layer, like Parquet's Converters, that acts as an intermediary between
the object model and the underlying encodings. parquet-thrift implements
TProtocol by building up the list of method calls a record would make to
read or write itself, then letting the record replay that list. I think
that extra layer of indirection has the potential to slow down reading
and writing.
It's on my to-do list to try to get this working with avro-thrift, which
sets the fields directly. That's just to see whether constructing the
records directly is faster, since we currently rely on TProtocol to make
both Thrift and Scrooge objects work. A rough sketch of the difference
is below.
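To make that concrete, here is a minimal, self-contained sketch of the
"buffer the protocol events, then let the record replay them" pattern
next to direct field assignment. To be clear, the class and method names
here are invented for illustration; this is not parquet-thrift's actual
internals, just the shape of the indirection I mean:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class BufferedEventSketch {

      // Stand-in for the subset of TProtocol calls a generated record makes.
      interface EventSource {
        long readI64();
        String readString();
      }

      // Values are buffered in record order while the columns are assembled...
      static final class ReplayingSource implements EventSource {
        private final Deque<Object> events = new ArrayDeque<>();

        void buffer(Object value) { events.addLast(value); }

        @Override public long readI64() { return (Long) events.removeFirst(); }
        @Override public String readString() { return (String) events.removeFirst(); }
      }

      // ...and the record then reads itself back out of that event list: one
      // extra hop per field compared to setting the fields directly.
      static final class MyRecord {
        long userId;
        String name;

        void read(EventSource in) {
          userId = in.readI64();
          name = in.readString();
        }
      }

      public static void main(String[] args) {
        ReplayingSource source = new ReplayingSource();
        source.buffer(42L);       // really these would come from Parquet column readers
        source.buffer("eugen");

        MyRecord viaEvents = new MyRecord();
        viaEvents.read(source);   // indirect: record pulls values through the event layer

        MyRecord direct = new MyRecord();
        direct.userId = 42L;      // direct: no intermediate event stream
        direct.name = "eugen";

        System.out.println(viaEvents.userId + " " + viaEvents.name);
      }
    }

Even in this toy form, the event layer adds an extra queue operation and
some boxing per field, which is the kind of per-record overhead I'd
expect to add up over 9 million records.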
> I used a small EMR cluster with 2 m3.xlarge core nodes.
> The sampled input has 9 million records, about 1 GB (on S3), with ~20
> fields and some nested structures and maps. I just do a count on it.
> I tried playing with different tuning options, but none seemed to
> really improve things (the picture shows some global metrics for the
> different options).
> I also tried with a larger sample, about a couple of GB (output once
> compressed), but I had similar results.
Could you post the results of `parquet-tools meta`? I'd like to see what
your column layout looks like (the final column chunk sizes).
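For example, something like this against one of the output files should
show the row groups and the per-column chunk sizes (the path is just a
placeholder):

    parquet-tools meta path/to/your-output.parquet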
If your data ends up with only a column or two dominating the row group
and you always select those columns, then you probably wouldn't see an
improvement. You need at least one "big" column chunk that you're ignoring.
Also, what compression did you use for the Parquet files?
> In the end, the only situation I can see where it could perform
> significantly better is when reading a few columns from a dataset that
> has a large number of columns. But since the schemas are hand-written,
> I don't imagine having data structures with hundreds of columns.
I think we'll know more from taking a look at the row groups and column
chunk sizes.
> I am wondering if I am doing something wrong (especially given the
> large difference between plain Thrift and Parquet+Thrift), or if this
> dataset just isn't a good fit for Parquet?
> Thanks!
> Cheers,
> Eugen
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.