On 04/02/2015 07:38 AM, Eugen Cepoi wrote:
> Hi there,
> I was testing Parquet with Thrift to see if there would be an
> interesting performance gain compared to using just Thrift. But in my
> test I found that plain Thrift with LZO compression was faster.
This doesn't surprise me too much because of how the Thrift object model
works. (At least, assuming I understand it right. Feel free to correct me.)
Thrift reads and writes through the TProtocol interface, which provides
a layer, like Parquet's Converters, that acts as an intermediary between
the object model and the underlying encodings. parquet-thrift implements
TProtocol by building up the list of method calls a record would make to
read or write itself, then letting the record replay that list. I think
that extra layer of indirection has the potential to slow down reading
and writing.
It's on my to-do list to try to get this working with avro-thrift, which
sets the fields directly. That's just to see whether constructing the
records directly is faster, since we currently rely on TProtocol to make
both Thrift and Scrooge objects work. A rough sketch of the difference
is below.
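To make that concrete, here is a minimal, self-contained sketch of the
"buffer the protocol events, then let the record replay them" pattern
next to direct field assignment. To be clear, the class and method names
here are invented for illustration; this is not parquet-thrift's actual
internals, just the shape of the indirection I mean:

    import java.util.ArrayDeque;
    import java.util.Deque;

    public class BufferedEventSketch {

      // Stand-in for the subset of TProtocol calls a generated record makes.
      interface EventSource {
        long readI64();
        String readString();
      }

      // Values are buffered in record order while the columns are assembled...
      static final class ReplayingSource implements EventSource {
        private final Deque<Object> events = new ArrayDeque<>();

        void buffer(Object value) { events.addLast(value); }

        @Override public long readI64() { return (Long) events.removeFirst(); }
        @Override public String readString() { return (String) events.removeFirst(); }
      }

      // ...and the record then reads itself back out of that event list: one
      // extra hop per field compared to setting the fields directly.
      static final class MyRecord {
        long userId;
        String name;

        void read(EventSource in) {
          userId = in.readI64();
          name = in.readString();
        }
      }

      public static void main(String[] args) {
        ReplayingSource source = new ReplayingSource();
        source.buffer(42L);       // really these would come from Parquet column readers
        source.buffer("eugen");

        MyRecord viaEvents = new MyRecord();
        viaEvents.read(source);   // indirect: record pulls values through the event layer

        MyRecord direct = new MyRecord();
        direct.userId = 42L;      // direct: no intermediate event stream
        direct.name = "eugen";

        System.out.println(viaEvents.userId + " " + viaEvents.name);
      }
    }

Even in this toy form, the event layer adds an extra queue operation and
some boxing per field, which is the kind of per-record overhead I'd
expect to add up over 9 million records.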
> I used a small EMR cluster with 2 m3.xlarge core nodes.
> The sampled input has 9 million records, about 1 GB (on S3), with ~20
> fields and some nested structures and maps. I just do a count on it.
> I tried playing with different tuning options, but none seemed to
> really improve things (the picture shows some global metrics for the
> different options).
> I also tried with a larger sample, about a couple of GB (output once
> compressed), but I had similar results.
Could you post the results of `parquet-tools meta`? I'd like to see what
your column layout looks like (the final column chunk sizes).
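For example, something like this against one of the output files should
show the row groups and the per-column chunk sizes (the path is just a
placeholder):

    parquet-tools meta path/to/your-output.parquet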
If your data ends up with only a column or two dominating the row group
and you always select those columns, then you probably wouldn't see an
improvement. You need at least one "big" column chunk that you're ignoring.
Also, what compression did you use for the Parquet files?
> In the end, the only situation I can see where it could perform
> significantly better is when reading a few columns from a dataset that
> has a large number of columns. But since the schemas are hand-written,
> I don't imagine having data structures with hundreds of columns.
I think we'll know more from taking a look at the row groups and column
chunk sizes.
> I am wondering if I am doing something wrong (especially given the
> large difference between plain Thrift and Parquet+Thrift), or if this
> dataset just isn't a good fit for Parquet?
> Thanks!
> Cheers,
> Eugen
rb
--
Ryan Blue
Software Engineer
Cloudera, Inc.