Hey Ryan,

2015-04-03 18:00 GMT+02:00 Ryan Blue <b...@cloudera.com>:

> On 04/02/2015 07:38 AM, Eugen Cepoi wrote:
>
>> Hi there,
>>
>> I was testing parquet with thrift to see if there would be an
>> interesting performance gain compared to using just thrift. But in my
>> test I found that just using plain thrift with lzo compression was faster.
>>
>
> This doesn't surprise me too much because of how the Thrift object model
> works. (At least, assuming I understand it right. Feel free to correct me.)
>
> Thrift wants to read and write using the TProtocol, which provides a layer
> like Parquet's Converters that is an intermediary between the object model
> and underlying encodings. Parquet implements TProtocol by building a list
> of the method calls a record will make to read or write itself, then
> allowing the record to read that list. I think this has the potential to
> slow down reading and writing.
>

> It's on my todo list to try to get this working using avro-thrift, which
> sets the fields directly.



Yes, the double "ser/de" overhead makes sense to me, but I was not expecting
such a big difference.
I haven't read the conversion code yet, but with Thrift we can set the fields
directly, at least if by that you mean setting them without reflection.
Basically one can create an "empty" instance via the default constructor (or
via reflection) and then call setFieldValue with the corresponding _Fields
enum constant and the value. Those instances can even be reused.
I think this would perform better than going through avro-thrift, which adds
another layer. If you can point me to the code of interest I can maybe be of
some help :)

Does the impl based on avro perform much better?



> That's just to see if it might be faster constructing the records
> directly, since we rely on TProtocol to make both thrift and scrooge
> objects work.
>
>> I used a small EMR cluster with 2 m3.xlarge cores.
>> The sampled input has 9 million records about 1g (on S3) with ~20 fields
>> and some nested structures and maps. I just do a count on it.
>> I tried playing with different tuning options but none seemed to really
>> improve things (the pic shows some global metrics for the different
>> options).
>>
>> I also tried with a larger sample about a couple of gigs (output once
>> compressed), but I had similar results.
>>
>
> Could you post the results of `parquet-tools meta`? I'd like to see what
> your column layout looks like (the final column chunk sizes).
>
> If your data ends up with only a column or two dominating the row group
> and you always select those columns, then you probably wouldn't see an
> improvement. You need at least one "big" column chunk that you're ignoring.
>
>
I'll provide those shortly. BTW I got some warnings saying it couldn't skip
row groups because of the predicates (or something along those lines); I'll
try to include those as well.


> Also, what compression did you use for the Parquet files?
>

LZO; it is also the codec I am using for the raw Thrift data.
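
Just to be precise about what I mean on the Parquet side, roughly this (a
sketch using the parquet-hadoop MapReduce helpers, not my exact job code;
MyRecord is a placeholder and the package names / thrift-class helper can
differ between Parquet versions):

import org.apache.hadoop.mapreduce.Job;
import parquet.hadoop.ParquetOutputFormat;
import parquet.hadoop.metadata.CompressionCodecName;
import parquet.hadoop.thrift.ParquetThriftOutputFormat;

Job job = Job.getInstance();
job.setOutputFormatClass(ParquetThriftOutputFormat.class);
// register the Thrift class for the write support (helper may vary by version)
ParquetThriftOutputFormat.setThriftClass(job, MyRecord.class);
// same codec as for the raw thrift files, to keep the comparison fair
ParquetOutputFormat.setCompression(job, CompressionCodecName.LZO);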

Thank you!
Eugen



>
>> In the end the only situation I can see where it can perform
>> significantly better is when reading few columns from a dataset that has
>> a large number of columns. But as the schemas are hand written I don't
>> imagine having data structures with hundreds of columns.
>>
>
> I think we'll know more from taking a look at the row groups and column
> chunk sizes.
>
>
>> I am wondering if I am doing something wrong (esp. due to the large
>> difference between plain thrift and parquet+thrift) or if the used
>> dataset isn't a good fit for parquet?
>>
>> Thanks!
>>
>>
>> Cheers,
>> Eugen
>>
>
> rb
>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>
