Hi there,

I was testing Parquet with Thrift to see whether it would give an interesting
performance gain compared to using plain Thrift. But in my tests I found that
plain Thrift with LZO compression was actually faster.


I used a small EMR cluster with 2 m3.xlarge core nodes.
The sampled input has 9 million records, about 1 GB (on S3), with ~20 fields
and some nested structures and maps. I just do a count on it (roughly what
the sketch below shows).
I tried playing with different tuning options, but none of them seemed to
really improve things (the screenshot below shows some global metrics for
the different options).

I also tried with a larger sample, about a couple of gigabytes (output size
once compressed), but I got similar results.


In the end, the only situation where I can see it performing significantly
better is when reading a few columns from a dataset that has a large number
of columns. But since the schemas are hand-written, I don't expect to have
data structures with hundreds of columns.
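
That kind of read is where the columnar layout should really pay off, since
Parquet only scans the projected column chunks. Something like this
(continuing the sketch above, same SparkSession; the column names are
made up):

// Read only two of the ~20 fields; the filter forces the column to
// actually be scanned instead of the count being answered from metadata.
val narrow = spark.read
  .parquet("s3://my-bucket/sample/parquet/")
  .select("user_id", "event_time")
  .where("user_id IS NOT NULL")
  .count()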


I am wondering if I am doing something wrong (especially given the large
difference between plain Thrift and Parquet+Thrift), or if this dataset
simply isn't a good fit for Parquet?

Thanks!


Cheers,
Eugen


[image: inline image 1]
