Hi there,

I was testing Parquet with Thrift to see whether there would be an interesting performance gain compared to plain Thrift. But in my test I found that plain Thrift with LZO compression was actually faster.
I used a small EMR cluster with 2 m3.xlarge core nodes. The sampled input has 9 million records, about 1 GB on S3, with ~20 fields and some nested structures and maps. I just do a count on it. I tried playing with different tuning options, but none seemed to really improve things (the attached picture shows some global metrics for the different options). I also tried a larger sample, about a couple of GB once compressed, but got similar results.

In the end, the only situation where I can see Parquet performing significantly better is when reading a few columns from a dataset that has a large number of columns. But since the schemas are hand-written, I don't expect to have data structures with hundreds of columns. Am I doing something wrong (especially given the large difference between plain Thrift and Parquet+Thrift), or is this dataset just not a good fit for Parquet?

Thanks!

Cheers,
Eugen

[image: Inline image 1]
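P.S. For reference, here is a minimal sketch of the kind of projected read I have in mind, i.e. asking parquet-thrift to materialize only a couple of leaf columns during the count. The job layout, the generated Thrift class MyThriftRecord and the column paths are placeholders of mine, and the "parquet.thrift.column.filter" key is the one I remember from the 1.x ThriftReadSupport (newer releases seem to use a strict "parquet.thrift.column.projection.globs" instead, and the input format lived under parquet.hadoop.thrift in the com.twitter artifacts before the move to Apache), so please correct me if that is not how projection is meant to be configured:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;
import org.apache.parquet.hadoop.thrift.ParquetThriftInputFormat;

public class ProjectedCount {

  // MyThriftRecord is a placeholder for the Thrift-generated record class.
  // The mapper only bumps a counter; nothing is emitted downstream.
  public static class CountMapper extends Mapper<Void, MyThriftRecord, Void, Void> {
    @Override
    protected void map(Void key, MyThriftRecord value, Context ctx) {
      ctx.getCounter("bench", "records").increment(1);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Ask parquet-thrift to read only two leaf columns instead of all ~20.
    // Key name and path syntax taken from the 1.x ThriftReadSupport; to be
    // double-checked against the version actually on the cluster.
    conf.set("parquet.thrift.column.filter", "id;user/name");

    Job job = Job.getInstance(conf, "parquet-thrift projected count");
    job.setJarByClass(ProjectedCount.class);
    job.setMapperClass(CountMapper.class);
    job.setNumReduceTasks(0);
    job.setInputFormatClass(ParquetThriftInputFormat.class);
    job.setOutputFormatClass(NullOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}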