Hi, I've just started trying out Parquet and have run into a performance issue. I was using the Avro support to work with a test schema, following the 'standalone' approach from here: http://blog.cloudera.com/blog/2014/05/how-to-convert-existing-data-into-parquet/
I took an existing Avro schema, consisting of a few columns each containing a map, and wrote, then read back, about 40MB of data using both Avro's own serialisation and Parquet's. Parquet ended up being about five times slower, and the ratio held when I moved up to ~1GB of data. I'd expect it to be somewhat slower, since I was reading back all columns, but five times seems high. Is there anything simple I might be missing?

Thanks,
Rob
