We are considering using Parquet for storing a matrix that is dense and very, very wide (can have more than 600K columns).
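For concreteness, here is a minimal sketch (Scala/Spark) of the read pattern we have in mind; the table path, object name, and column names below are placeholders, not a real dataset:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the access pattern: project a handful of columns out of a
// very wide Parquet table. Path and column names are placeholders.
object WideMatrixRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-matrix-read")
      .getOrCreate()

    // Since Parquet is columnar, a narrow select should only touch the
    // column chunks for these 10-100 columns, not all 600K.
    val projected = spark.read
      .parquet("/data/wide_matrix.parquet")          // placeholder path
      .select("row_id", "col_000123", "col_451789")  // placeholder columns

    projected.show(10)
    spark.stop()
  }
}
```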
I have the following questions:

- Is there a limit on the number of columns in a Parquet file? We expect to query 10-100 columns at a time using Spark (roughly the access pattern sketched above); what are the performance implications in this scenario?
- We want a schema-less solution, since the matrix can get wider over time. Is there a way to generate such wide, structured, schema-less Parquet files using MapReduce (the input files are in a custom binary format)? See the P.S. below for the kind of schema evolution we have in mind.
- HBase can support millions of columns. Does anyone have prior experience comparing Parquet vs. HFile performance for wide structured tables?
- Does Impala have support for evolving schemas?

Krishna
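P.S. On the schema question, here is a minimal sketch of the evolution we are hoping is workable, assuming new batches of columns land as additional Parquet files under one base path and readers merge the per-file schemas (paths are again placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: older files carry the original columns, newer files add more.
// With schema merging enabled, Spark's Parquet reader reconciles them into
// one wide schema; rows from older files read back null for new columns.
object WideMatrixSchemaMerge {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("wide-matrix-schema-merge")
      .getOrCreate()

    val matrix = spark.read
      .option("mergeSchema", "true")   // merge per-file Parquet schemas
      .parquet("/data/wide_matrix/")   // placeholder base path

    matrix.printSchema()
    spark.stop()
  }
}
```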
