We are considering using Parquet to store a matrix that is dense and very
wide (it can have more than 600K columns).

I have the following questions:

   - Is there a limit on the number of columns in a Parquet file? We expect
   to query 10-100 columns at a time using Spark - what are the performance
   implications in this scenario? (A sketch of the access pattern we have in
   mind is included after this list.)
   - We want a schema-less solution, since the matrix can get wider over
   time (see the schema-merge sketch below).
   - Is there a way to generate such wide, structured, schema-less Parquet
   files using MapReduce (the input files are in a custom binary format)?
   - HBase can support millions of columns - does anyone have prior
   experience comparing Parquet vs. HFile performance for wide structured
   tables?
   - Does Impala support schema evolution?
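
To make the Spark access pattern concrete, here is a minimal sketch of the
kind of column-pruned read we have in mind; the path and the column-naming
scheme (c000000, c000001, ...) are made up for illustration. Because Parquet
is columnar, selecting only the 10-100 columns we need should let Spark skip
the other ~600K columns on disk:

    import org.apache.spark.sql.SparkSession

    object WideParquetRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WideParquetColumnPruning")
          .getOrCreate()

        // Hypothetical input path and column-naming scheme (c000000 .. c599999).
        val df = spark.read.parquet("/data/matrix.parquet")

        // Select only the columns needed for this query; Parquet's columnar
        // layout lets Spark avoid reading the remaining columns from disk.
        val wanted = (0 until 100).map(i => f"c$i%06d")
        val slice = df.select(wanted.map(df(_)): _*)

        slice.show(5)
        spark.stop()
      }
    }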
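
On the schema question, my understanding is that Spark can merge the Parquet
footers of files written with different (growing) column sets via the
mergeSchema option; a rough sketch, assuming each batch of data lands in its
own subdirectory under a hypothetical /data/matrix path, is:

    import org.apache.spark.sql.SparkSession

    object EvolvingSchemaRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("ParquetSchemaMerge")
          .getOrCreate()

        // Hypothetical layout: each batch is written to its own subdirectory,
        // and newer batches may carry extra columns. With mergeSchema, Spark
        // unions the per-file schemas; columns absent from older files read
        // back as nulls.
        val merged = spark.read
          .option("mergeSchema", "true")
          .parquet("/data/matrix")

        merged.printSchema()
        spark.stop()
      }
    }

Whether Impala can handle the same kind of evolving layout is what the last
question above is asking.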

Krishna
