Hi Owen,
This will be a very useful statistic for resource reservation.
A couple of obvious suggestions (to make sure they sound reasonable):
- make this statistic optional or re. Only list and array data types
really need it;
- store the statistic in each stripe footer (stripe-level max instances
per 1024 rows) and file footer (file-level max instances per 1024 rows).
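To make the second suggestion concrete, here is a rough sketch in Python of how the two levels could be derived from per-row-group counts. This is not ORC's or Hive's actual writer API. The function names and the input layout (one list of row-group counts per stripe) are assumptions made only for illustration.

```python
from typing import List, Tuple


def roll_up_max_instances(
    row_group_counts_per_stripe: List[List[int]],
) -> Tuple[List[int], int]:
    """Hypothetical helper: roll per-row-group instance counts up to
    stripe-level and file-level maxima.

    row_group_counts_per_stripe[s][g] is the number of child instances
    in row group g (up to 1024 rows) of stripe s.
    """
    # Stripe-level value: the largest row group seen in that stripe.
    stripe_maxes = [max(counts, default=0) for counts in row_group_counts_per_stripe]
    # File-level value: the largest stripe-level value in the file.
    file_max = max(stripe_maxes, default=0)
    return stripe_maxes, file_max
```

With this layout, a reader that only consults the file footer gets a safe upper bound for the whole file, while one that consults each stripe footer can allocate more tightly per stripe.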
Since ORC files are written primarily with Hive now, how soon can this
statistic be added to Hive's ORC writer?
Thank you,
Aliaksei.
On 09/24/2015 04:40 PM, Owen O'Malley wrote:
All,
While thinking about resource management for vectorized ORC
readers, I found that one of the difficult points is figuring out how big the
vectors for the nested types need to be. I'd like to propose that we add a
statistic for each column that records the maximum number of instances we
need for each vector row group of 1024 rows.
Having that number would let you size the vector row batch for the complex
types as you start each stripe, and would also let you predict how
much memory the reader will need.
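As a rough illustration of the statistic itself, here is a minimal sketch in Python. It assumes a list column described only by the length of the list in each row. The function name and input shape are hypothetical and are not part of any existing ORC API.

```python
from typing import Iterable

ROW_GROUP_SIZE = 1024  # rows per vector row group, as in the proposal


def max_instances_per_row_group(
    list_lengths: Iterable[int], row_group_size: int = ROW_GROUP_SIZE
) -> int:
    """Return the largest total number of child instances found in any
    group of `row_group_size` consecutive rows.

    list_lengths yields the (non-negative) number of elements in each row.
    """
    max_instances = 0
    in_group = 0
    for row, length in enumerate(list_lengths):
        if row % row_group_size == 0:
            in_group = 0  # a new row group starts here
        in_group += length
        # Lengths are non-negative, so the running total peaks at the
        # end of each group; tracking it every row handles a short last group.
        max_instances = max(max_instances, in_group)
    return max_instances
```

A reader could then allocate the child vector for that column with this many slots once, instead of growing it while reading.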
Thoughts?
Owen