Tianshuo Deng created PARQUET-343:
-------------------------------------

             Summary: Caching nulls on group node to improve performance on 
wide schema sparse data
                 Key: PARQUET-343
                 URL: https://issues.apache.org/jira/browse/PARQUET-343
             Project: Parquet
          Issue Type: Improvement
            Reporter: Tianshuo Deng
            Assignee: Tianshuo Deng


For a really wide schema with sparse data, an empty group node can have a huge 
number of leaves. Calling write null on each leaf whenever its ancestor group 
node is null is inefficient and bad for memory locality, especially when 
there is a huge number of leaves under the group node.

Instead, nulls can be cached on the group node. Flushing is triggered only when 
a group node goes from null to non-null. This way, all the cached null values 
are flushed to the leaf nodes in a tight loop, which improves performance.
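
A minimal sketch of the idea, for illustration only. It is written in Python
rather than against the actual Parquet writer classes, and every name in it
(LeafWriter, GroupNode, write_null, start_non_null, flush_nulls) is
hypothetical. Repetition and definition levels are modeled as plain tuples.
The sketch also assumes that pending nulls must be flushed before the leaves
are written out, which is what the close() method stands in for.

```python
class LeafWriter:
    """Hypothetical stand-in for a leaf column writer."""

    def __init__(self, name):
        self.name = name
        self.levels = []  # (repetition_level, definition_level) of each null

    def write_nulls(self, levels):
        # Append a whole batch of nulls to this leaf in one call.
        self.levels.extend(levels)


class GroupNode:
    """Hypothetical group node that caches nulls instead of forwarding them."""

    def __init__(self, leaves):
        self.leaves = leaves
        self.pending_nulls = []  # nulls seen since the last flush

    def write_null(self, rep_level, def_level):
        # Record the null on the group only; no leaf is touched here.
        self.pending_nulls.append((rep_level, def_level))

    def flush_nulls(self):
        # Push all cached nulls to every leaf in one tight loop.
        if not self.pending_nulls:
            return
        for leaf in self.leaves:
            leaf.write_nulls(self.pending_nulls)
        self.pending_nulls = []

    def start_non_null(self):
        # The group goes from null to non-null: flush first so each leaf
        # sees its nulls before any real value for the current record.
        self.flush_nulls()

    def close(self):
        # Assumption: pending nulls are flushed before the data is written.
        self.flush_nulls()


leaves = [LeafWriter("col%d" % i) for i in range(1000)]
group = GroupNode(leaves)
for _ in range(3):
    group.write_null(0, 0)   # three null records, leaves untouched so far
group.start_non_null()       # one pass over the leaves flushes all three
```

With this layout a null record costs one append on the group node, and the
per-leaf work happens once per flush instead of once per null record.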

We tested this approach combined with PARQUET-341 on a really large schema, and 
it gave us a ~2X improvement in write performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
