[ https://issues.apache.org/jira/browse/PARQUET-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tianshuo Deng updated PARQUET-343:
----------------------------------
Summary: Caching nulls on group node to improve write performance on wide
schema sparse data (was: Caching nulls on group node to improve performance on
wide schema sparse data)
> Caching nulls on group node to improve write performance on wide schema
> sparse data
> -----------------------------------------------------------------------------------
>
> Key: PARQUET-343
> URL: https://issues.apache.org/jira/browse/PARQUET-343
> Project: Parquet
> Issue Type: Improvement
> Reporter: Tianshuo Deng
> Assignee: Tianshuo Deng
>
> For a really wide schema with sparse data, an empty group node can have a
> huge number of leaves. Writing a null to each leaf whenever its ancestor
> group node is null is inefficient and bad for memory locality, especially
> when there are a huge number of leaves under the group node.
> Instead, nulls can be cached on the group node. Flushing is triggered only
> when a group node goes from null to non-null. This way, all the cached null
> values are flushed to the leaf nodes in a tight loop, which improves
> performance.
> We tested this approach combined with PARQUET-341 on a really large schema,
> and it gave us a ~2x improvement in write performance.
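
To illustrate the idea in the description, here is a minimal sketch of a group node that counts pending nulls and replays them to its leaves in one tight loop. It is not the parquet-mr implementation: NullCacheSketch, LeafWriter, GroupWriter and all of their methods are hypothetical names, repetition and definition levels are ignored, and the explicit flush at the end is an assumption (the description only mentions flushing when the group becomes non-null).

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of null caching on a group node; not parquet-mr code. */
public class NullCacheSketch {

    /** Stand-in for a leaf column writer: stores one entry per record. */
    static final class LeafWriter {
        final List<Integer> values = new ArrayList<>(); // a null entry is a null value

        void writeNull() { values.add(null); }
        void writeValue(int v) { values.add(v); }
    }

    /** Group node that defers null writes instead of touching every leaf per record. */
    static final class GroupWriter {
        private final LeafWriter[] leaves;
        private int pendingNulls = 0; // nulls cached on the group, not yet on the leaves

        GroupWriter(int leafCount) {
            leaves = new LeafWriter[leafCount];
            for (int i = 0; i < leafCount; i++) {
                leaves[i] = new LeafWriter();
            }
        }

        /** Null group: O(1), no leaf is touched. */
        void writeNullGroup() { pendingNulls++; }

        /** Non-null group: replay cached nulls first, then write the real values. */
        void writeGroup(int[] row) {
            flushNulls();
            for (int i = 0; i < leaves.length; i++) {
                leaves[i].writeValue(row[i]);
            }
        }

        /** Replays cached nulls leaf by leaf, so each leaf is filled in a tight loop. */
        void flushNulls() {
            if (pendingNulls == 0) {
                return;
            }
            for (LeafWriter leaf : leaves) {
                for (int i = 0; i < pendingNulls; i++) {
                    leaf.writeNull();
                }
            }
            pendingNulls = 0;
        }

        LeafWriter leaf(int i) { return leaves[i]; }
    }

    public static void main(String[] args) {
        GroupWriter group = new GroupWriter(3);
        group.writeNullGroup();                 // cached
        group.writeNullGroup();                 // cached
        group.writeGroup(new int[] {1, 2, 3});  // null -> non-null triggers the flush
        group.writeNullGroup();                 // cached again
        group.flushNulls();                     // assumed final flush before the data is read
        System.out.println(group.leaf(0).values); // prints [null, null, 1, null]
    }
}
```

In this sketch a null group costs one counter increment regardless of how many leaves sit under it; the per-leaf work is paid only when the group next becomes non-null, and then each leaf receives its nulls contiguously.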
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)