[
https://issues.apache.org/jira/browse/PARQUET-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tianshuo Deng updated PARQUET-343:
----------------------------------
Description:
For a really wide schema with sparse data, an empty group node can have a huge
number of leaves underneath it. Calling writeNull for each leaf every time its
ancestor group node is null is inefficient and is bad for data locality in
memory, especially when the number of leaves is huge.
Instead, nulls can be cached on the group node. Flushing is triggered only when
a group node changes from null to non-null. This way, all the cached null values
are flushed to the leaf nodes in a tight loop, which improves write
performance.
We tested this approach combined with PARQUET-341 on a really large schema, and
it gave us a ~2X improvement in write performance.
was:
For a really wide schema with sparse data, an empty group node can have a huge
number of leaves. Calling write null for each leaf when its ancestor group
node is null is inefficient and is bad for data locality in memory,
especially when there is a huge number of leaves under a group node.
Instead, nulls can be cached on the group node. Flushing is triggered only when
a group node changes from null to non-null. This way, all the cached null values
are flushed to the leaf nodes in a tight loop, which improves performance.
We tested this approach combined with PARQUET-341 on a really large schema, and
it gave us a ~2X improvement in write performance.
> Caching nulls on group node to improve write performance on wide schema
> sparse data
> -----------------------------------------------------------------------------------
>
> Key: PARQUET-343
> URL: https://issues.apache.org/jira/browse/PARQUET-343
> Project: Parquet
> Issue Type: Improvement
> Reporter: Tianshuo Deng
> Assignee: Tianshuo Deng
>
> For a really wide schema with sparse data, an empty group node can have a
> huge number of leaves underneath it. Calling writeNull for each leaf every
> time its ancestor group node is null is inefficient and is bad for data
> locality in memory, especially when the number of leaves is huge.
> Instead, nulls can be cached on the group node. Flushing is triggered only
> when a group node changes from null to non-null. This way, all the cached
> null values are flushed to the leaf nodes in a tight loop, which improves
> write performance.
> We tested this approach combined with PARQUET-341 on a really large schema,
> and it gave us a ~2X improvement in write performance.
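
The caching idea in the description can be illustrated with a minimal sketch.
This is not the Parquet implementation: Leaf, GroupNode, writeNullGroup,
writeGroup and close are hypothetical names invented for the example, and a
plain counter stands in for whatever state the real writer caches. It also
assumes that any nulls still cached at the end of writing must be flushed,
which the description does not spell out.

```java
import java.util.ArrayList;
import java.util.List;

public class NullCachingSketch {
    /** Hypothetical stand-in for a leaf column writer; it only counts calls. */
    static final class Leaf {
        int nullsWritten = 0;
        int valuesWritten = 0;
        void writeNull() { nullsWritten++; }
        void writeValue(int value) { valuesWritten++; }
    }

    /** Hypothetical group node that defers nulls instead of visiting every leaf per record. */
    static final class GroupNode {
        final List<Leaf> leaves = new ArrayList<>();
        int cachedNulls = 0; // null records seen since the last flush

        GroupNode(int leafCount) {
            for (int i = 0; i < leafCount; i++) {
                leaves.add(new Leaf());
            }
        }

        /** The group is null for this record: O(1), no leaf is touched. */
        void writeNullGroup() {
            cachedNulls++;
        }

        /** The group is non-null: flush the cached nulls first, then write the values. */
        void writeGroup(int[] values) {
            if (values.length != leaves.size()) {
                throw new IllegalArgumentException("expected one value per leaf");
            }
            flushCachedNulls();
            for (int i = 0; i < values.length; i++) {
                leaves.get(i).writeValue(values[i]);
            }
        }

        /** Assumption: leftover cached nulls are flushed when writing ends. */
        void close() {
            flushCachedNulls();
        }

        /** Replays all cached nulls leaf by leaf, in one tight loop per leaf. */
        private void flushCachedNulls() {
            if (cachedNulls == 0) {
                return;
            }
            for (Leaf leaf : leaves) {
                for (int i = 0; i < cachedNulls; i++) {
                    leaf.writeNull();
                }
            }
            cachedNulls = 0;
        }
    }

    public static void main(String[] args) {
        GroupNode group = new GroupNode(1000);
        for (int i = 0; i < 500; i++) {
            group.writeNullGroup(); // 500 null records, leaves untouched so far
        }
        group.writeGroup(new int[1000]); // first non-null record triggers the flush
        group.close();
        System.out.println("nulls per leaf: " + group.leaves.get(0).nullsWritten);
    }
}
```

In this sketch each leaf still receives the same number of nulls as it would
without caching; what changes is that the per-leaf work is batched when the
group turns non-null rather than interleaved across all leaves on every null
record.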