[jira] [Updated] (PARQUET-343) Caching nulls on group node to improve write performance on wide schema sparse data

Tianshuo Deng (JIRA) Fri, 24 Jul 2015 16:35:47 -0700

     [ 
https://issues.apache.org/jira/browse/PARQUET-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Tianshuo Deng updated PARQUET-343:
----------------------------------
    Description: 
For really wide schemas with sparse data, an empty group node can have a huge 
number of leaves underneath it. Calling writeNull for each leaf every time its 
ancestor group node is null is inefficient and hurts data locality in memory, 
especially when the number of leaves is huge.

Instead, nulls can be cached on the group node. Flushing is triggered only when 
a group node goes from null to non-null. This way, all the cached null values 
are flushed to the leaf nodes in a tight loop, which improves write 
performance.
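
As a rough illustration of the idea above, here is a minimal Java sketch. It is 
not the parquet-mr implementation: NullCachingSketch, LeafWriter, GroupNode and 
all of their methods are hypothetical names made up for this example, and 
repetition/definition levels are left out entirely. The explicit 
flushCachedNulls() entry point is also an assumption, added so that pending 
nulls can be written out at the end of a write.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of null caching on a group node; not parquet-mr code. */
public class NullCachingSketch {

    /** Stand-in for a leaf column writer; it only records what it receives. */
    static final class LeafWriter {
        final List<String> values = new ArrayList<>(); // a null entry means "null written"

        void writeNull() { values.add(null); }

        void write(String v) { values.add(v); }
    }

    /** Stand-in for a group node that defers nulls for every leaf below it. */
    static final class GroupNode {
        private final LeafWriter[] leaves;
        private int cachedNulls = 0;

        GroupNode(LeafWriter... leaves) { this.leaves = leaves; }

        /** The group is null in this record: count it, touch no leaf. */
        void writeNullGroup() { cachedNulls++; }

        /** The group goes from null to non-null: flush before real values arrive. */
        void startNonNullGroup() { flushCachedNulls(); }

        /** Write all pending nulls leaf by leaf, in a tight loop per leaf. */
        void flushCachedNulls() {
            if (cachedNulls == 0) {
                return;
            }
            for (LeafWriter leaf : leaves) {
                for (int i = 0; i < cachedNulls; i++) {
                    leaf.writeNull();
                }
            }
            cachedNulls = 0;
        }

        int pendingNulls() { return cachedNulls; }
    }

    public static void main(String[] args) {
        LeafWriter a = new LeafWriter();
        LeafWriter b = new LeafWriter();
        GroupNode group = new GroupNode(a, b);

        // Two records in which the whole group is null: nothing reaches the leaves yet.
        group.writeNullGroup();
        group.writeNullGroup();
        System.out.println("pending=" + group.pendingNulls() + " leafA=" + a.values.size());

        // Third record has a non-null group: cached nulls are flushed first.
        group.startNonNullGroup();
        a.write("x");
        b.write("y");
        System.out.println("pending=" + group.pendingNulls() + " leafA=" + a.values.size());
    }
}
```

The point of the design is that a null group costs one counter increment 
instead of one call per leaf, and the deferred calls later hit each leaf 
consecutively rather than interleaved across all leaves.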

We tested this approach combined with PARQUET-341 on a really large schema, and 
it gave us a ~2x improvement in write performance.

  was:
For really wide schemas with sparse data, an empty group node could have a huge 
number of leaves. Calling write null for each leaf when its ancestor group node 
is null is inefficient and is bad for data locality in memory, especially when 
there is a huge number of leaves under a group node.

Instead, nulls can be cached on the group node. Flushing is triggered only when 
a group node goes from null to non-null. This way, all the cached null values 
are flushed to the leaf nodes in a tight loop, which improves performance.

We tested this approach combined with PARQUET-341 on a really large schema, and 
it gave us a ~2x improvement in write performance.


> Caching nulls on group node to improve write performance on wide schema 
> sparse data
> -----------------------------------------------------------------------------------
>
>                 Key: PARQUET-343
>                 URL: https://issues.apache.org/jira/browse/PARQUET-343
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Tianshuo Deng
>            Assignee: Tianshuo Deng


