[jira] [Resolved] (PARQUET-343) Caching nulls on group node to improve write performance on wide schema sparse data

Ryan Blue (JIRA) Fri, 20 Nov 2015 16:25:54 -0800

     [ 
https://issues.apache.org/jira/browse/PARQUET-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ryan Blue resolved PARQUET-343.
-------------------------------
    Resolution: Fixed

This was already fixed in #249.

> Caching nulls on group node to improve write performance on wide schema 
> sparse data
> -----------------------------------------------------------------------------------
>
>                 Key: PARQUET-343
>                 URL: https://issues.apache.org/jira/browse/PARQUET-343
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Tianshuo Deng
>            Assignee: Tianshuo Deng
>
> For a really wide schema with sparse data, an empty group node can have a 
> huge number of leaves underneath it. Calling writeNull for each leaf every 
> time its ancestor group node is null is inefficient and bad for memory 
> locality, especially when the number of leaves is huge.
> Instead, nulls can be cached on the group node. Flushing is triggered only 
> when a group node goes from null to non-null. All the cached null values 
> are then flushed to the leaf nodes in a tight loop, which improves write 
> performance.
> We tested this approach combined with PARQUET-341 on a really large schema, 
> and it gave us a ~2X improvement in write performance.
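
The caching idea described in the issue can be illustrated with a small toy
model. This is only a sketch under stated assumptions, not the code from #249:
the names Leaf, Group, pending_nulls and flush_nulls are hypothetical, and the
repetition and definition levels that a real Parquet column writer tracks are
left out entirely.

```python
class Leaf:
    """Hypothetical leaf column: stores one entry per record (None = null)."""

    def __init__(self, name):
        self.name = name
        self.values = []

    def write(self, value):
        self.values.append(value)

    def write_nulls(self, count):
        # All cached nulls for this one column are appended together,
        # so the work for each leaf happens in a single tight step.
        self.values.extend([None] * count)


class Group:
    """Hypothetical group node that caches nulls instead of forwarding them."""

    def __init__(self, leaves):
        self.leaves = leaves
        self.pending_nulls = 0  # null records not yet pushed to the leaves

    def write_null(self):
        # Constant-time: no leaf is touched while the group stays null.
        self.pending_nulls += 1

    def flush_nulls(self):
        if self.pending_nulls:
            for leaf in self.leaves:
                leaf.write_nulls(self.pending_nulls)
            self.pending_nulls = 0

    def write(self, record):
        # The group goes from null to non-null: flush the cached nulls first,
        # then write the record (fields missing from the dict become null).
        self.flush_nulls()
        for leaf in self.leaves:
            leaf.write(record.get(leaf.name))


# Two null records followed by one record that sets only column "a".
group = Group([Leaf("a"), Leaf("b")])
group.write_null()
group.write_null()
group.write({"a": 1})
```

In this sketch the two null records cost two counter increments rather than
one call per leaf per record. A real writer would presumably also need to
flush any remaining cached nulls when the data is finalized; that step is an
assumption here and is not described in the issue.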



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
