[ 
https://issues.apache.org/jira/browse/PARQUET-1987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17288046#comment-17288046
 ] 

Carlos Diogo commented on PARQUET-1987:
---------------------------------------

Looking at the mentioned discussions, it seems that the ability to store column 
groups in different files, the same way row groups are stored in different 
files, could have wide application.

For me, the use case is pretty simple. If I have 1000 columns in a Parquet 
schema, split them into files of 100 columns each, and then need to "update" a 
single column, I only need to rewrite one file holding 1/10 of the data. 
Furthermore, this would not have any impact on column compression or page stats.
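
As a rough illustration of the arithmetic above, the following sketch partitions 
1000 column names into groups of 100 and works out which group file would need 
rewriting when one column changes. It is only a sketch: the function names 
`split_into_column_groups` and `files_to_rewrite` and the `col_N` naming are 
hypothetical, nothing here is a Parquet API, and the 1/10 figure assumes all 
columns are of similar size.

```python
# Hypothetical sketch: names and layout are illustrative, not a Parquet API.

def split_into_column_groups(columns, group_size):
    """Partition column names into consecutive groups, one group per file."""
    return [columns[i:i + group_size] for i in range(0, len(columns), group_size)]


def files_to_rewrite(groups, updated_columns):
    """Return indices of the column-group files containing an updated column."""
    updated = set(updated_columns)
    return [idx for idx, group in enumerate(groups) if updated.intersection(group)]


columns = [f"col_{i}" for i in range(1000)]
groups = split_into_column_groups(columns, 100)

# Updating a single column touches only the one file that holds it.
touched = files_to_rewrite(groups, ["col_457"])

# Fraction of columns rewritten, assuming all columns are of similar size.
fraction = sum(len(groups[i]) for i in touched) / len(columns)
print(len(groups), touched, fraction)
```

With a single-file layout, the same update would mean rewriting all 1000 
columns.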

This is a critical use case (slowly changing dimensions with low cardinality) 
that none of Delta, Iceberg, or Hudi addresses, as all of them require a full 
row to be rewritten. 

Nevertheless, the only remaining question is whether the statement on the 
Parquet home page is accurate, as I assume that the functionality is far from 
being completed. 

> Document how a schema can have columns splitted over different files 
> ---------------------------------------------------------------------
>
>                 Key: PARQUET-1987
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1987
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Carlos Diogo
>            Priority: Minor
>
> The design overview states that the format supports splitting the columns of 
> a given schema over several files. 
> To date, other than the reference on the Parquet homepage, there are no other 
> references to this feature, not even to whether it is actually implemented. 
> As it currently stands, I believe that for a given row group you cannot have 
> some columns in one file and another column group in another file belonging 
> to the same schema.
> If this feature were actually implemented, it would open the door to faster 
> data updates that affect a single column in very wide tables, which is 
> typical of big data use cases.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
