[ 
https://issues.apache.org/jira/browse/HUDI-4245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin updated HUDI-4245:
----------------------------------
    Description: 
Currently only root-level fields are supported in the Column Stats Index. 
There is no reason we cannot support nested fields as well, given that 
columnar file formats store nested fields as _nested columns_, i.e. as columns 
named by the field together with the struct that contains it. 

 

For example, the following schema: 
{code:java}
c1: StringType
c2: StructType(Seq(StructField("foo", StringType)))
{code}
would be stored in Parquet as "c1: string", "c2.foo: string". This means that 
Parquet already collects statistics for all the nested fields, and we 
just need to make sure we propagate them into the Column Stats Index.
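
As a rough illustration of the flattening described above, the sketch below walks a nested schema and emits one dotted path per leaf field. It is not Hudi or Parquet code: the class `NestedColumnPaths`, the method `leafColumns`, and the Map-based schema model are hypothetical and exist only to show how "c2.foo" arises from the example schema.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NestedColumnPaths {

    // Hypothetical schema model (an assumption for this sketch): each field maps
    // either to a primitive type name (String) or to a nested struct (Map).
    public static List<String> leafColumns(Map<String, Object> struct) {
        List<String> out = new ArrayList<>();
        collect("", struct, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void collect(String prefix, Map<String, Object> struct, List<String> out) {
        for (Map.Entry<String, Object> field : struct.entrySet()) {
            // Build the dotted path: parent struct path, then the field name.
            String path = prefix.isEmpty() ? field.getKey() : prefix + "." + field.getKey();
            if (field.getValue() instanceof Map) {
                // Struct: recurse, since only leaf fields become columns.
                collect(path, (Map<String, Object>) field.getValue(), out);
            } else {
                // Leaf: one column, which is where per-column statistics would live.
                out.add(path + ": " + field.getValue());
            }
        }
    }

    public static void main(String[] args) {
        Map<String, Object> c2 = new LinkedHashMap<>();
        c2.put("foo", "string");
        Map<String, Object> schema = new LinkedHashMap<>();
        schema.put("c1", "string");
        schema.put("c2", c2);
        System.out.println(leafColumns(schema)); // prints [c1: string, c2.foo: string]
    }
}
```

Under this model, supporting nested fields in the index amounts to keying statistics by the full dotted path rather than by the root-level field name alone.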

 

Original GH issue:

[https://github.com/apache/hudi/issues/5804#issuecomment-1152983029]

> Support nested fields in Column Stats Index
> -------------------------------------------
>
>                 Key: HUDI-4245
>                 URL: https://issues.apache.org/jira/browse/HUDI-4245
>             Project: Apache Hudi
>          Issue Type: Bug
>            Reporter: Alexey Kudinkin
>            Priority: Major
>             Fix For: 0.12.0



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
