unical1988 commented on PR #650:
URL: https://github.com/apache/incubator-xtable/pull/650#issuecomment-2723169343

   > > Thanks for working on the PR @unical1988, added comments.
   > > There seems to be some confusion about extracting partition values, let 
me know what you think of this.
   > > ```
   > > basePath/ 
   > >                 p1/.. (Can be recursive partitions for parquet files)
   > >                 p2/ ..
   > >                 p3/.. 
   > >                 .hoodie/  (Hudi Metadata)
   > >                 metadata/ (Iceberg metadata) 
   > >                 _delta_log/ (Delta metadata) 
   > > ```
   > > 
   > > To extract the partition fields (emphasis on fields here, not the actual 
values), we can do it in two ways:
   > > 
   > > 1. Assume the table is not partitioned. This would just sync the parquet 
files to the target formats using the physical paths you have extracted in one 
of the classes. When you read those tables, partition pruning won't work.
   > > 2. Ask for user input (from the YAML configuration) naming the partition 
fields from the parquet file schema. Many of these analytical datasets are 
partitioned by date, either through an actual date column in the parquet file or 
through a timestamp field from which the date is extracted.
   > 
   > ```
   > public class InputPartitionColumn {
   >    String fieldName; 
   >    PartitionTransformType transformType;
   > }
   > ```
   > 
   > InputPartitionKeyConfig should be part of Table object in DatasetConfig.  
   > 
   > 1. No transform -> The values for partition keys in the parquet file are 
concatenated and partitionPath is generated.  Configuring this in InternalTable 
object. 
   > 2. Transformation ->  timestamp -> transform(timestamp) -> 
year/date/month/xyz.parquet
   
   I made a slight change to the proposed InputPartitionColumn class in the 
latest commit; please take a look!
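
   For reference, here is a rough sketch of how I read the two cases quoted above (no transform vs. a timestamp transform). It is not the code from the commit: the `TransformType` enum, the `/` separator, the UTC time zone, the date patterns, and the row-as-map representation are all assumptions made for illustration, and the real `PartitionTransformType` may define different values.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class PartitionPathSketch {

  // Hypothetical stand-in for PartitionTransformType; the values are illustrative only.
  enum TransformType { VALUE, YEAR, MONTH, DAY }

  // Mirrors the proposed InputPartitionColumn: a field name plus a transform.
  static final class InputPartitionColumn {
    final String fieldName;
    final TransformType transformType;

    InputPartitionColumn(String fieldName, TransformType transformType) {
      this.fieldName = fieldName;
      this.transformType = transformType;
    }
  }

  // Case 1 (VALUE): use the column value as-is.
  // Case 2 (YEAR/MONTH/DAY): treat the value as epoch milliseconds and format it in UTC.
  static String transform(Object value, TransformType type) {
    if (type == TransformType.VALUE) {
      return String.valueOf(value);
    }
    String pattern;
    switch (type) {
      case YEAR:
        pattern = "yyyy";
        break;
      case MONTH:
        pattern = "yyyy-MM";
        break;
      default:
        pattern = "yyyy-MM-dd";
    }
    Instant ts = Instant.ofEpochMilli(((Number) value).longValue());
    return DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC).format(ts);
  }

  // Concatenates the (possibly transformed) partition values into a partition path.
  // Joining with "/" is an assumption; the thread does not fix the separator.
  static String partitionPath(List<InputPartitionColumn> columns, Map<String, Object> row) {
    List<String> segments = new ArrayList<>();
    for (InputPartitionColumn column : columns) {
      segments.add(transform(row.get(column.fieldName), column.transformType));
    }
    return String.join("/", segments);
  }
}
```

   With columns `region` (no transform) and `event_ts` (day transform), a row whose `event_ts` falls on 2025-03-15 UTC would land under `us/2025-03-15` in this sketch.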


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
