manishmalhotrawork commented on issue #480: Don't group files with different 
partition specs in BaseRewriteManifests
URL: 
https://github.com/apache/incubator-iceberg/issues/480#issuecomment-552770181
 
 
   @aokolnychyi trying to understand the problem and thinking loud for 
solution. 
   
   so, would the fix be to group manifests by partition spec before cluster by 
`clusterFunction`.
   so for group of each partition spec's manifest files, the cluster by logic 
will be applied, where currently its applied for every manifests irrespective 
of the underlying manifests partition spec.
   
   ```              try (ManifestReader reader =
                        
ManifestReader.read(ops.io().newInputFile(manifest.path()), 
ops.current().specsById())) {
                   FilteredManifest filteredManifest = 
reader.select(Arrays.asList("*"));
                   filteredManifest.liveEntries().forEach(
                       entry -> appendEntry(entry, 
clusterByFunc.apply(entry.file()))
                   );
   ```
   Manifest files grouped by partition spec is done in 
`MergingSnapshotProducer` at [this 
line](https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/MergingSnapshotProducer.java#L269).
 So similar logic can be applied.
   
   One more interesting point, 
   as per [this 
line](https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java#L242)
 ManifestFileWriter is created using the `TableOperations.current.spec` which 
will be used to create all `ManifestWriters`, though `ManifestReader` is 
created based on the `partitionSpec` belongs to the Manifest File, as per [this 
line](https://github.com/apache/incubator-iceberg/blob/master/core/src/main/java/org/apache/iceberg/BaseRewriteManifests.java#L176)
   
   Not sure if my understanding is correct or not, So that is also part of the 
problem ?
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to