[GitHub] [iceberg] arunb2w opened a new issue, #6406: Overlapping data in data files even after sorting

GitBox Sun, 11 Dec 2022 22:45:37 -0800


arunb2w opened a new issue, #6406:
URL: https://github.com/apache/iceberg/issues/6406


   ### Apache Iceberg version
   
   0.14.0
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   I performed the steps below to analyze the table metadata after a rewrite 
using the sort strategy.
   
   1) Run rewrite_data_files with sort by _CONTEXT_ID_:
   spark.sql(f"CALL {catalog}.system.rewrite_data_files(table => '{db}.{table_name}', strategy => 'sort', sort_order => '_CONTEXT_ID_', options => map('rewrite-all','true','max-concurrent-file-group-rewrites', '3'))")
   2) Download the metadata JSON and manifest Avro files from the metadata 
folder after the rewrite completes.
   3) Run manifest2json tool - 
[https://github.com/hililiwei/iceberg-tools#manifest2json](https://github.com/hililiwei/iceberg-tools#manifest2json)
  with the metadata json and manifest avro as input.
   4) Parse the JSON and extract the lower and upper bound values for the 
columns that have metadata metrics enabled. Below is my table config:
   .tableProperty("format-version", "2") \
   .tableProperty("read.parquet.vectorization.enabled", "true") \
   .tableProperty("write.metadata.metrics.default", "none") \
   .tableProperty("write.metadata.metrics.column._CONTEXT_ID_", "full") \
   .tableProperty("write.metadata.metrics.column.ID", "full") \
   .tableProperty("write.target-file-size-bytes", "134217728") \
   5) Write the result to a CSV file, with human-readable metadata for each 
Parquet file, for further analysis.
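
Steps 4 and 5 can be sketched roughly as follows. This is a minimal sketch, not the code used in the issue: it assumes the manifest entries have already been parsed into dicts with the hypothetical keys `file_path`, `lower_bounds`, and `upper_bounds` (maps from field id to raw bytes), that `_CONTEXT_ID_` is a `long` column, and that its field id is 1. The actual manifest2json output layout and the real field id may differ. The decoding relies on Iceberg's single-value serialization, which stores a `long` as 8 bytes, little-endian.

```python
import csv
import io
import struct

# Hypothetical field id of _CONTEXT_ID_; look up the real one in the table schema.
CONTEXT_FIELD_ID = 1


def decode_long(raw: bytes) -> int:
    """Decode an Iceberg bound for a long column (8 bytes, little-endian)."""
    return struct.unpack("<q", raw)[0]


def bounds_to_csv(entries, field_id=CONTEXT_FIELD_ID) -> str:
    """Return CSV text with one row per data file: path, lower bound, upper bound."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["file_path", "lower", "upper"])
    for entry in entries:
        writer.writerow([
            entry["file_path"],
            decode_long(entry["lower_bounds"][field_id]),
            decode_long(entry["upper_bounds"][field_id]),
        ])
    return out.getvalue()
```

If the column is a string rather than a long, the bound bytes would be decoded as UTF-8 instead.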
   
   I have attached the resulting CSV file here. If we order it by 
_CONTEXT_ID_, we can see that the files do not respect the sort order: the 
value ranges overlap across multiple files. Overlaps can occur when the data 
for a particular context is too large for one file, but the files should still 
respect the sort order, and that’s not happening here.
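
The overlap check described above can be expressed as a small script. This is a sketch under assumptions: each data file is represented as a `(file_path, lower, upper)` tuple taken from the CSV, and the function name and the `allow_shared_boundary` flag are illustrative only. With the flag set, two files that share a single boundary value (one context spilling into the next file) are not reported, which matches the expectation that such overlaps are acceptable.

```python
from typing import List, Tuple

FileRange = Tuple[str, int, int]  # (file_path, lower bound, upper bound)


def find_overlaps(
    ranges: List[FileRange], allow_shared_boundary: bool = True
) -> List[Tuple[str, str]]:
    """Return (earlier_file, later_file) pairs whose value ranges overlap.

    Files are ordered by lower bound; each file is compared against the
    file seen so far with the largest upper bound.
    """
    ordered = sorted(ranges, key=lambda r: (r[1], r[2]))
    if not ordered:
        return []
    overlaps = []
    widest = ordered[0]
    for cur in ordered[1:]:
        if allow_shared_boundary:
            overlapping = cur[1] < widest[2]
        else:
            overlapping = cur[1] <= widest[2]
        if overlapping:
            overlaps.append((widest[0], cur[0]))
        if cur[2] > widest[2]:
            widest = cur
    return overlaps
```

For a table that is sorted on the column across all of its files, this would be expected to return an empty list, apart from shared boundaries when `allow_shared_boundary` is disabled.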
   
   Note that this does not happen for all tables. So far I have noticed it in 
large tables with many columns. I am not sure whether the number of columns in 
the table is causing this. The table for the attached manifest metadata 
contains 115 columns.
   
[manifest_metadata.csv](https://github.com/apache/iceberg/files/10205213/manifest_metadata.csv)
   
   
   


