arunb2w opened a new issue, #6406:
URL: https://github.com/apache/iceberg/issues/6406
### Apache Iceberg version
0.14.0
### Query engine
Spark
### Please describe the bug 🐞
I performed the steps below to analyze the table metadata after a rewrite
using the sort strategy.
1) Run rewrite_data_files with sort by _CONTEXT_ID_
spark.sql(f"CALL {catalog}.system.rewrite_data_files(table =>
'{db}.{table_name}', strategy => 'sort', sort_order => '_CONTEXT_ID_', options
=> map('rewrite-all','true','max-concurrent-file-group-rewrites', '3'))")
2) Download the metadata JSON and manifest Avro files from the metadata folder
after the rewrite completes.
3) Run the manifest2json tool -
[https://github.com/hililiwei/iceberg-tools#manifest2json](https://github.com/hililiwei/iceberg-tools#manifest2json)
with the metadata JSON and manifest Avro files as input.
4) Parse the JSON and get the lower and upper bound values for the columns that
are part of the metadata metrics. Below is my table config:
.tableProperty("format-version", "2") \
.tableProperty("read.parquet.vectorization.enabled", "true") \
.tableProperty("write.metadata.metrics.default", "none") \
.tableProperty("write.metadata.metrics.column._CONTEXT_ID_", "full") \
.tableProperty("write.metadata.metrics.column.ID", "full") \
.tableProperty("write.target-file-size-bytes", "134217728") \
5) Write the result as a CSV file with human-readable metadata for each
Parquet file, for further analysis.
I have attached the resulting CSV file here. If we order it by context, we
can see that the files do not respect the sort order and that the value ranges
overlap across multiple files. Overlaps can occur when the data for a
particular context is large, but even then the files should respect the sort
order, and that is not happening here.
Note that this does not happen for all tables. So far I have noticed it only
in large tables with many columns. I am not sure whether the number of columns
in the table is causing this. For the attached manifest metadata, the table
contains 115 columns.
[manifest_metadata.csv](https://github.com/apache/iceberg/files/10205213/manifest_metadata.csv)
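
The overlap check described above can be sketched roughly as follows. This is only an illustration: the tuple layout `(file name, lower bound, upper bound)` and the function name `find_overlaps` are assumptions made for this example, not the actual columns of the attached CSV, and the bounds are assumed to be directly comparable values. Files that only share a boundary value are treated as acceptable, since a large context can legitimately span two files.

```python
from typing import List, Tuple

# Hypothetical layout: (file name, lower bound, upper bound) of the sort column.
FileBounds = Tuple[str, int, int]


def find_overlaps(files: List[FileBounds]) -> List[Tuple[str, str]]:
    """Return pairs of files whose sort-column ranges strictly overlap.

    Files are ordered by lower bound. A file overlaps an earlier one when its
    lower bound is strictly less than the largest upper bound seen so far.
    Sharing a single boundary value is not counted as an overlap.
    """
    ordered = sorted(files, key=lambda f: (f[1], f[2]))
    overlaps: List[Tuple[str, str]] = []
    max_upper = None       # largest upper bound seen so far
    max_upper_file = None  # file that owns that upper bound
    for name, lower, upper in ordered:
        if max_upper is not None and lower < max_upper:
            overlaps.append((max_upper_file, name))
        if max_upper is None or upper > max_upper:
            max_upper, max_upper_file = upper, name
    return overlaps


if __name__ == "__main__":
    # Made-up bounds, not taken from the attached CSV.
    sample = [("a.parquet", 1, 5), ("b.parquet", 5, 9), ("c.parquet", 7, 12)]
    for left, right in find_overlaps(sample):
        print(f"{left} overlaps {right}")
```

With a correctly sorted rewrite, I would expect this kind of check to report no pairs, or only files that share a boundary value, which it ignores.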