szehon-ho opened a new pull request #4142:
URL: https://github.com/apache/iceberg/pull/4142
While debugging why delete files were not being removed, I wrote this small
utility to dump the metadata to the console, which proved quite helpful.
I think it can also help diagnose other problems when invoking maintenance
procedures, for example why data files or metadata files are not being
compacted.
It works well for small tables. For big tables the dump will be longer to
read through, but it is written in a streaming fashion, so it should not OOM.
Example:
```
ReachableFileUtil.printCurrentSnapshot(table, System.out);
```
```
\---GenericManifestFile{content=DATA,
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f3e35c2f-5dd9-4a13-a262-b050b74c248f-m1.avro,
length=6560, partition_spec_id=0, added_snapshot_id=709093599914280039,
added_data_files_count=1, added_rows_count=3, existing_data_files_count=0,
existing_rows_count=0, deleted_data_files_count=0, deleted_rows_count=0,
partitions=[], specId=0, key_metadata=null, sequence_number=3,
min_sequence_number=3}
+---GenericManifestEntry{status=ADDED, snapshot_id=709093599914280039,
sequence_number=3, file=GenericDataFile{content=data,
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00000-7-79c9c8a4-9bb0-4b4d-8e34-2fcee70fb357-00001.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=3,
file_size_in_bytes=654, column_sizes={1=51, 2=57}, value_counts={1=3, 2=3},
null_value_counts={1=0, 2=0}, nan_value_counts={},
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
\---GenericManifestFile{content=DATA,
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f3e35c2f-5dd9-4a13-a262-b050b74c248f-m0.avro,
length=6602, partition_spec_id=0, added_snapshot_id=709093599914280039,
added_data_files_count=0, added_rows_count=0, existing_data_files_count=0,
existing_rows_count=0, deleted_data_files_count=2, deleted_rows_count=4,
partitions=[], specId=0, key_metadata=null, sequence_number=3,
min_sequence_number=3}
+---GenericManifestEntry{status=DELETED, snapshot_id=709093599914280039,
sequence_number=3, file=GenericDataFile{content=data,
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00000-0-903b06f3-f54b-471d-891b-44ed7022f671-00001.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=2,
file_size_in_bytes=644, column_sizes={1=49, 2=51}, value_counts={1=2, 2=2},
null_value_counts={1=0, 2=0}, nan_value_counts={},
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
+---GenericManifestEntry{status=DELETED, snapshot_id=709093599914280039,
sequence_number=3, file=GenericDataFile{content=data,
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00001-1-2a15729e-3940-4231-b5ad-47962462acca-00001.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=2,
file_size_in_bytes=644, column_sizes={1=49, 2=51}, value_counts={1=2, 2=2},
null_value_counts={1=0, 2=0}, nan_value_counts={},
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85,
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
\---GenericManifestFile{content=DELETES,
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f4c479d4-fd21-49fd-9903-636f1fed09ac-m0.avro,
length=6603, partition_spec_id=0, added_snapshot_id=7445530922696444799,
added_data_files_count=1, added_rows_count=1, existing_data_files_count=0,
existing_rows_count=0, deleted_data_files_count=0, deleted_rows_count=0,
partitions=[], specId=0, key_metadata=null, sequence_number=2,
min_sequence_number=2}
+---GenericManifestEntry{status=ADDED, snapshot_id=7445530922696444799,
sequence_number=2, file=GenericDeleteFile{content=position_deletes,
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00003-3-d496c17a-f752-4bc8-b555-9ee4a4b006a5-00001.parquet,
file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=1,
file_size_in_bytes=1670, column_sizes={2147483546=170, 2147483545=46},
value_counts={2147483546=1, 2147483545=1}, null_value_counts={2147483546=0,
2147483545=0}, nan_value_counts={},
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@2686ed34,
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@2686ed34,
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null}}
```
This also adds a few missing fields to ManifestFile's toString output.
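
For readers who want to see the streaming idea in isolation, here is a minimal sketch of how such a dump can be written one line at a time. It is not the code in this PR and does not use the Iceberg API: `MetadataDumpSketch`, `Manifest`, and `dump` are hypothetical names, and plain strings stand in for manifests and manifest entries. The only things taken from the output above are the `\---` and `+---` line prefixes.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

public class MetadataDumpSketch {

  /** Hypothetical stand-in for a manifest: a description plus a lazy source of entry descriptions. */
  public static final class Manifest {
    final String description;
    final Supplier<Iterator<String>> entries;

    public Manifest(String description, Supplier<Iterator<String>> entries) {
      this.description = description;
      this.entries = entries;
    }
  }

  /**
   * Writes one line per manifest and one line per entry.
   * Entries are pulled from the iterator and written immediately,
   * so only the current entry is held by this method at any time.
   */
  public static void dump(List<Manifest> manifests, PrintStream out) {
    for (Manifest manifest : manifests) {
      out.println("\\---" + manifest.description);
      Iterator<String> entries = manifest.entries.get();
      while (entries.hasNext()) {
        out.println("+---" + entries.next());
      }
    }
    out.flush();
  }

  public static void main(String[] args) {
    // Illustrative data only; a real dump would read entries lazily from manifest files.
    List<Manifest> manifests = Arrays.asList(
        new Manifest("manifest{content=DATA}",
            () -> Arrays.asList("entry{status=ADDED}").iterator()),
        new Manifest("manifest{content=DELETES}",
            () -> Arrays.asList("entry{status=ADDED}").iterator()));
    dump(manifests, System.out);
  }
}
```

Taking a `PrintStream` rather than returning a `String` is what keeps memory use flat in this sketch: the caller can pass `System.out` or a file-backed stream, and the dump never has to exist in memory as a whole.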