szehon-ho opened a new pull request #4142:
URL: https://github.com/apache/iceberg/pull/4142


   I was debugging why delete files were not being removed and wrote this small 
utility to dump the table metadata to the console, which proved quite helpful. 
   
   I think it can also help diagnose other problems with maintenance 
procedures, for example why files or metadata files are not being compacted.
   
   It works well for small tables. For big tables the dump will be longer to 
read through, but it is written in a streaming fashion, so it should not OOM.
   
   Example:
   
   ```
       ReachableFileUtil.printCurrentSnapshot(table, System.out);
   ```
   
   ```
   \---GenericManifestFile{content=DATA, 
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f3e35c2f-5dd9-4a13-a262-b050b74c248f-m1.avro,
 length=6560, partition_spec_id=0, added_snapshot_id=709093599914280039, 
added_data_files_count=1, added_rows_count=3, existing_data_files_count=0, 
existing_rows_count=0, deleted_data_files_count=0, deleted_rows_count=0, 
partitions=[], specId=0, key_metadata=null, sequence_number=3, 
min_sequence_number=3}
       +---GenericManifestEntry{status=ADDED, snapshot_id=709093599914280039, 
sequence_number=3, file=GenericDataFile{content=data, 
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00000-7-79c9c8a4-9bb0-4b4d-8e34-2fcee70fb357-00001.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=3, 
file_size_in_bytes=654, column_sizes={1=51, 2=57}, value_counts={1=3, 2=3}, 
null_value_counts={1=0, 2=0}, nan_value_counts={}, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
   
   \---GenericManifestFile{content=DATA, 
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f3e35c2f-5dd9-4a13-a262-b050b74c248f-m0.avro,
 length=6602, partition_spec_id=0, added_snapshot_id=709093599914280039, 
added_data_files_count=0, added_rows_count=0, existing_data_files_count=0, 
existing_rows_count=0, deleted_data_files_count=2, deleted_rows_count=4, 
partitions=[], specId=0, key_metadata=null, sequence_number=3, 
min_sequence_number=3}
       +---GenericManifestEntry{status=DELETED, snapshot_id=709093599914280039, 
sequence_number=3, file=GenericDataFile{content=data, 
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00000-0-903b06f3-f54b-471d-891b-44ed7022f671-00001.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=2, 
file_size_in_bytes=644, column_sizes={1=49, 2=51}, value_counts={1=2, 2=2}, 
null_value_counts={1=0, 2=0}, nan_value_counts={}, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
       +---GenericManifestEntry{status=DELETED, snapshot_id=709093599914280039, 
sequence_number=3, file=GenericDataFile{content=data, 
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00001-1-2a15729e-3940-4231-b5ad-47962462acca-00001.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=2, 
file_size_in_bytes=644, column_sizes={1=49, 2=51}, value_counts={1=2, 2=2}, 
null_value_counts={1=0, 2=0}, nan_value_counts={}, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@94446f85, 
key_metadata=null, split_offsets=[4], equality_ids=null, sort_order_id=0}}
   
   \---GenericManifestFile{content=DELETES, 
path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/metadata/f4c479d4-fd21-49fd-9903-636f1fed09ac-m0.avro,
 length=6603, partition_spec_id=0, added_snapshot_id=7445530922696444799, 
added_data_files_count=1, added_rows_count=1, existing_data_files_count=0, 
existing_rows_count=0, deleted_data_files_count=0, deleted_rows_count=0, 
partitions=[], specId=0, key_metadata=null, sequence_number=2, 
min_sequence_number=2}
        +---GenericManifestEntry{status=ADDED, snapshot_id=7445530922696444799, 
sequence_number=2, file=GenericDeleteFile{content=position_deletes, 
file_path=file:/var/folders/wy/5b87_qx57n974szn9_wrn6lw0000gn/T/hive3369326090081521705/table/data/00003-3-d496c17a-f752-4bc8-b555-9ee4a4b006a5-00001.parquet,
 file_format=PARQUET, spec_id=0, partition=PartitionData{}, record_count=1, 
file_size_in_bytes=1670, column_sizes={2147483546=170, 2147483545=46}, 
value_counts={2147483546=1, 2147483545=1}, null_value_counts={2147483546=0, 
2147483545=0}, nan_value_counts={}, 
lower_bounds=org.apache.iceberg.SerializableByteBufferMap@2686ed34, 
upper_bounds=org.apache.iceberg.SerializableByteBufferMap@2686ed34, 
key_metadata=null, split_offsets=null, equality_ids=null, sort_order_id=null}}
   
   
   ```
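
To illustrate what "streaming" means for a dump like the one above, here is a minimal sketch. It is not the code in this PR: `MetadataDumpSketch`, `Manifest`, and `dump` are hypothetical stand-ins that use plain strings instead of Iceberg's manifest classes. The idea is only that each line is written to the `PrintStream` as soon as it is read, so nothing accumulates in memory as long as the entry source is itself lazy.

```java
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-ins for illustration only; these are not Iceberg classes.
class MetadataDumpSketch {

  /** A manifest reduced to a description plus a source of entries opened on demand. */
  static final class Manifest {
    final String description;
    final Supplier<Iterable<String>> entries;

    Manifest(String description, Supplier<Iterable<String>> entries) {
      this.description = description;
      this.entries = entries;
    }
  }

  /**
   * Writes one line per manifest and one indented line per entry.
   * Lines are written as they are read rather than collected first, so the
   * dump itself holds no more than one entry at a time (assuming the
   * Iterable returned by the supplier reads lazily).
   */
  static void dump(Iterable<Manifest> manifests, PrintStream out) {
    for (Manifest manifest : manifests) {
      out.println("\\---" + manifest.description);
      for (String entry : manifest.entries.get()) {
        out.println("    +---" + entry);
      }
      out.println();
    }
  }

  public static void main(String[] args) {
    // In-memory lists keep the example short; a real source would read from files.
    List<Manifest> manifests = Arrays.asList(
        new Manifest("manifest-1 (content=DATA)",
            () -> Arrays.asList("entry status=ADDED a.parquet")),
        new Manifest("manifest-2 (content=DELETES)",
            () -> Arrays.asList("entry status=ADDED d.parquet")));
    dump(manifests, System.out);
  }
}
```

Taking a `PrintStream` parameter, as the example call in this description does with `System.out`, also lets the same dump go to a file or a test buffer.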
   
   This also adds a few missing fields to `ManifestFile`'s `toString`.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


