cccs-jc opened a new issue, #5370: URL: https://github.com/apache/iceberg/issues/5370
Iceberg has functionality to introspect tables. This is very useful, for example, to check that a column is properly sorted by inspecting its lower/upper bounds: https://iceberg.apache.org/docs/latest/spark-queries/#files

The query

```sql
SELECT * FROM prod.db.table.files
```

returns `upper_bounds` and `lower_bounds` columns, each a map of column ID to binary, encoded per the single-value serialization spec: https://iceberg.apache.org/spec/#appendix-d-single-value-serialization

To interpret the binary values we register custom UDFs like this one, which converts little-endian bytes to an int:

```python
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

def _to_int(data):
    return int.from_bytes(data, byteorder='little', signed=True)

# register PySpark UDF
to_int = F.udf(_to_int, IntegerType())

# register SQL UDF
spark.udf.register("to_int", _to_int, IntegerType())
```

Then we can use this function to interpret the data and display it correctly:

```sql
-- int values are stored as 4-byte little-endian
SELECT
  min(to_int(lower_bounds[1])) min_a, max(to_int(upper_bounds[1])) max_a,
  min(to_int(lower_bounds[2])) min_b, max(to_int(upper_bounds[2])) max_b,
  min(to_int(lower_bounds[3])) min_c, max(to_int(upper_bounds[3])) max_c
FROM prod.db.table.files
```

Does Iceberg come with utility functions like these? Is there an easier way to interpret the binary data than writing custom UDFs?
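For reference, the same UDF approach can be extended to other primitive types by following the single-value serialization rules in the spec linked above (long as 8-byte little-endian, float/double as little-endian IEEE 754, string as UTF-8 bytes). The sketch below is not an Iceberg API; the helper names (`to_long`, `to_float`, `to_double`, `to_string`) are made up here, and it assumes an existing `spark` session as in the snippet above.

```python
import struct

from pyspark.sql.types import DoubleType, FloatType, LongType, StringType

def _to_long(data):
    # long: 8-byte little-endian two's complement
    return int.from_bytes(data, byteorder='little', signed=True)

def _to_float(data):
    # float: 4-byte little-endian IEEE 754
    return struct.unpack('<f', bytes(data))[0]

def _to_double(data):
    # double: 8-byte little-endian IEEE 754
    return struct.unpack('<d', bytes(data))[0]

def _to_string(data):
    # string: UTF-8 bytes (no length prefix)
    return bytes(data).decode('utf-8')

# register as SQL UDFs (function names are arbitrary)
spark.udf.register("to_long", _to_long, LongType())
spark.udf.register("to_float", _to_float, FloatType())
spark.udf.register("to_double", _to_double, DoubleType())
spark.udf.register("to_string", _to_string, StringType())
```

Usage is the same as `to_int` above, e.g. `SELECT max(to_double(upper_bounds[4])) FROM prod.db.table.files` (column ID 4 is just an example and depends on the table schema).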
