cccs-jc opened a new issue, #5370:
URL: https://github.com/apache/iceberg/issues/5370

   
   Iceberg has functionality to introspect tables. This is very useful, for
   example, to check that a column is properly sorted by inspecting its
   lower/upper bounds.
   
   https://iceberg.apache.org/docs/latest/spark-queries/#files
   
   The query `SELECT * FROM prod.db.table.files` returns `lower_bounds` and
   `upper_bounds` columns, each a map of column ID to binary, serialized as
   described in the spec:
   
   https://iceberg.apache.org/spec/#appendix-d-single-value-serialization
   
   To interpret the binary values we register custom UDFs like this one, which
   decodes the little-endian bytes of an `int` column:
   ```python
   from pyspark.sql import functions as F
   from pyspark.sql.types import IntegerType

   def _to_int(data):
       # Iceberg stores int bounds as 4-byte little-endian bytes
       if data is None:
           return None
       return int.from_bytes(data, byteorder='little', signed=True)

   # register PySpark DataFrame UDF
   to_int = F.udf(_to_int, IntegerType())
   # register Spark SQL UDF
   spark.udf.register("to_int", _to_int, IntegerType())
   ```
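   The same approach extends to the other primitive types in the single-value
   serialization appendix (longs and doubles are 8-byte little-endian, strings
   are UTF-8 bytes). A minimal sketch, assuming you check your own schema for
   which column IDs map to which types:
   ```python
   import struct
   from pyspark.sql.types import LongType, DoubleType, StringType

   def _to_long(data):
       # 8-byte little-endian signed integer
       return int.from_bytes(data, byteorder='little', signed=True) if data is not None else None

   def _to_double(data):
       # 8-byte little-endian IEEE 754 double
       return struct.unpack('<d', data)[0] if data is not None else None

   def _to_string(data):
       # UTF-8 bytes without a length prefix
       return data.decode('utf-8') if data is not None else None

   spark.udf.register("to_long", _to_long, LongType())
   spark.udf.register("to_double", _to_double, DoubleType())
   spark.udf.register("to_string", _to_string, StringType())
   ```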
   Then we can use this function to interpret the data and display it correctly.
   ```sql
   -- Stored as 4-byte little-endian
   SELECT
       min(to_int(lower_bounds[1])) min_a,
       max(to_int(upper_bounds[1])) max_a,
       min(to_int(lower_bounds[2])) min_b,
       max(to_int(upper_bounds[2])) max_b,
       min(to_int(lower_bounds[3])) min_c,
       max(to_int(upper_bounds[3])) max_c
   FROM
       prod.db.table.files
   ```  
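   For reference, a rough DataFrame-API equivalent of the query above, using the
   `to_int` UDF registered earlier (the column IDs 1-3 are just the ones from the
   example, not anything standard):
   ```python
   # load the files metadata table through the catalog
   files = spark.table("prod.db.table.files")

   files.agg(
       F.min(to_int(F.col("lower_bounds")[1])).alias("min_a"),
       F.max(to_int(F.col("upper_bounds")[1])).alias("max_a"),
       F.min(to_int(F.col("lower_bounds")[2])).alias("min_b"),
       F.max(to_int(F.col("upper_bounds")[2])).alias("max_b"),
       F.min(to_int(F.col("lower_bounds")[3])).alias("min_c"),
       F.max(to_int(F.col("upper_bounds")[3])).alias("max_c"),
   ).show()
   ```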
   Does Iceberg come with utility functions like these? Is there an easier way
   to interpret the binary data than writing a custom UDF?
   
   

