[GitHub] [iceberg] srilman opened a new issue, #4921: Missing Data File Metric Info from Manifest

GitBox Tue, 31 May 2022 13:53:21 -0700


srilman opened a new issue, #4921:
URL: https://github.com/apache/iceberg/issues/4921


   I'm trying to access some of the metric-related information (i.e. 
lower_bounds, upper_bounds, distinct_counts) after performing a scan using the 
Java API. I've confirmed (by looking in the manifest files) that these pieces 
of metadata are written. However, the Java API returns null for all of these 
values.
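
For anyone checking the manifest contents by hand, here is a minimal sketch of how a raw bound value could be decoded. It assumes the single-value binary encoding described in the Iceberg spec, where an `int` is a 4-byte little-endian integer and a `date` is the number of days since 1970-01-01 stored the same way. The helper names and the sample bytes are hypothetical; nothing here is read from a real manifest.

```python
import struct
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def decode_int_bound(raw: bytes) -> int:
    """Decode an `int` bound: 4-byte little-endian signed integer."""
    return struct.unpack("<i", raw)[0]


def decode_date_bound(raw: bytes) -> date:
    """Decode a `date` bound: days since 1970-01-01 as a 4-byte little-endian int."""
    return EPOCH + timedelta(days=decode_int_bound(raw))


# Hypothetical raw values, constructed here rather than taken from a manifest.
index_lower = struct.pack("<i", 11)
dates_lower = struct.pack("<i", (date(2021, 11, 12) - EPOCH).days)

print("index lower bound:", decode_int_bound(index_lower))
print("dates lower bound:", decode_date_bound(dates_lower))
```

Decoding the bytes this way only confirms what the writer stored; it does not explain why the scan returns null.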
   
   Minimal Reproducer:
   Spark Code to Generate the Tables:
   ```python
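   from datetime import datetime
   import numpy as np
   import pandas as pd
   import pyspark.sql.types as spark_types
   from pyspark.sql import SparkSession

   # Placeholder value: the original snippet uses DATABASE_NAME without defining it.
   DATABASE_NAME = "test_db"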
   
   def create_table(table_name="test_table"):
       spark = (
           SparkSession.builder.appName("Iceberg with Spark")
           .config(
               "spark.jars.packages",
               "org.apache.iceberg:iceberg-spark-runtime-3.2_2.12:0.13.0",
           )
           .config(
               "spark.sql.catalog.hadoop_prod",
               "org.apache.iceberg.spark.SparkCatalog",
           )
           .config("spark.sql.catalog.hadoop_prod.type", "hadoop")
           .config("spark.sql.catalog.hadoop_prod.warehouse", ".")
           .config(
               "spark.sql.extensions",
               "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
           )
           .getOrCreate()
       )
       
       df = pd.DataFrame(
           {
               "index": np.arange(25, dtype=np.int32),
               "dates": pd.Series(
                   [datetime.strptime(f"12/11/{2010 + x}", "%d/%m/%Y") for x in range(25)]
               ),
           }
       )
   
       schema = spark_types.StructType(
           [
               spark_types.StructField("index", spark_types.IntegerType(), False),
               spark_types.StructField("dates", spark_types.DateType(), False),
           ]
       )
   
       df = spark.createDataFrame(df, schema=schema)
       df.writeTo(f"hadoop_prod.{DATABASE_NAME}.{table_name}").tableProperty(
           "format-version", "2"
       ).tableProperty("write.delete.mode", "merge-on-read").createOrReplace()
   
   
   if __name__ == "__main__":
       create_table()
   ```
   
   Java Code to Read Metadata with a Table Scan:
   ```java
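   import java.io.IOException;
   import org.apache.iceberg.FileScanTask;
   import org.apache.iceberg.Table;
   import org.apache.iceberg.TableScan;
   import org.apache.iceberg.expressions.Expression;
   import org.apache.iceberg.expressions.Expressions;
   import org.apache.iceberg.hadoop.HadoopTables;
   import org.apache.iceberg.io.CloseableIterable;

   public class IcebergTester {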
     public static void main(String[] args) throws IOException {
       HadoopTables catalog = new HadoopTables();
       System.setProperty("user.dir", ...);
       Table table = catalog.load(...);
       Expression filter = Expressions.greaterThan("index", 10);
       TableScan scan = table.newScan().filter(filter);
       try (CloseableIterable<FileScanTask> fileTasks = scan.planFiles()) {
           for (FileScanTask fileTask : fileTasks) {
               System.out.print("Lower Bounds ");
               System.out.println(fileTask.file().lowerBounds());
               System.out.print("Upper Bounds ");
               System.out.println(fileTask.file().upperBounds());
               System.out.println();
           }
       }
     }
   }
   ```
   
   Output:
   ```
   Lower Bounds null
   Upper Bounds null
   
   Lower Bounds null
   Upper Bounds null
   ...
   ```

