Fokko opened a new issue, #8598:
URL: https://github.com/apache/iceberg/issues/8598

   ### Apache Iceberg version
   
   1.3.1 (latest release)
   
   ### Query engine
   
   None
   
   ### Please describe the bug 🐞
   
   There is some ambiguity in Iceberg around how null metrics are collected for complex types. Let's focus on the `list` first, which illustrates the problem well:
   
   ```
   table {
     1: some_list: optional list<2: int>
   }
   ```
   
   The list itself does not track any statistics, which can be confusing:
   
   
![image](https://github.com/apache/iceberg/assets/1134248/f1849df0-1255-4655-a298-12d35c2badde)
   
   Spark writes each record to a different file (default parallelism of 200).
   
   The correct behavior would be the following null counts per file, keyed by field ID:
   ```sql
   CREATE TABLE s.l1 SELECT array(1,2,3) AS some_list -- Expect: {1: 0, 2: 0}
           UNION ALL SELECT array(1,null,3) AS some_list -- Expect: {1: 0, 2: 1}
           UNION ALL SELECT null AS some_list -- Expect: {1: 1, 2: 0}
   ```
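
To make the expected values concrete, here is a minimal sketch of how the null counts per field ID could be derived for this schema. It is plain Python and not Iceberg's writer code; the function name `null_value_counts` and the two constants are hypothetical, and it assumes that a null list contributes to field 1 only, as the expectations above suggest.

```python
from typing import Dict, List, Optional

# Field IDs taken from the schema above; the constant names are hypothetical.
LIST_FIELD_ID = 1      # some_list
ELEMENT_FIELD_ID = 2   # the list's int element


def null_value_counts(rows: List[Optional[List[Optional[int]]]]) -> Dict[int, int]:
    """Count nulls per field ID for the rows that end up in one data file."""
    counts = {LIST_FIELD_ID: 0, ELEMENT_FIELD_ID: 0}
    for some_list in rows:
        if some_list is None:
            # The list itself is null: counted on the list field only.
            counts[LIST_FIELD_ID] += 1
            continue
        # The list is present: count its null elements on the element field.
        counts[ELEMENT_FIELD_ID] += sum(1 for element in some_list if element is None)
    return counts


# One row per file, matching the three SELECTs above.
print(null_value_counts([[1, 2, 3]]))     # {1: 0, 2: 0}
print(null_value_counts([[1, None, 3]]))  # {1: 0, 2: 1}
print(null_value_counts([None]))          # {1: 1, 2: 0}
```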
   
   Also, if you run this query:
   ```sql
   SELECT * FROM s.l1 WHERE some_list IS NULL
   ```
   no filter is pushed down, and all the data files are fetched:
   
   ```
   2023-09-20T08:24:04.059 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/metadata/06caa8ee-da4f-42f4-b6f3-51dd26a4dfde-m0.avro 172.24.0.5       617µs       ⇣  591.333µs  ↑ 141 B ↓ 5.7 KiB
   2023-09-20T08:24:04.173 [200 OK] s3.HeadObject warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       332µs       ⇣  0s         ↑ 126 B ↓ 0 B
   2023-09-20T08:24:04.178 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       310µs       ⇣  295.75µs  ↑ 141 B ↓ 8 B
   2023-09-20T08:24:04.182 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       469µs       ⇣  448.542µs  ↑ 141 B ↓ 453 B
   2023-09-20T08:24:04.189 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00000-4-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       307µs       ⇣  292.667µs  ↑ 141 B ↓ 546 B
   2023-09-20T08:24:04.193 [200 OK] s3.HeadObject warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       227µs       ⇣  0s         ↑ 126 B ↓ 0 B
   2023-09-20T08:24:04.196 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       205µs       ⇣  195.666µs  ↑ 141 B ↓ 8 B
   2023-09-20T08:24:04.198 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       326µs       ⇣  311.334µs  ↑ 141 B ↓ 452 B
   2023-09-20T08:24:04.202 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00001-5-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       332µs       ⇣  319.041µs  ↑ 141 B ↓ 542 B
   2023-09-20T08:24:04.208 [200 OK] s3.HeadObject warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       227µs       ⇣  0s         ↑ 126 B ↓ 0 B
   2023-09-20T08:24:04.210 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       279µs       ⇣  267.333µs  ↑ 141 B ↓ 8 B
   2023-09-20T08:24:04.213 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       467µs       ⇣  449.584µs  ↑ 141 B ↓ 428 B
   2023-09-20T08:24:04.217 [206 Partial Content] s3.GetObject warehouse.minio:9000/s/l1/data/00002-6-71bb85a7-d7ac-4499-907c-e7cccc0074db-00001.parquet 172.24.0.5       289µs       ⇣  274.75µs  ↑ 141 B ↓ 504 B
   ```
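
The trace shows all three data files being read. As a rough illustration of why the list-level null count matters for an `IS NULL` filter, the sketch below models file pruning in plain Python. It is a simplified model and not Iceberg's metrics evaluator; the names `may_contain_null` and `files_to_scan` and the file labels are hypothetical. The assumption is that a file can only be skipped when a null count for the filtered field is present and equal to zero.

```python
from typing import Dict, List

SOME_LIST_FIELD_ID = 1  # field ID of some_list in the schema above


def may_contain_null(null_value_counts: Dict[int, int], field_id: int) -> bool:
    """Return True unless the metrics prove the field has no nulls in this file."""
    count = null_value_counts.get(field_id)
    # A missing metric proves nothing, so the file has to be read.
    return count is None or count > 0


def files_to_scan(files: Dict[str, Dict[int, int]], field_id: int) -> List[str]:
    """Select the files that could match `<field> IS NULL`."""
    return [name for name, counts in files.items() if may_contain_null(counts, field_id)]


# Hypothetical file labels with the null counts expected in this issue.
expected_metrics = {
    "file-a": {1: 0, 2: 0},
    "file-b": {1: 0, 2: 1},
    "file-c": {1: 1, 2: 0},
}
# Same files when the list itself (field 1) tracks no statistics.
without_list_metrics = {name: {2: counts[2]} for name, counts in expected_metrics.items()}

print(files_to_scan(expected_metrics, SOME_LIST_FIELD_ID))      # ['file-c']
print(files_to_scan(without_list_metrics, SOME_LIST_FIELD_ID))  # ['file-a', 'file-b', 'file-c']
```

Under this model, a null count on the list field would let the query skip two of the three files, whereas a missing metric forces a full scan.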
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

