laskoviymishka commented on code in PR #932:
URL: https://github.com/apache/iceberg-go/pull/932#discussion_r3199930503


##########
table/internal/parquet_files.go:
##########
@@ -577,7 +577,23 @@ func (p parquetFormat) DataFileStatsFromMeta(meta 
Metadata, statsCols map[int]St
                                panic(err)
                        }
 
-                       fieldID := colMapping[colChunk.PathInSchema().String()]
+                       fieldID, ok := 
colMapping[colChunk.PathInSchema().String()]
+                       if !ok {
+                               // Variant sub-columns (metadata, value) have 
no Iceberg field ID
+                               // (spec: "must not be assigned field IDs"). 
Skip them only if the
+                               // parent column is a variant field.
+                               path := colChunk.PathInSchema()
+                               if len(path) >= 2 {
+                                       parentPath := 
strings.Join(path[:len(path)-1], ".")
+                                       if parentID, hasParent := 
colMapping[parentPath]; hasParent {
+                                               if _, hasStats := 
statsCols[parentID]; !hasStats {

Review Comment:
   I think there's a subtle issue here — the comment above says "skip them only 
if the parent column is a variant field", but the actual gate is `if _, 
hasStats := statsCols[parentID]; !hasStats { continue }`, i.e. "skip if the 
parent has no configured stats". Today the two line up because 
`arrowStatsCollector.Variant` returns no collectors, but if a user sets a 
per-column metric mode on the variant 
(`write.metadata.metrics.column.payload=truncate(16)`), the parent *does* end 
up in `statsCols`, the inner skip is bypassed, and the outer panic on line 595 
fires on every variant write.
   
   Probably worth resolving the parent to the iceberg schema and checking the 
type explicitly:
   ```go
   if parentField, ok := schemaByID[parentID]; ok {
       if _, isVariant := parentField.Type.(iceberg.VariantType); isVariant {
           continue
       }
   }
   ```
   wdyt?



##########
table/substrait/substrait.go:
##########
@@ -170,6 +171,11 @@ func (convertToSubstrait) VisitUnknown() types.Type {
        return nil
 }
 
+func (convertToSubstrait) VisitVariant() types.Type {
+       // Variant has no Substrait equivalent
+       return nil

Review Comment:
   wanted to flag this one because the integration test won't catch it: 
returning `nil` makes `convertToSubstrait.Field` call `.WithNullability(...)` 
on a nil `types.Type` interface for any variant column, which panics. 
`arrow_scanner.go:377` calls `substrait.ConvertExpr(fileSchema, ...)` for *any* 
row filter, not just filters touching variant — so a scan with `EqualTo("id", 
1)` on a primitive sibling column of a table that contains a variant should 
panic at filter conversion time. Same shape as `VisitUnknown` above, but 
unknown isn't readable from data files; variant is.
   
   A couple of ways to handle: emit a placeholder substrait type (binary?) so 
structural projection still works while the existing predicate-rejection guards 
block actual operations *on* the variant, or strip variant columns from the 
schema before `ConvertExpr`. Either is fine, but I'd land one of them in this 
PR — the TODO note for predicate pushdown is still useful, but the scan-time 
panic shouldn't ship.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to