gpandolfo opened a new issue, #806:
URL: https://github.com/apache/arrow-go/issues/806

   ### Describe the bug, including details regarding any error messages, 
version, and platform.
   
   metadata/statistics_types.gen.go factories (NewInt32Statistics, 
NewInt64Statistics, NewFloat32Statistics, NewFloat64Statistics, etc.) 
initialize hasDistinctCount: true at construction time, even though no distinct 
count has been computed. The result: every Statistics object emitted by 
arrow-go's parquet writer carries distinct_count = 0 in the thrift output — 
both in ColumnChunk.metadata.statistics (footer) and in 
DataPageHeader.statistics (column-data page header).
   
   The Parquet thrift Statistics.distinct_count field is OPTIONAL. Per spec 
semantics, the field MUST be absent when the value is unknown — emitting 0 
conflates "no distinct values" with "we don't know."
   
   Comparison to parquet-cpp (pyarrow)
   parquet-cpp (pyarrow) does NOT exhibit this behavior: distinct_count is set 
only when actually computed. For the same input table, pyarrow's footer + page 
headers omit the distinct_count field entirely; arrow-go's emit distinct_count: 
0. This is a real cross-implementation byte-asymmetry for any consumer who 
needs byte-identical Parquet output across language implementations.
   
   Reproducer (minimal Go)
   package main
   
   import (
       "bytes"
       "fmt"
       "github.com/apache/arrow-go/v18/arrow"
       "github.com/apache/arrow-go/v18/arrow/array"
       "github.com/apache/arrow-go/v18/arrow/memory"
       "github.com/apache/arrow-go/v18/parquet"
       "github.com/apache/arrow-go/v18/parquet/file"
       "github.com/apache/arrow-go/v18/parquet/pqarrow"
   )
   
   func main() {
       pool := memory.NewGoAllocator()
       schema := arrow.NewSchema([]arrow.Field{{Name: "id", Type: 
arrow.PrimitiveTypes.Int64, Nullable: false}}, nil)
       builder := array.NewInt64Builder(pool)
       builder.AppendValues([]int64{1, 2, 3}, nil)
       arr := builder.NewArray()
       rec := array.NewRecord(schema, []arrow.Array{arr}, 3)
   
       var buf bytes.Buffer
       props := parquet.NewWriterProperties(parquet.WithStats(true), 
parquet.WithCompression(0)) // Uncompressed for clarity
       arrowProps := pqarrow.NewArrowWriterProperties()
       w, _ := pqarrow.NewFileWriter(schema, &buf, props, arrowProps)
       w.Write(rec)
       w.Close()
   
       rdr, _ := file.OpenParquetReader(bytes.NewReader(buf.Bytes()), nil)
       md := rdr.MetaData()
       cm, _ := md.RowGroup(0).ColumnChunk(0)
       stats, _ := cm.Statistics()
       fmt.Printf("DistinctCount set? %v (value=%d)\n", 
stats.HasDistinctCount(), stats.DistinctCount())
       // Expected output (after fix): false (value=0)
       // Actual output (today):       true  (value=0)
   }
   Root cause
   metadata/statistics_types.gen.go lines 61-62 (NewInt32Statistics), 362-363 
(NewInt64Statistics), 663-664, 942-943, 1235-1236 — each factory sets:
   
   statistics: statistics{
       descr:            descr,
       hasNullCount:     true,
       hasDistinctCount: true,
       ...
   },
   The hasDistinctCount: true flag should NOT default to true; it should flip 
only when a caller explicitly invokes SetDistinctCount(...) with a known value. 
(hasNullCount: true may need similar review — see "Out of scope" below.)
   
   Proposed fix
   In each numeric NewXxxStatistics factory in 
metadata/statistics_types.gen.go, change hasDistinctCount: true to 
hasDistinctCount: false. The flag flips to true only via explicit 
SetDistinctCount() (or IncDistinct() after a real distinct-count calculation).
   
   Backward compatibility
   Output bytes change (smaller — distinct_count field removed when unknown).
   All readers tolerate the optional-field-absent case (pyarrow already omits, 
and all readers handle pyarrow output).
   arrow-go's own IsSetDistinctCount() bool method already exists and behaves 
correctly given the fix.
   The behavior of consumers that explicitly call SetDistinctCount(N) remains 
unchanged.
   Test
   Add a regression test: write a table without computing distinct_count; read 
back; assert IsSetDistinctCount() == false.
   Cross-impl conformance: produce a Parquet file via arrow-go and via 
parquet-cpp on the same input + settings; assert footer thrift bytes match 
(currently they don't because of this).
   Out of scope (related but separate)
   hasNullCount: true at the same construction sites — same factory bug 
pattern. Whether to fix in the same PR is a design call. arrow-go and pyarrow 
may align on null_count (it's typically computed during the write); flag for 
separate review.
   pyarrow's missing LogicalType emission knob for INT64 — separate cross-impl 
byte-identity issue; tracked at the consumer's spec level (sealed-variant 
pattern).
   Adjacent precedent
   Similar prior fix accepted upstream: apache/arrow-go#210 
"fix(parquet/metadata): fix default unsigned int statistics" (merged 
2024-12-10). Same code area, same kind of factory-default correction.
   
   PRs welcome from our side if maintainers prefer; happy to follow up.
   
   ### Component(s)
   
   Parquet


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to