gpandolfo opened a new issue, #806:
URL: https://github.com/apache/arrow-go/issues/806
### Describe the bug, including details regarding any error messages,
version, and platform.
metadata/statistics_types.gen.go factories (NewInt32Statistics,
NewInt64Statistics, NewFloat32Statistics, NewFloat64Statistics, etc.)
initialize hasDistinctCount: true at construction time, even though no distinct
count has been computed. The result: every Statistics object emitted by
arrow-go's parquet writer carries distinct_count = 0 in the thrift output —
both in ColumnChunk.metadata.statistics (footer) and in
DataPageHeader.statistics (column-data page header).
The Parquet thrift Statistics.distinct_count field is OPTIONAL. Per spec
semantics, the field MUST be absent when the value is unknown — emitting 0
conflates "no distinct values" with "we don't know."
Comparison to parquet-cpp (pyarrow)
parquet-cpp (pyarrow) does NOT exhibit this behavior: distinct_count is set
only when actually computed. For the same input table, pyarrow's footer + page
headers omit the distinct_count field entirely; arrow-go's emit distinct_count:
0. This is a real cross-implementation byte-asymmetry for any consumer who
needs byte-identical Parquet output across language implementations.
Reproducer (minimal Go)
package main
import (
"bytes"
"fmt"
"github.com/apache/arrow-go/v18/arrow"
"github.com/apache/arrow-go/v18/arrow/array"
"github.com/apache/arrow-go/v18/arrow/memory"
"github.com/apache/arrow-go/v18/parquet"
"github.com/apache/arrow-go/v18/parquet/file"
"github.com/apache/arrow-go/v18/parquet/pqarrow"
)
func main() {
pool := memory.NewGoAllocator()
schema := arrow.NewSchema([]arrow.Field{{Name: "id", Type:
arrow.PrimitiveTypes.Int64, Nullable: false}}, nil)
builder := array.NewInt64Builder(pool)
builder.AppendValues([]int64{1, 2, 3}, nil)
arr := builder.NewArray()
rec := array.NewRecord(schema, []arrow.Array{arr}, 3)
var buf bytes.Buffer
props := parquet.NewWriterProperties(parquet.WithStats(true),
parquet.WithCompression(0)) // Uncompressed for clarity
arrowProps := pqarrow.NewArrowWriterProperties()
w, _ := pqarrow.NewFileWriter(schema, &buf, props, arrowProps)
w.Write(rec)
w.Close()
rdr, _ := file.OpenParquetReader(bytes.NewReader(buf.Bytes()), nil)
md := rdr.MetaData()
cm, _ := md.RowGroup(0).ColumnChunk(0)
stats, _ := cm.Statistics()
fmt.Printf("DistinctCount set? %v (value=%d)\n",
stats.HasDistinctCount(), stats.DistinctCount())
// Expected output (after fix): false (value=0)
// Actual output (today): true (value=0)
}
Root cause
metadata/statistics_types.gen.go lines 61-62 (NewInt32Statistics), 362-363
(NewInt64Statistics), 663-664, 942-943, 1235-1236 — each factory sets:
statistics: statistics{
descr: descr,
hasNullCount: true,
hasDistinctCount: true,
...
},
The hasDistinctCount: true flag should NOT default to true; it should flip
only when a caller explicitly invokes SetDistinctCount(...) with a known value.
(hasNullCount: true may need similar review — see "Out of scope" below.)
Proposed fix
In each numeric NewXxxStatistics factory in
metadata/statistics_types.gen.go, change hasDistinctCount: true to
hasDistinctCount: false. The flag flips to true only via explicit
SetDistinctCount() (or IncDistinct() after a real distinct-count calculation).
Backward compatibility
Output bytes change (smaller — distinct_count field removed when unknown).
All readers tolerate the optional-field-absent case (pyarrow already omits,
and all readers handle pyarrow output).
arrow-go's own IsSetDistinctCount() bool method already exists and behaves
correctly given the fix.
The behavior of consumers that explicitly call SetDistinctCount(N) remains
unchanged.
Test
Add a regression test: write a table without computing distinct_count; read
back; assert IsSetDistinctCount() == false.
Cross-impl conformance: produce a Parquet file via arrow-go and via
parquet-cpp on the same input + settings; assert footer thrift bytes match
(currently they don't because of this).
Out of scope (related but separate)
hasNullCount: true at the same construction sites — same factory bug
pattern. Whether to fix in the same PR is a design call. arrow-go and pyarrow
may align on null_count (it's typically computed during the write); flag for
separate review.
pyarrow's missing LogicalType emission knob for INT64 — separate cross-impl
byte-identity issue; tracked at the consumer's spec level (sealed-variant
pattern).
Adjacent precedent
Similar prior fix accepted upstream: apache/arrow-go#210
"fix(parquet/metadata): fix default unsigned int statistics" (merged
2024-12-10). Same code area, same kind of factory-default correction.
PRs welcome from our side if maintainers prefer; happy to follow up.
### Component(s)
Parquet
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]