zeroshade commented on PR #14989:
URL: https://github.com/apache/arrow/pull/14989#issuecomment-1399068130
@minyoung I found the issue, and the solution isn't a hack:
In `parquet/file/column_writer_types.gen.go.tmpl`, lines 143-147, change this:
```go
if w.bitsBuffer != nil {
	w.writeValuesSpaced(vals, info.batchNum, w.bitsBuffer.Bytes(), 0)
} else {
	w.writeValuesSpaced(vals, info.batchNum, validBits,
		validBitsOffset+valueOffset)
}
```
to this instead:
```go
if w.bitsBuffer != nil {
	w.writeValuesSpaced(vals, batch, w.bitsBuffer.Bytes(), 0)
} else {
	w.writeValuesSpaced(vals, batch, validBits,
		validBitsOffset+valueOffset)
}
```
Note the change from `info.batchNum` to `batch`. It turns out that in this
scenario we should pass the full batch size to the write, not just the number
of raw (non-null) values it found; that way it correctly calculates the number
of nulls in the parent (non-leaf) columns. After making this change, re-run
`go generate` so that all of the writers are regenerated.
Finally, on line 289 of `pqarrow/column_readers.go`, change the argument to
`BuildArray` from `validityIO.Read` to `lenBound`. That way it creates a
correctly sized null bitmap buffer.
In my testing that was enough to solve the problem. Leave your new tests in
for good measure so this bug doesn't creep back :)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]