minyoung commented on PR #14989:
URL: https://github.com/apache/arrow/pull/14989#issuecomment-1381106835
@zeroshade the `+1` is almost certainly a hack. I tried digging into this,
but am not sure what the right solution is.
Disclaimer, my investigation may or may not correct since I'm not super
familiar with the codebase and am making leaps of assumptions at various points.
What I've found so far is:
At the very lowest level, the panic is caused by `firstTimeBitmapWriter`
being given less data than expected
```golang
buf := []byte{0} // only enough buffer for the data provided to AppendWord
bw := utils.NewFirstTimeBitmapWriter(buf, 0, 9) // we say that there will be
9 values
bw.AppendWord(0b11111111, 8) // we give 8 values
bw.Finish() // panic writing the 9'th bit out
```
And that discrepancy is coming from
[defLevelsToBitmapInternal](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/file/level_conversion.go#L172-L173):
* e.g. `defLevelsToBitmapInternal(defLevels=[0, 1, 2, 3, 4, 5, 6, 7, 8],
out=&{ValidBits: [0]}, hasRepeatedParent=true)`
* there is only one byte available in `out.ValidBits`
* there are >8 definition levels given, with some values being 0 (meaning
that the root value is null)
* the length of this slice is used to initialize `NewFirstTimeBitmapWriter`
* since `hasRepeatedParent = true`, only a
[subset](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/file/level_conversion.go#L151-L163)
of the definition levels are given to the bitmap writer
* this is where `AppendWord` is called with fewer bits than indicated in
`NewFirstTimeBitmapWriter`
So, a thought I had would be to make `firstTimeBitmapWriter` tolerate
getting less data which means having `firstTimeBitmapWriter.Finish` not write
out `bw.curByte` (this is what's ultimately causing the panic) if there's no
value to write out. But I'm unclear how to do that since testing for
`bw.curByte == 0` is almost certainly wrong (0 is a valid value to want to
write out).
The other way then is to increase the available buffer (`out.ValidBits`),
this is where the `+1` comes into play. Well, why are we giving 8 bits when
there are 9 definition levels? This has something to do with the nested
structure of the data. To simplify things a little, we're only dealing with a
single deeply nested leaf column, so there are multiple columns readers all
wrapping the same definition levels used at the leaf.
At the top of the nesting, enough bytes are given to
`file.DefRepLevelsToListInfo`/`file.DefLevelsToBitmap`, where it does not error
and returns the [number of values
read](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/file/level_conversion.go#L176).
This [feeds to the next
layer](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/pqarrow/column_readers.go#L289),
where we [allocate enough bytes for the indicated number of
values](https://github.com/apache/arrow/blob/go/v10.0.1/go/parquet/pqarrow/column_readers.go#L272).
So, the parent reads 8 values (remember `hasRepeatedParent=true`, so only a
subset is used), the child allocates space for that many values (1 byte), and
we then ultimately end up at `defLevelsToBitmapInternal(defLevels=[0, 1, 2, 3,
4, 5, 6, 7, 8], out=&{ValidBits: [0]}, hasRepeatedParent=true)`.
Maybe the repetition/definition levels should be pruned at each level so
that each child only sees a subset of those values instead of the full set of
values?
Maybe the parent should tell the child to allocate as much space as it was
told to allocate?
The `+1` hack works because `firstTimeBitmapWriter` will at most want to
write one last byte in `Finish` (and this is why we need 8 values to trigger
the panic).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]