frbvianna commented on issue #37976:
URL: https://github.com/apache/arrow/issues/37976#issuecomment-1778312704
Hey @zeroshade, I eventually got somewhere by estimating values row by row,
column by column, so the size of the entire record batch is built up from
each individual value. I tried to combine everything discussed here while
using only the `BufferSpec` layout objects. I haven't put much effort into
optimizing the write yet (e.g. by slicing each chunk, then immediately
writing and releasing it).

From what I've checked, it mostly works as an upper-bound estimate when the
maximum chunk size is large enough: many rows land in each chunk, and their
slightly overestimated individual sizes add up to compensate for what seems
to be the IPC flatbuffer message header on each chunk record (a couple of kB
at most), which I have neglected so far.
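
To put a rough number on that compensation: the estimator charges one byte per row for every bitmap buffer, while the bitmap itself needs only one bit per row. Below is a minimal sketch of the resulting slack. It assumes IPC buffer padding can be ignored and uses a hypothetical 2 KiB header figure; `bitmapSlack` and `rowsToCover` are illustrative names, not part of Arrow.

```go
package main

import "fmt"

// bitmapSlack returns by how many bytes a one-byte-per-row estimate
// overshoots the real bitmap size, for `rows` rows across `bitmapCols`
// bitmap buffers. Assumption: buffer padding and alignment are ignored.
func bitmapSlack(rows, bitmapCols uint64) uint64 {
	actual := (rows + 7) / 8 // bytes needed to hold `rows` bits
	return bitmapCols * (rows - actual)
}

// rowsToCover returns the smallest row count whose bitmap slack is at
// least headerBytes. It reports false when there are no bitmap buffers,
// because then this source of overestimation never accumulates.
func rowsToCover(headerBytes, bitmapCols uint64) (uint64, bool) {
	if bitmapCols == 0 {
		return 0, false
	}
	for rows := uint64(0); ; rows++ {
		if bitmapSlack(rows, bitmapCols) >= headerBytes {
			return rows, true
		}
	}
}

func main() {
	// Hypothetical figures: a 2048-byte header and 4 nullable columns.
	rows, ok := rowsToCover(2048, 4)
	fmt.Println(rows, ok) // prints "586 true"
}
```

Under those assumptions, chunks holding fewer rows than this could still come out larger than estimated, which matches the observation that the estimate only holds for large enough chunk sizes.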

Can you please take a look? Any feedback or ideas on how we might estimate
the flatbuffer header ahead of writing would be much appreciated.
Thank you.
```go
// estimateRowSize sums the estimated size in bytes of row rowIdx
// across all columns.
func estimateRowSize(cols []arrow.Array, rowIdx int) uint64 {
	var size uint64
	for _, col := range cols {
		size += estimateRowValueSize(col, rowIdx)
	}
	return size
}

// estimateRowValueSize estimates the bytes that a single value of col
// occupies, walking the buffers declared by the column type's layout.
func estimateRowValueSize(col arrow.Array, rowIdx int) uint64 {
	var size uint64
	for _, bufSpec := range col.DataType().Layout().Buffers {
		switch bufSpec.Kind {
		case arrow.KindFixedWidth:
			// size of fixed-width primitive types or
			// the var-width offset type (int32 or int64)
			size += uint64(bufSpec.ByteWidth)
		case arrow.KindBitmap:
			// null indicator bitmap
			// upper-end estimation of one byte for each element
			size++
		case arrow.KindVarWidth:
			// binary-like variable-width types
			// element size is the difference between consecutive
			// offsets, offsets[rowIdx+1] - offsets[rowIdx]
			// (assumes array.BinaryLike exposes ValueLen in the
			// Arrow Go version in use)
			if bin, ok := col.(array.BinaryLike); ok {
				size += uint64(bin.ValueLen(rowIdx))
			}
		default:
			// arrow.KindAlwaysNull represents zero allocations
			continue
		}
	}
	return size
}
```
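
As a rough illustration of how these per-row estimates could drive the slicing step mentioned above, here is a minimal sketch that greedily groups consecutive rows into chunks under a byte budget. It is deliberately independent of Arrow: `planChunks` and `chunkRange` are hypothetical names, and the input slice is assumed to hold the result of `estimateRowSize` for each row.

```go
package main

import "fmt"

// chunkRange is a half-open row interval [Start, End).
type chunkRange struct{ Start, End int }

// planChunks greedily groups consecutive rows so that the summed
// estimates of each chunk stay within maxBytes. A single row whose
// estimate exceeds maxBytes is placed in a chunk of its own.
func planChunks(rowSizes []uint64, maxBytes uint64) []chunkRange {
	var chunks []chunkRange
	start := 0
	var acc uint64
	for i, sz := range rowSizes {
		// Close the current chunk if adding this row would overflow it,
		// but never emit an empty chunk.
		if i > start && acc+sz > maxBytes {
			chunks = append(chunks, chunkRange{start, i})
			start, acc = i, 0
		}
		acc += sz
	}
	if start < len(rowSizes) {
		chunks = append(chunks, chunkRange{start, len(rowSizes)})
	}
	return chunks
}

func main() {
	sizes := []uint64{40, 40, 40, 120, 10} // made-up per-row estimates
	fmt.Println(planChunks(sizes, 100))
	// prints "[{0 2} {2 3} {3 4} {4 5}]"
}
```

Each range could then be turned into a zero-copy slice with the record's `NewSlice(int64(r.Start), int64(r.End))`, written, and released. Since the flatbuffer header is not part of the estimate, reserving a fixed allowance by passing a slightly smaller `maxBytes` may be a pragmatic stopgap.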