Re: [PR] Parquet: derive boundary order when writing [arrow-rs]

via GitHub Wed, 22 Nov 2023 03:17:59 -0800


Jefffrey commented on PR #5110:
URL: https://github.com/apache/arrow-rs/pull/5110#issuecomment-1822578087

> I worry a bit about the performance impact of this, especially as I think
for strings it will perform multiple allocations per-value.

The casting back to `ParquetValueType` which would require allocation is
quite ugly, I agree. `ColumnIndexBuilder` seemed the most natural and easiest
place to put this boundary order check logic, as it'll be easier to keep track
as new values are appended, but by that point the values are already pure bytes.

Could try investigate pushing this check logic earlier up the chain, such as
to here:

https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/parquet/src/column/writer/mod.rs#L612-L667

Since I think `Statistics` would hold the original values as
`ParquetValueType`.

Alternatively could try find a way to get `ParquetValueType` from bytes
without allocation (though that kind of behaviour I'd assume would be unsafe to
some degree).

Could also try make `ColumnIndexBuilder` generic over `ParquetValueType` and
then cast the values to bytes when building to thrift, after all checks are
completed.

> Perhaps we might do something simpler whereby the user can specify the
boundary order via WriterProperties or something? This would also avoid
potential issues where the boundary order configuration becomes data-dependent,
which users might find surprising

I'm not sure this should be writer configurable, as if I'm understanding
correctly then `boundary_order` has to be data-dependent, unless the writer
knows in advance the sort order for the min/max values of all pages in the
column.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Parquet: derive boundary order when writing [arrow-rs]

Reply via email to