Jefffrey commented on PR #5110: URL: https://github.com/apache/arrow-rs/pull/5110#issuecomment-1822578087
> I worry a bit about the performance impact of this, especially as I think for strings it will perform multiple allocations per-value. The casting back to `ParquetValueType` which would require allocation is quite ugly, I agree. `ColumnIndexBuilder` seemed the most natural and easiest place to put this boundary order check logic, as it'll be easier to keep track as new values are appended, but by that point the values are already pure bytes. Could try investigate pushing this check logic earlier up the chain, such as to here: https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/parquet/src/column/writer/mod.rs#L612-L667 Since I think `Statistics` would hold the original values as `ParquetValueType`. Alternatively could try find a way to get `ParquetValueType` from bytes without allocation (though that kind of behaviour I'd assume would be unsafe to some degree). Could also try make `ColumnIndexBuilder` generic over `ParquetValueType` and then cast the values to bytes when building to thrift, after all checks are completed. > Perhaps we might do something simpler whereby the user can specify the boundary order via WriterProperties or something? This would also avoid potential issues where the boundary order configuration becomes data-dependent, which users might find surprising I'm not sure this should be writer configurable, as if I'm understanding correctly then `boundary_order` has to be data-dependent, unless the writer knows in advance the sort order for the min/max values of all pages in the column. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
