Jefffrey commented on PR #5110:
URL: https://github.com/apache/arrow-rs/pull/5110#issuecomment-1822578087

   > I worry a bit about the performance impact of this, especially as I think 
for strings it will perform multiple allocations per-value.
   
   The casting back to `ParquetValueType` which would require allocation is 
quite ugly, I agree. `ColumnIndexBuilder` seemed the most natural and easiest 
place to put this boundary order check logic, as it'll be easier to keep track 
as new values are appended, but by that point the values are already pure bytes.
   
   Could try investigate pushing this check logic earlier up the chain, such as 
to here:
   
   
https://github.com/apache/arrow-rs/blob/df69ef57d055453c399fa925ad315d19211d7ab2/parquet/src/column/writer/mod.rs#L612-L667
   
   Since I think `Statistics` would hold the original values as 
`ParquetValueType`.
   
   Alternatively could try find a way to get `ParquetValueType` from bytes 
without allocation (though that kind of behaviour I'd assume would be unsafe to 
some degree).
   
   Could also try make `ColumnIndexBuilder` generic over `ParquetValueType` and 
then cast the values to bytes when building to thrift, after all checks are 
completed.
   
   > Perhaps we might do something simpler whereby the user can specify the 
boundary order via WriterProperties or something? This would also avoid 
potential issues where the boundary order configuration becomes data-dependent, 
which users might find surprising
   
   I'm not sure this should be writer configurable, as if I'm understanding 
correctly then `boundary_order` has to be data-dependent, unless the writer 
knows in advance the sort order for the min/max values of all pages in the 
column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to