josiahyan commented on pull request #8214:
URL: https://github.com/apache/arrow/pull/8214#issuecomment-696065926


   Yep, I'm worried that the buffers can be changed 
unintentionally/intentionally, and as you pointed out, needs to be always 
checked for and trapped.
   
   I'm wondering if the overhead of these checks are just inherent in such a 
flexible API, and quite significant (for ArrowBuf without flipping the 
`arrow.enable_unsafe_memory_access=false` flag globally, it does consume ~30% 
of CPU when `setSafe`ing), and worth removing for the general use-case. (where 
this by a Builder API that defers generalization, or a special (local) ArrowBuf 
mode if possible).
   
   We could jump into the unsafe APIs, like 
[this](https://github.com/apache/iceberg/issues/9#issuecomment-517486998) 
[code](https://github.com/apache/iceberg/blob/master/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java#L107),
 but a more generic out-of-the-box solution that makes the happy path for 
serialization easy would be nice. Ideally, we'd remove at least ~70% of the CPU 
cost of calling `setSafe` before this patch, for users that only want to append 
data, with an eye towards pushing it down even more given the simpler code.
   
   I'm less familiar with the Arrow code than I'd like to be though (right 
now), and I'm not sure if there are other ways around this that don't involve 
doing `arrow.enable_unsafe_memory_access=true`, and various other tweaks. There 
is  inherent value in keeping only one or two (high level) interfaces for 
building Arrow buffers, and optimizing for the common use case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to