maartenbreddels opened a new pull request #8621:
URL: https://github.com/apache/arrow/pull/8621


   There is one obvious loose end in this PR: where to generate the
`std::set` based on the `TrimOptions` (currently in the ctor of
`UTF8TrimBase`). I'm not sure what the lifetime guarantees for the
`TrimOptions` object are, where it makes sense to initialize this set, and,
when a UTF8 decoding error occurs, how/where to report it.
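
   To make this concrete, here is a rough sketch of the ctor-side
construction (the decoder, the `valid` flag, and the exact member names
below are made up for illustration; the real kernel would use Arrow's own
UTF8 utilities and `Status` machinery):

```cpp
// Sketch only: build the set of codepoints to trim once, up front.
// TrimOptions/UTF8TrimBase mirror the PR; DecodeCodepoint and the
// `valid` flag are stand-ins for Arrow's real UTF8/Status machinery.
#include <cstdint>
#include <set>
#include <string>

struct TrimOptions {
  std::string characters;  // UTF8-encoded characters to trim
};

// Minimal UTF8 decoder for the sketch (no overlong/range checks).
bool DecodeCodepoint(const uint8_t*& it, const uint8_t* end, uint32_t& out) {
  if (it >= end) return false;
  uint8_t b = *it++;
  int n;
  if (b < 0x80) { out = b; return true; }
  if ((b & 0xE0) == 0xC0) { out = b & 0x1F; n = 1; }
  else if ((b & 0xF0) == 0xE0) { out = b & 0x0F; n = 2; }
  else if ((b & 0xF8) == 0xF0) { out = b & 0x07; n = 3; }
  else { return false; }
  for (int i = 0; i < n; ++i) {
    if (it >= end || (*it & 0xC0) != 0x80) return false;
    out = (out << 6) | uint32_t(*it++ & 0x3F);
  }
  return true;
}

struct UTF8TrimBase {
  std::set<uint32_t> codepoints;
  bool valid = true;

  explicit UTF8TrimBase(const TrimOptions& options) {
    const auto* it =
        reinterpret_cast<const uint8_t*>(options.characters.data());
    const auto* end = it + options.characters.size();
    uint32_t c;
    while (it < end) {
      if (!DecodeCodepoint(it, end, c)) {
        // A ctor cannot return a Status, which is exactly the
        // reporting problem described above.
        valid = false;
        break;
      }
      codepoints.insert(c);
    }
  }
};
```

   The sketch makes the awkward part visible: a constructor has no clean
way to surface the decoding error to the caller.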
   
   Although this is not a costly operation (assuming people don't pass in a
billion characters to trim), I do wonder what the best approach here is in
general. It does not make much sense to rebuild the `std::set` on every
`Exec` call, but that is what happens now. (The same seems to happen in
`TransformMatchSubstring`, which creates its `prefix_table` per call, btw.)
   
   A good place to put per-kernel precomputed results might be the
`*Options` objects, but I'm not sure whether that fits the current
architecture.
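
   Purely as an illustration of that idea, the precomputed state could live
on the options object and be built lazily on first use (all names below are
hypothetical, and the UTF8 decoding from the previous sketch is elided):

```cpp
#include <cstdint>
#include <mutex>
#include <set>
#include <string>

// Hypothetical: an options object that owns its precomputed state and
// builds it lazily on first use instead of on every Exec call.
struct TrimOptions {
  std::string characters;

  const std::set<uint32_t>& codepoint_set() const {
    // std::call_once keeps this thread-safe if the same options
    // object is shared across threads; UTF8 decoding elided here.
    std::call_once(once_, [this] {
      for (unsigned char c : characters) set_.insert(c);
    });
    return set_;
  }

 private:
  mutable std::once_flag once_;
  mutable std::set<uint32_t> set_;
};
```

   That keeps every `Exec` after the first one cheap, at the cost of
mutable state (and non-copyability, since `std::once_flag` can't be copied)
on an otherwise value-like options object.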
   
   Another idea is to explore alternatives to the `std::set`. Based on the
TrimManyAscii benchmark, `std::unordered_set` seemed a bit slower, and
replacing the set lookup in the predicate with a simple linear search,
`std::find(options.characters.begin(), options.characters.end(), c) !=
options.characters.end()`, doesn't seem to affect performance much.
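
   To make the comparison concrete, the two predicates differ only in the
lookup; a sketch, assuming the trim characters have already been decoded
into codepoints:

```cpp
#include <algorithm>
#include <cstdint>
#include <set>
#include <vector>

// Set-based predicate: O(log n) lookups, but node-based storage means
// pointer chasing even for tiny n.
bool IsTrimCharSet(const std::set<uint32_t>& chars, uint32_t c) {
  return chars.find(c) != chars.end();
}

// Linear-scan predicate: O(n) lookups, but the scan over contiguous
// memory is cache-friendly, and n is typically a handful of characters.
bool IsTrimCharLinear(const std::vector<uint32_t>& chars, uint32_t c) {
  return std::find(chars.begin(), chars.end(), c) != chars.end();
}
```

   For the handful of characters people typically trim, the contiguous
linear scan is cheap enough that it keeps up with the `std::set`.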
   
   In CPython, a bloom filter is used for this; I could explore whether
that makes sense here, but the existing bloom filter implementation in
Arrow lives under the parquet namespace.
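
   For reference, CPython's "bloom filter" (in `Objects/unicodeobject.c`)
is essentially a single machine word used as a bitmask, so replicating the
idea would be tiny and wouldn't need the parquet implementation at all; a
sketch:

```cpp
#include <cstddef>
#include <cstdint>

// One-word bloom filter in the style of CPython's unicodeobject.c:
// each codepoint maps to one bit of a 64-bit mask. A clear bit means
// "definitely not a trim character"; a set bit means "maybe", so an
// exact lookup is still needed on a hit.
using BloomMask = uint64_t;
constexpr uint32_t kBloomWidth = 64;

constexpr BloomMask BloomBit(uint32_t c) {
  return BloomMask(1) << (c & (kBloomWidth - 1));
}

BloomMask MakeBloomMask(const uint32_t* chars, size_t n) {
  BloomMask mask = 0;
  for (size_t i = 0; i < n; ++i) mask |= BloomBit(chars[i]);
  return mask;
}

inline bool BloomMaybeContains(BloomMask mask, uint32_t c) {
  return (mask & BloomBit(c)) != 0;
}
```

   Since most characters in typical input are not trim characters, the mask
would reject them in a single AND, falling back to the exact lookup only on
(rare) bit collisions.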