jorgecarleitao edited a comment on pull request #8960:
URL: https://github.com/apache/arrow/pull/8960#issuecomment-748441499


   @yordan-pavlov 
   
   Thanks for the feedback. All great points.
   
   > The performance degradation in the filter u8 is interesting - do you have 
a hypothesis for what's causing this?
   
   I am sorry, I was not clear in the PR description:
   
   * `filter u8` is the case of "keep 50%" of the array
   * `filter u8 high selectivity` is the case "keep 90%" of the array.
   
   `filter u8` shows a 30% degradation, while `filter u8 high selectivity` is 2x
faster. Both are single-array filters.
   
   > I wonder if this could be explained again by this new implementation doing 
more work in advance, which works very well when filtering multiple columns but 
is a bit slower when filtering a single column.
   
   The single-array filter in this PR uses a single pass via an iterator, while
master was building the filter context inside `filter`. So there is less upfront
work and about 2x less memory consumption for the operation, since no vector is
allocated in `filter`; a vector is only allocated for the multi-array filter.
   
   This PR's implementation does perform more work per slot in order to minimize
the number of copies. I.e., while the total number of copied bytes is the same in
both implementations, this implementation is guaranteed to call memcpy the
minimum necessary number of times, by "packing" calls for slots that are
contiguous in memory into a single call. This implementation is thus optimized
for filters that take contiguous regions, which tends to happen when there are a
lot of 1s, or when the selected data happens to be clustered in the array. AFAIK,
in real data, this happens more often than by chance, so our benchmarks are being
conservative here.
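   As an illustration, here is a minimal sketch (not the PR's actual code) of the
"packing" idea: a boolean filter mask is turned into contiguous `start..end`
ranges, so each range can be copied with a single memcpy-style call instead of
one call per selected slot.

   ```rust
   // A minimal sketch (not the PR's actual code) of the "packing" idea:
   // turn a boolean filter mask into contiguous `start..end` ranges so that
   // each range can be copied with a single memcpy-style call instead of one
   // call per selected slot.
   fn contiguous_ranges(mask: &[bool]) -> Vec<std::ops::Range<usize>> {
       let mut ranges = Vec::new();
       let mut start: Option<usize> = None;
       for (i, &keep) in mask.iter().enumerate() {
           match (keep, start) {
               (true, None) => start = Some(i), // a new run of 1s begins
               (false, Some(s)) => {            // the current run ends at i
                   ranges.push(s..i);
                   start = None;
               }
               _ => {}                          // state unchanged
           }
       }
       if let Some(s) = start {
           ranges.push(s..mask.len());          // close a trailing run
       }
       ranges
   }

   fn main() {
       // [1, 1, 0, 1, 1, 1, 0] -> two grouped copies instead of five per-slot ones
       let mask = [true, true, false, true, true, true, false];
       assert_eq!(contiguous_ranges(&mask), vec![0..2, 3..6]);
   }
   ```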
   
   This behavior (of minimizing the number of calls) is crucial for non-primitive
types, because it also minimizes the number of relocations. The prime example
here is `StringArray`. There, we can't tell how much to reserve for the "data"
buffer from the number of 1s in the filter (it depends on the actual number of
bytes in each slot). Therefore, when building the new array, the calls ("copy
slot [2]", "copy slot [3]") vs "copy slots [2, 3]" are very different: the former
can cause up to 2 reallocations, while the latter causes at most 1. The more
complex and larger the data structure is, the worse this gets.
   
   By grouping these "extend" calls together, we reduce the number of relocations
when building the new array. This is not as relevant for primitive types, because
there the buffer size is known upfront from the number of 1s and the data type.
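   As a hypothetical illustration of why this matters for variable-length data
(the layout below is simplified, not the actual `StringArray` code): with an
offsets-plus-data layout, copying slots `[2]` and `[3]` separately takes two
extends of the target buffer (each one may trigger a reallocation), while copying
the contiguous range covering slots `[2, 3]` is a single extend.

   ```rust
   // Simplified StringArray-like layout: offsets + one contiguous data buffer.
   fn main() {
       // values: ["a", "bb", "ccc", "dddd"] stored back-to-back in `data`
       let data: &[u8] = b"abbcccdddd";
       let offsets: [usize; 5] = [0, 1, 3, 6, 10];

       // per-slot copies of slots 2 and 3: two extends, each may reallocate
       let mut per_slot: Vec<u8> = Vec::new();
       per_slot.extend_from_slice(&data[offsets[2]..offsets[3]]);
       per_slot.extend_from_slice(&data[offsets[3]..offsets[4]]);

       // grouped copy of the contiguous range covering slots 2..=3: one extend
       let mut grouped: Vec<u8> = Vec::new();
       grouped.extend_from_slice(&data[offsets[2]..offsets[4]]);

       assert_eq!(per_slot, grouped); // same bytes, fewer calls
   }
   ```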
   
   The implementation just performs this computation on the fly (via an
`Iterator`), so that, in single-filter ops, it happens while building the new
array (and it is cached for multi-filter ops).
   
   > highly selective filters (mostly 0s in the filter array)
   
   This PR calls filters that have mostly `1s` "highly selective", as they
"select" many items, but maybe the naming is incorrect? If it is, could we come
up with a term other than `selective` altogether? IMO, `highly selective` for
selecting few items is confusing.
   
   > I also wonder how repeatable the benchmarks are now that they use randomly 
generated arrays. What are your observations; are the benchmarks results fairly 
stable across multiple runs?
   
   Good point. The randomness is there so that we make no assumptions about the
distribution of the data (e.g. `i % 2 == 0` is highly predictable). We do this in
most benches. The results are reproducible because the seed is set to a constant
(independent of the thread, machine, or time).
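
   For example, a seeded generator along these lines (assuming the `rand` crate
with its 0.8-style API; not necessarily the exact bench code) produces the same
data on every run:

   ```rust
   // A hypothetical sketch of seeded, reproducible bench data: a constant seed
   // means every run generates the same array, independent of machine or time.
   use rand::{rngs::StdRng, Rng, SeedableRng};

   fn random_bools(len: usize, density_of_1s: f64) -> Vec<bool> {
       let mut rng = StdRng::seed_from_u64(42); // constant seed => reproducible
       (0..len).map(|_| rng.gen_bool(density_of_1s)).collect()
   }

   fn main() {
       let a = random_bools(1024, 0.5);
       let b = random_bools(1024, 0.5);
       assert_eq!(a, b); // identical across calls, runs, and machines
   }
   ```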

