jorgecarleitao commented on pull request #8960:
URL: https://github.com/apache/arrow/pull/8960#issuecomment-748441499
@yordan-pavlov
Thanks for the feedback. All great points.
> The performance degradation in the filter u8 is interesting - do you have
a hypothesis for what's causing this?
I am sorry, I was not clear in the PR description:
* `filter u8` is the case of "keep 50%" of the array
* `filter u8 high selectivity` is the case "keep 90%" of the array.
`filter u8` shows a ~30% degradation, while `filter u8 high selectivity` is 2x
faster. Both are single-array filters.
> I wonder if this could be explained again by this new implementation doing
more work in advance, which works very well when filtering multiple columns but
is a bit slower when filtering a single column.
The single-filter path in this PR uses a single pass via an iterator, while
master built the context inside `filter`. So there is less up-front work, and
half the memory consumption on the operation, as no vector is allocated in
`filter`; one is allocated only in the multi-filter path.
This PR's implementation does perform more work per slot, in order to minimize
the number of copies. I.e. while the total number of copied bytes is the same
in both implementations, this implementation is guaranteed to call memcpy the
minimum necessary number of times, by "packing" calls together into a single
call when the regions are contiguous in memory. This implementation is thus
optimized for filters that select contiguous regions, which tends to happen
when there are a lot of 1s, or when the data is distributed in the array in
that way. AFAIK, in real data this happens more often than by chance, so our
benchmarks are being conservative here.
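To make the "packing" idea concrete, here is a minimal sketch (not the actual
Arrow code, and the function name `mask_to_ranges` is mine) of turning a
boolean filter mask into contiguous index ranges, so each run of 1s becomes a
single copy instead of one copy per selected slot:

```rust
// Sketch: collapse a boolean filter mask into contiguous index ranges.
// Each range can then be serviced by ONE memcpy-style call.
fn mask_to_ranges(mask: &[bool]) -> Vec<std::ops::Range<usize>> {
    let mut ranges = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &keep) in mask.iter().enumerate() {
        match (keep, start) {
            (true, None) => start = Some(i), // a run of 1s begins
            (false, Some(s)) => {            // the run ends
                ranges.push(s..i);
                start = None;
            }
            _ => {}
        }
    }
    if let Some(s) = start {
        ranges.push(s..mask.len()); // trailing run
    }
    ranges
}

fn main() {
    // mask [1, 1, 0, 1, 1, 1] -> two copies instead of five
    let ranges = mask_to_ranges(&[true, true, false, true, true, true]);
    assert_eq!(ranges, vec![0..2, 3..6]);
    println!("{:?}", ranges);
}
```

The more 1s the mask has, the longer the runs tend to be, which is why the
high-selectivity case benefits most.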
This behavior (of minimizing the number of calls) is crucial for non-primitive
types because it minimizes the number of reallocations. The prime example here
is `StringArray`. There, we cannot tell how much to reserve for the "data"
buffer from the number of 1s in the filter alone (it depends on the actual
number of bytes in each slot). Therefore, when building the new array, issuing
the calls "copy slot [2]", "copy slot [3]" is slower than "copy slots [2,3]":
the former can cause up to 2 reallocations, while the latter causes at most 1.
The more complex and larger the data structure, the worse this gets.
By grouping these "extend"s together, we reduce the number of reallocations
when building the new array. This is less relevant for primitive types,
because there we know the buffer size from the number of 1s and the data type.
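A small sketch of why a run of slots is one copy for a `StringArray`-like
layout (the layout below mirrors Arrow's offsets + values buffers, but the
code and the `copy_runs` helper are illustrative, not the PR's code):

```rust
// Sketch: variable-length slots stored as an offsets buffer plus a
// flat values buffer. A contiguous run of slots maps to ONE contiguous
// byte range in `values`, so the whole run is one extend call.
fn copy_runs(
    offsets: &[usize],
    values: &[u8],
    runs: &[std::ops::Range<usize>],
) -> Vec<u8> {
    let mut out = Vec::new();
    for r in runs {
        // bytes of slots r.start..r.end are contiguous: a single copy,
        // i.e. "copy slots [2,3]" rather than two per-slot copies
        out.extend_from_slice(&values[offsets[r.start]..offsets[r.end]]);
    }
    out
}

fn main() {
    // slots: "ab", "c", "def" -> offsets [0, 2, 3, 6]
    let offsets = [0usize, 2, 3, 6];
    let values = b"abcdef";
    // keep slots 1 and 2 as one run: a single extend copies "cdef"
    let out = copy_runs(&offsets, values, &[1..3]);
    assert_eq!(out, b"cdef");
}
```

Per-slot copies would instead call `extend_from_slice` once per slot, each of
which may independently trigger a reallocation of `out`.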
The implementation performs this computation on the fly (via an Iterator), so
that, in single-filter ops, it happens during the build of the array (and is
cached for multi-filter ops).
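The lazy-vs-cached distinction can be sketched like this (again my own
illustrative code, assuming the run-computation is exposed as an iterator):

```rust
// Sketch: a single filter consumes the range iterator directly (no
// intermediate Vec); a multi-column filter collects it once and reuses
// the same ranges for every column.
fn ranges_iter<'a>(
    mask: &'a [bool],
) -> impl Iterator<Item = std::ops::Range<usize>> + 'a {
    let mut i = 0;
    std::iter::from_fn(move || {
        while i < mask.len() && !mask[i] {
            i += 1; // skip 0s
        }
        if i >= mask.len() {
            return None;
        }
        let start = i;
        while i < mask.len() && mask[i] {
            i += 1; // consume the run of 1s
        }
        Some(start..i)
    })
}

fn main() {
    let mask = [true, false, true, true];
    // single filter: lazily consumed, no Vec of ranges allocated
    let selected: usize = ranges_iter(&mask).map(|r| r.len()).sum();
    assert_eq!(selected, 3);
    // multi filter: collect once, reuse for each column
    let cached: Vec<_> = ranges_iter(&mask).collect();
    assert_eq!(cached, vec![0..1, 2..4]);
}
```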
> highly selective filters (mostly 0s in the filter array)
This PR is using "highly selective" for filters that have mostly `1s`, as they
"select" many items, but maybe that naming is backwards? If it is, could we
come up with something other than `selective` altogether? IMO, using `highly
selective` for selecting few items is confusing.
> I also wonder how repeatable the benchmarks are now that they use randomly
generated arrays. What are your observations; are the benchmarks results fairly
stable across multiple runs?
Good point. The randomness is there to avoid assumptions about the
distribution of the data (e.g. `i % 2 == 0` is highly predictable); we do this
in most benches. The results are reproducible because the seed is set to a
constant (independent of the thread, machine, or time).
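For illustration, a tiny stand-in for the seeded RNG the benches use (the LCG
below and its constants are mine, purely to show that a fixed seed makes the
"random" data identical on every run):

```rust
// Sketch: a constant seed -> the same "random" benchmark data on every
// run and every machine. Tiny LCG in place of a real seeded RNG.
struct Lcg(u64);

impl Lcg {
    fn next(&mut self) -> u64 {
        // multiplier/increment from Knuth's MMIX LCG
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0
    }
}

fn gen(seed: u64, n: usize) -> Vec<u64> {
    let mut rng = Lcg(seed);
    (0..n).map(|_| rng.next()).collect()
}

fn main() {
    const SEED: u64 = 42; // constant: independent of thread, machine, time
    let a = gen(SEED, 5);
    let b = gen(SEED, 5);
    assert_eq!(a, b); // same seed -> same data, run after run
}
```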
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]