[
https://issues.apache.org/jira/browse/ARROW-11989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515990#comment-17515990
]
Eduardo Ponce commented on ARROW-11989:
---------------------------------------
While implementing lazy initialization of the internal offsets array used for
the binary search in {{GetScalar()}}, I found that this approach requires a
mutex to check/initialize the offsets array, because the immutability of
{{ChunkedArray}} implies it must be safe to access from multiple threads.
So the current solution always initializes the offsets array during
construction. This means the overhead of building the offsets is always
incurred, even if {{GetScalar()}} is never called or is called only a few times.
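To illustrate the thread-safety concern, here is a minimal sketch of what a lazily initialized offsets array would look like; this is not the actual Arrow code, and {{LazyResolver}} and its members are hypothetical names:

```cpp
#include <cstdint>
#include <mutex>
#include <utility>
#include <vector>

// Sketch: a const ChunkedArray may be read concurrently from many threads,
// so the first call that materializes the offsets must be synchronized
// (std::call_once / std::once_flag here).
class LazyResolver {
 public:
  explicit LazyResolver(std::vector<int64_t> chunk_lengths)
      : chunk_lengths_(std::move(chunk_lengths)) {}

  // Thread-safe lazy accessor: offsets_ is built exactly once.
  const std::vector<int64_t>& offsets() const {
    std::call_once(init_flag_, [this] {
      offsets_.reserve(chunk_lengths_.size() + 1);
      offsets_.push_back(0);
      for (int64_t len : chunk_lengths_) {
        offsets_.push_back(offsets_.back() + len);
      }
    });
    return offsets_;
  }

 private:
  std::vector<int64_t> chunk_lengths_;
  mutable std::once_flag init_flag_;
  mutable std::vector<int64_t> offsets_;
};
```

Eagerly building the offsets in the constructor avoids the {{once_flag}} and the mutable members entirely, at the cost of paying the construction overhead up front.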
This is the current performance:
{code}
Time to access example at i=0% : 20.7μs
Time to access example at i=10% : 6.9μs
Time to access example at i=20% : 6.7μs
Time to access example at i=30% : 6.9μs
Time to access example at i=40% : 6.8μs
Time to access example at i=50% : 6.9μs
Time to access example at i=60% : 6.9μs
Time to access example at i=70% : 6.8μs
Time to access example at i=80% : 6.8μs
Time to access example at i=90% : 6.8μs
{code}
The construction overhead can be reduced because at the moment the chunks are
scanned twice during construction:
1. Construct the offsets array for the binary search
2. Calculate the total ChunkedArray length and null count (the length can
actually be computed in O(1) from the offsets)
The two scans could be combined by adding {{length()}} and {{null_count()}}
methods to the {{ChunkResolver}} class. The drawback is that length and null
count are not needed for the ChunkResolver operations themselves, and storing
them as members increases the data structure size in both ChunkResolver and
ChunkedArray.
Experiments do not show a significant difference between the two approaches,
so the former was chosen as it is simpler.
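For reference, resolving a logical index against a precomputed offsets array looks roughly like this; this is only a sketch under assumed names ({{OffsetsResolver}}, {{Resolve}}), not the actual ChunkResolver API. Note how the total length falls out of the last offset in O(1), with no rescan of the chunks:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Sketch: offsets_[i] is the logical index of the first element of chunk i,
// with a final sentinel entry equal to the total length.
struct OffsetsResolver {
  std::vector<int64_t> offsets_;  // size = num_chunks + 1

  // Total length is the last offset: O(1), no rescan of the chunks.
  int64_t length() const { return offsets_.back(); }

  // O(log num_chunks) lookup: returns {chunk_index, index_within_chunk}.
  std::pair<int64_t, int64_t> Resolve(int64_t index) const {
    auto it = std::upper_bound(offsets_.begin(), offsets_.end(), index);
    int64_t chunk = (it - offsets_.begin()) - 1;
    return {chunk, index - offsets_[chunk]};
  }
};
```

This replaces the O(num_chunks) linear scan per access with an O(log num_chunks) binary search, which matches the flat access times in the measurements above.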
> [C++][Python] Improve ChunkedArray's complexity for the access of elements
> --------------------------------------------------------------------------
>
> Key: ARROW-11989
> URL: https://issues.apache.org/jira/browse/ARROW-11989
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python
> Affects Versions: 3.0.0
> Reporter: quentin lhoest
> Assignee: Eduardo Ponce
> Priority: Major
> Labels: pull-request-available
> Fix For: 8.0.0
>
> Time Spent: 8h
> Remaining Estimate: 0h
>
> Chunked arrays are stored as a C++ vector of Arrays.
> There is currently no indexing structure on top of the vector to allow for
> anything better than O(chunk) to access an arbitrary element.
> For example, with a Table consisting of 1 column “text” defined by:
> - 1024 chunks
> - each chunk is 1024 rows
> - each row is a text of 1024 characters
> Then the times it takes to access one example are:
> {code:java}
> Time to access example at i=0% : 6.7μs
> Time to access example at i=10% : 7.2μs
> Time to access example at i=20% : 9.1μs
> Time to access example at i=30% : 11.4μs
> Time to access example at i=40% : 13.8μs
> Time to access example at i=50% : 16.2μs
> Time to access example at i=60% : 18.7μs
> Time to access example at i=70% : 21.1μs
> Time to access example at i=80% : 26.8μs
> Time to access example at i=90% : 25.2μs
> {code}
> The times measured are the average times to do `table[“text”][j]` depending on
> the index we want to fetch (from the first example at 0% to the example at
> 90% of the length of the table).
> You can take a look at the code that produces this benchmark
> [here|https://pastebin.com/pSkYHQn9].
> Some discussions in [this thread on the mailing
> list|https://lists.apache.org/thread.html/r82d4cb40d72914977bf4c3c5b4c168ea03f6060d24279a44258a6394%40%3Cuser.arrow.apache.org%3E]
> suggested different approaches to improve the complexity:
> - use a contiguous array of chunk lengths, since having a contiguous array of
> lengths makes the iteration over the chunks lengths faster;
> - use a binary search, as in the Julia implementation
> [here|https://github.com/JuliaData/SentinelArrays.jl/blob/fe14a82b815438ee2e04b59bf7f337feb1ffd022/src/chainedvector.jl#L14];
> - use interpolation search.
> Apparently there is also a lookup structure in the compute layer
> [here|https://github.com/apache/arrow/blob/master/cpp/src/arrow/compute/kernels/vector_sort.cc#L94].
> cc [~emkornfield], [~wesm]
> Thanks again for the amazing work!
--
This message was sent by Atlassian Jira
(v8.20.1#820001)