[ 
https://issues.apache.org/jira/browse/ARROW-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112529#comment-17112529
 ] 

David Li commented on ARROW-8875:
---------------------------------

This was implemented in ARROW-8692: 
https://github.com/apache/arrow/commit/9a5d010556ac6b9e30115e41fff281870f48e830

Or do you think there are further optimizations beyond this? (This will be in 
the next release.)

> [C++] use AWS SDK SetResponseStreamFactory to avoid a copy of bytes
> -------------------------------------------------------------------
>
>                 Key: ARROW-8875
>                 URL: https://issues.apache.org/jira/browse/ARROW-8875
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remi Dettai
>            Priority: Major
>              Labels: C++, S3
>             Fix For: 1.0.0
>
>
> Currently, in `GetObjectRange` of s3fs the `GetObjectRequest` has no 
> `ResponseStreamFactory` assigned. This means that the bytes returned by the 
> S3 API are first sent to a `std::basic_stringbuf`. To my understanding this 
> has two performance impacts:
>  * `std::basic_stringbuf` uses a growing array to buffer the response, which 
> causes many reallocations
>  * on top of that, you have a copy operation from the `std::basic_stringbuf` 
> when data is read into the Arrow buffer.
> This seems to be a bit costly.
> With `ResponseStreamFactory`, we might manage to get the data directly into 
> the Arrow buffer.
> I can give it a try, but I would need some advice. Is there an existing 
> utility to stream data into an Arrow buffer (if it exists, it is well 
> hidden!)? Or should I stream the data into a plain array and then transfer 
> ownership to Arrow?
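
For illustration only, the idea in the description above (have the response written straight into preallocated memory instead of a growing `std::basic_stringbuf`) can be sketched with the standard library alone. `FixedBufferStreamBuf` and `ReceiveInto` are hypothetical names invented for this sketch, not Arrow or AWS SDK APIs, and the code is not taken from the ARROW-8692 commit. With the AWS SDK, a stream built on a buffer like this is what the factory given to `SetResponseStreamFactory` would presumably return; that wiring is omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <streambuf>
#include <string>
#include <vector>

// Hypothetical helper (not an Arrow or AWS SDK class): a streambuf whose
// put area is a caller-owned, fixed-size region. Bytes written through it
// land directly in the destination, with no intermediate growing buffer.
class FixedBufferStreamBuf : public std::streambuf {
 public:
  FixedBufferStreamBuf(char* data, std::size_t size) {
    setp(data, data + size);  // put area is [data, data + size)
  }

  // Number of bytes written into the destination so far.
  std::size_t bytes_written() const {
    return static_cast<std::size_t>(pptr() - pbase());
  }

  // overflow() is deliberately not overridden: the default returns EOF,
  // so writes past the end fail instead of triggering a reallocation.
};

// Stand-in for the HTTP client writing a response body into the stream.
// Returns the number of bytes that ended up in `dest`.
std::size_t ReceiveInto(std::vector<char>& dest, const std::string& payload) {
  FixedBufferStreamBuf buf(dest.data(), dest.size());
  std::ostream out(&buf);
  out.write(payload.data(), static_cast<std::streamsize>(payload.size()));
  return buf.bytes_written();
}
```

In this sketch the destination is a `std::vector<char>`; in the scenario from the description it would be the memory of an Arrow buffer sized to the requested byte range, which is known up front for a ranged GET. A payload larger than the destination is truncated and leaves the stream in a failed state rather than growing the buffer.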



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
