[ https://issues.apache.org/jira/browse/ARROW-8875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17112883#comment-17112883 ]
Remi Dettai commented on ARROW-8875:
------------------------------------

Also, I think the comments at lines 482:483 no longer apply: [https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L482|https://github.com/apache/arrow/blob/master/cpp/src/arrow/filesystem/s3fs.cc#L383-L392]

> [C++] use AWS SDK SetResponseStreamFactory to avoid a copy of bytes
> -------------------------------------------------------------------
>
>                 Key: ARROW-8875
>                 URL: https://issues.apache.org/jira/browse/ARROW-8875
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Remi Dettai
>            Priority: Major
>              Labels: C++, S3
>             Fix For: 1.0.0
>
>
> Currently, in `GetObjectRange` of s3fs, the `GetObjectRequest` has no
> `ResponseStreamFactory` assigned. This means that the bytes returned by the
> S3 API are first sent to a `std::basic_stringbuf`. To my understanding, this
> has two performance impacts:
> * `std::basic_stringbuf` uses a growing array to buffer the response, which
> causes many allocations
> * on top of that, there is a copy operation from the `std::basic_stringbuf`
> when the data is read into the Arrow buffer.
> This seems to be a bit costly.
> With `ResponseStreamFactory`, we might manage to get the data directly into
> the Arrow buffer.
> I can give it a try, but I would need some advice. Is there an existing
> utility to stream data into an Arrow buffer (if it exists, it is well
> hidden!)? Or should I stream the data into a plain array and then transfer
> ownership to Arrow?

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
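
For illustration, here is a minimal sketch of the idea described in the issue: a stream whose writes land directly in a preallocated, caller-owned block of memory, so there is no growing intermediate buffer and no second copy. It uses only the standard library. `FixedBufferStreamBuf`, `FixedBufferStream` and `WriteChunks` are hypothetical names, not part of Arrow or the AWS SDK. The assumption is that a factory returning such a stream could be handed to `SetResponseStreamFactory`, with the memory coming from an Arrow buffer; neither step is shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iostream>
#include <streambuf>
#include <string>

// Hypothetical helper: a streambuf whose put area is a caller-owned,
// fixed-size block of memory (for example, the data of a preallocated buffer).
class FixedBufferStreamBuf : public std::streambuf {
 public:
  FixedBufferStreamBuf(char* data, std::size_t size) {
    assert(data != nullptr || size == 0);
    // Writes go straight into [data, data + size); nothing is reallocated.
    setp(data, data + size);
  }

  // Number of bytes written into the destination so far.
  std::size_t bytes_written() const {
    return static_cast<std::size_t>(pptr() - pbase());
  }
  // overflow() is deliberately not overridden: the default returns EOF, so a
  // write past the end fails and the owning stream gets badbit set.
};

// Hypothetical helper: an iostream that owns the streambuf above. It derives
// from std::iostream on the assumption that the SDK expects that stream type.
class FixedBufferStream : public std::iostream {
 public:
  FixedBufferStream(char* data, std::size_t size)
      : std::iostream(nullptr), buf_(data, size) {
    rdbuf(&buf_);  // installs the buffer and clears the initial badbit
  }

  std::size_t bytes_written() const { return buf_.bytes_written(); }

 private:
  FixedBufferStreamBuf buf_;
};

// Stand-in for a producer (such as an HTTP client) that delivers a response
// body in chunks. Stops early if the stream enters a failed state.
void WriteChunks(std::ostream& out, const std::string& body,
                 std::size_t chunk_size) {
  assert(chunk_size > 0);
  for (std::size_t pos = 0; pos < body.size() && out; pos += chunk_size) {
    const std::size_t n = std::min(chunk_size, body.size() - pos);
    out.write(body.data() + pos, static_cast<std::streamsize>(n));
  }
}
```

This only works when the destination size is known up front, which is the case for a ranged read: the requested byte range gives an upper bound on the response length. A stream handed to the SDK would also need to be allocated in whatever way the SDK expects to free it.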