[
https://issues.apache.org/jira/browse/ARROW-18113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17621427#comment-17621427
]
Weston Pace edited comment on ARROW-18113 at 10/21/22 12:43 AM:
----------------------------------------------------------------
> Just to be clear: to the filesystem, or on the reader itself?
Oops, I mean on {{RandomAccessFile}}.
> Also, I'm not clear on: "Multiple returned futures may correspond to a single
> read. Or, a single returned future may be a combined result of several
> individual reads." Isn't this saying the same thing twice?
I might call
{noformat}
file->ReadMany({0, 3}, {3, 8}, {1024, 16Mi})
{noformat}.
The filesystem could then implement this as:
{noformat}
std::vector<Future<Buffer>> futures;
// The first two futures correspond to the same read
Future<Buffer> coalesced_read = ReadAsync(0, 8);
futures.push_back(coalesced_read.Then(buf => buf.Split(0, 3)));
futures.push_back(coalesced_read.Then(buf => buf.Split(3, 5)));
// The third future corresponds to two reads
Future<Buffer> part_one = ReadAsync(1024, 8Mi);
Future<Buffer> part_two = ReadAsync(1024+8Mi, 8Mi-1024);
futures.push_back(AllComplete({part_one, part_two}).Then(bufs =>
    Concatenate(bufs)));
{noformat}
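For anyone who wants to run the idea, here is a minimal self-contained sketch of the pseudocode above. It is not Arrow code: {{FakeFile}}, {{SliceOf}}, {{ConcatOf}} and {{ReadManySketch}} are hypothetical names, {{std::shared_future}} stands in for Arrow's {{Future}}, and the ranges are assumed to be {start, end} pairs (which is what the arithmetic in the pseudocode suggests). The sizes are parameters so the example can be scaled down.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <string>
#include <vector>

using BufferFuture = std::shared_future<std::string>;

// Hypothetical stand-in for RandomAccessFile: an in-memory byte string.
// The file must outlive every future it hands out.
struct FakeFile {
  std::string bytes;
  int read_count = 0;  // number of underlying reads actually performed

  // Stand-in for ReadAsync(offset, length). std::launch::deferred keeps the
  // sketch single-threaded: the read runs when the future is first waited on.
  BufferFuture ReadAsync(std::size_t offset, std::size_t length) {
    assert(offset + length <= bytes.size());
    return std::async(std::launch::deferred,
                      [this, offset, length]() -> std::string {
                        ++read_count;
                        return bytes.substr(offset, length);
                      })
        .share();
  }
};

// Plays the role of coalesced_read.Then(buf => buf.Split(offset, length)).
BufferFuture SliceOf(BufferFuture whole, std::size_t offset, std::size_t length) {
  return std::async(std::launch::deferred,
                    [whole, offset, length]() -> std::string {
                      return whole.get().substr(offset, length);
                    })
      .share();
}

// Plays the role of AllComplete(parts).Then(bufs => Concatenate(bufs)).
BufferFuture ConcatOf(std::vector<BufferFuture> parts) {
  return std::async(std::launch::deferred,
                    [parts]() -> std::string {
                      std::string out;
                      for (const auto& part : parts) out += part.get();
                      return out;
                    })
      .share();
}

// Serves the ranges [0, 3), [3, 8) and [big_start, big_end), splitting the
// last one after `chunk` bytes.
std::vector<BufferFuture> ReadManySketch(FakeFile& file, std::size_t big_start,
                                         std::size_t big_end, std::size_t chunk) {
  assert(big_start + chunk <= big_end);
  std::vector<BufferFuture> futures;
  // The first two futures correspond to the same read.
  BufferFuture coalesced_read = file.ReadAsync(0, 8);
  futures.push_back(SliceOf(coalesced_read, 0, 3));
  futures.push_back(SliceOf(coalesced_read, 3, 5));
  // The third future corresponds to two reads.
  BufferFuture part_one = file.ReadAsync(big_start, chunk);
  BufferFuture part_two =
      file.ReadAsync(big_start + chunk, big_end - big_start - chunk);
  futures.push_back(ConcatOf({part_one, part_two}));
  return futures;
}
```

The caller always gets one future per requested range, while {{read_count}} shows that three underlying reads served them: one shared by the first two futures and two combined into the third.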
> Implement a read range process without caching
> ----------------------------------------------
>
> Key: ARROW-18113
> URL: https://issues.apache.org/jira/browse/ARROW-18113
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Percy Camilo Triveño Aucahuasi
> Assignee: Percy Camilo Triveño Aucahuasi
> Priority: Major
>
> The current
> [ReadRangeCache|https://github.com/apache/arrow/blob/e06e98db356e602212019cfbae83fd3d5347292d/cpp/src/arrow/io/caching.h#L100]
> mixes caching with coalescing, which makes it difficult to implement readers
> that can actually perform concurrent reads on coalesced data (see this
> [GitHub comment|https://github.com/apache/arrow/pull/14226#discussion_r999334979]
> for additional context). For instance, the prebuffering feature of those
> readers currently cannot handle concurrent invocations.
> The goal of this ticket is to implement a component similar to
> ReadRangeCache that performs non-cached reads (doing only the coalescing
> part). Once we have that capability, we can port the Parquet and IPC readers
> to the new component and keep improving the reading process (that would be
> part of a separate set of follow-up tickets). Similar ideas were mentioned in
> https://issues.apache.org/jira/browse/ARROW-17599
> A good place to implement this capability may be inside the file system
> abstraction, as a dedicated method for reading coalesced data, for which the
> abstract file system can provide a default implementation.
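
As a rough illustration of the "coalescing only" step the ticket describes, the sketch below merges requested ranges whose gap is small enough into a single larger read. It is not Arrow's implementation: {{ReadRange}}, {{CoalesceSketch}} and the {{hole_size_limit}} parameter are names assumed for this example, and no caching is involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical byte range: `length` bytes starting at `offset`.
struct ReadRange {
  std::int64_t offset;
  std::int64_t length;
};

// Sorts the ranges by offset and merges neighbours whose gap is at most
// `hole_size_limit` bytes, returning the reads to actually issue.
std::vector<ReadRange> CoalesceSketch(std::vector<ReadRange> ranges,
                                      std::int64_t hole_size_limit) {
  assert(hole_size_limit >= 0);
  std::sort(ranges.begin(), ranges.end(),
            [](const ReadRange& a, const ReadRange& b) {
              return a.offset < b.offset;
            });
  std::vector<ReadRange> merged;
  for (const ReadRange& r : ranges) {
    if (!merged.empty()) {
      ReadRange& last = merged.back();
      const std::int64_t last_end = last.offset + last.length;
      if (r.offset - last_end <= hole_size_limit) {
        // Extend the previous read to cover this range (and the hole, if any).
        last.length = std::max(last_end, r.offset + r.length) - last.offset;
        continue;
      }
    }
    merged.push_back(r);
  }
  return merged;
}
```

With a limit of 0, the ranges (0, 3), (3, 5) and (1024, 100) become two reads, (0, 8) and (1024, 100). A real component would presumably also cap the size of a merged read and map each result back to the ranges that requested it.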