Re: Two proposals for expanding arrow Table API (virtual arrays and random access)

Radu Teodorescu Wed, 17 Jun 2020 14:57:15 -0700

Thank you Wes!
Yes, both proposals fit very nicely into your Data Frames vision; I see them as 
deep dives on some specifics:
- the virtual array doc is more fluffy, and if you agree with the general 
concept, the next logical move is indeed to put out some interfaces
- the random access doc goes into more detail, and I am curious what you think 
about some of the concepts


I will follow up shortly with some interfaces. Would you prefer that I link to 
a repo, inline them in an email, or add them as comments to your doc?
 

> On Jun 17, 2020, at 4:26 PM, Wes McKinney <[email protected]> wrote:
> 
> hi Radu,
> 
> I'll read the proposals in more detail when I can and make comments,
> but this has always been something of interest (see, e.g. [1]). The
> intent with the "C++ data frames" project that we've discussed (and I
> continue to labor towards, e.g. recent compute engine work is directly
> in service of this) has always been to be able to express computations
> on non-RAM-resident datasets [2]
> 
> As one initial high level point of discussion, I think what you're
> describing in these documents should probably be _new_ C++ classes and
> _new_ virtual interfaces, not an evolution of the current arrow::Table
> or arrow::Array/ChunkedArray classes. One practical path forward in
> terms of discussing implementation issues would be to draft header
> files proposing what these new class interfaces look like.
> 
> - Wes
> 
> [1]: https://issues.apache.org/jira/browse/ARROW-1329
> [2]: 
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
> 
> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
> <[email protected]> wrote:
>> 
>> Hi folks,
>> While I’ve been communicating with some members of this group in the past, 
>> this is my first official post so please excuse/correct/guide me as needed.
>> 
>> Logistics first:
>> I put most of the content of my proposals in Google Docs, but if more 
>> appropriate, we can keep the conversation going by email.
>> Also, the two proposals are pretty independent, so if needed we can break 
>> them into two separate email threads, but for now I wanted to keep the spam low.
>> 
>> Actual proposals:
>> Virtual Array - The idea is to be able to handle arrow Tables where some of 
>> the column data is not (yet) available in memory. For example a Table can 
>> map to a parquet file, create VirtualArrays for each column chunk and only 
>> read the actual content if and when the Array is touched.
>> Virtualize arrow Table 
>> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>> Random Access - I find that “application state” for most large scale systems 
>> is compatible with low level vectorized arrow representation and I propose a 
>> number of API expansions that would enable thread safe data mutation and 
>> efficient random access.
>> Arrow arrays random access 
>> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>> Please let me know what you think and what is the best course of action 
>> moving forward.
>> Thank you
>> Radu
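
For readers of this thread, the "read only if and when the Array is touched" idea from the Virtual Array proposal can be sketched roughly as below. This is a minimal illustration under stated assumptions, not an interface from either linked doc and not part of the Arrow C++ API: the names VirtualInt64Chunk, Int64Chunk and Loader are hypothetical, and a plain std::vector stands in for a real column chunk.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a materialized column chunk; not an Arrow type.
using Int64Chunk = std::vector<int64_t>;

// Hypothetical lazily materialized column chunk: the loader callback (for
// example, something that reads one Parquet column chunk) runs only when the
// data is first touched.
class VirtualInt64Chunk {
 public:
  using Loader = std::function<Int64Chunk()>;

  explicit VirtualInt64Chunk(Loader loader) : loader_(std::move(loader)) {
    assert(loader_ && "a loader callback is required");
  }

  // True once the data has been read into memory.
  bool is_loaded() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return data_ != nullptr;
  }

  // Returns the data, invoking the loader on first access only.
  // The mutex serializes the first load across threads; once set,
  // data_ is never replaced, so the returned reference stays valid.
  const Int64Chunk& data() const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!data_) {
      data_ = std::make_unique<Int64Chunk>(loader_());
    }
    return *data_;
  }

 private:
  Loader loader_;
  mutable std::mutex mutex_;
  mutable std::unique_ptr<Int64Chunk> data_;
};
```

Constructing a VirtualInt64Chunk does no I/O; the first call to data() triggers the loader and later calls reuse the cached result. Whether this belongs in new classes, as Wes suggests, or elsewhere is exactly the open question of the thread.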
