Proposal for arrow DataFrame low level structure and primitives (Was: Two proposals for expanding arrow Table API (virtual arrays and random access))

Radu Teodorescu Thu, 25 Jun 2020 09:50:59 -0700

Here it is as a pull request:
https://github.com/apache/arrow/pull/7548 


I hope this can be a starting point for an active conversation that dives into
specifics, and I look forward to contributing more design and algorithm ideas
as well as concrete code.

> On Jun 17, 2020, at 6:11 PM, Neal Richardson <neal.p.richard...@gmail.com> 
> wrote:
> 
> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
> won't run builds on it, so it's suitable for rough outlines and collecting
> feedback.
> 
> Neal
> 
> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
> <radukay...@yahoo.com.invalid> wrote:
> 
>> Thank you Wes!
>> Yes, both proposals fit very nicely into your Data Frames vision. I see them
>> as deep dives on some specifics:
>> - the virtual array doc is fluffier; if you agree with the general concept,
>> the next logical move is probably to put out some interfaces
>> - the random access doc goes into more detail, and I am curious what you
>> think about some of the concepts
>> 
>> I will follow up shortly with some interfaces. Would you prefer references
>> to a repo, interfaces inlined in an email, or comments added to your doc?
>> 
>> 
>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>> 
>>> hi Radu,
>>> 
>>> I'll read the proposals in more detail when I can and make comments,
>>> but this has always been something of interest (see, e.g. [1]). The
>>> intent with the "C++ data frames" project that we've discussed (and I
>>> continue to labor towards, e.g. recent compute engine work is directly
>>> in service of this) has always been to be able to express computations
>>> on non-RAM-resident datasets [2].
>>> 
>>> As one initial high level point of discussion, I think what you're
>>> describing in these documents should probably be _new_ C++ classes and
>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>> terms of discussing implementation issues would be to draft header
>>> files proposing what these new class interfaces look like.
>>> 
>>> - Wes
>>> 
>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>> [2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>> 
>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>> <radukay...@yahoo.com.invalid> wrote:
>>>> 
>>>> Hi folks,
>>>> While I’ve been communicating with some members of this group in the
>>>> past, this is my first official post so please excuse/correct/guide me as
>>>> needed.
>>>> 
>>>> Logistics first:
>>>> I put most of the content of my proposals in Google Docs, but if more
>>>> appropriate, we can keep the conversation going by email.
>>>> Also, the two proposals are pretty independent, so if needed we can
>>>> break them into two separate email threads, but for now I wanted to keep
>>>> the spam low.
>>>> 
>>>> Actual proposals:
>>>> Virtual Array - The idea is to handle arrow Tables where some of the
>>>> column data is not (yet) available in memory. For example, a Table can
>>>> map to a Parquet file, create a VirtualArray for each column chunk, and
>>>> read the actual content only if and when the Array is touched.
>>>> Virtualize arrow Table <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>>>> Random Access - I find that “application state” for most large-scale
>>>> systems is compatible with the low-level vectorized arrow representation,
>>>> and I propose a number of API expansions that would enable thread-safe
>>>> data mutation and efficient random access.
>>>> Arrow arrays random access <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>>>> Please let me know what you think and what the best course of action is
>>>> moving forward.
>>>> Thank you
>>>> Radu
>> 
>>
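
The Virtual Array idea quoted above (a Table maps to a Parquet file, and each
column chunk is read only if and when it is touched) can be illustrated with a
minimal Python sketch. Everything in it is hypothetical: the VirtualArray
class, its loader callback, and the in-memory lists standing in for Parquet
column chunks are assumptions made for illustration, not the interfaces from
the pull request and not an existing Arrow API.

```python
from typing import Callable, Dict, List, Optional


class VirtualArray:
    """Hypothetical column chunk whose values are loaded on first access."""

    def __init__(self, length: int, loader: Callable[[], List[int]]):
        self._length = length          # assumed known from file metadata
        self._loader = loader          # deferred read of the actual content
        self._values: Optional[List[int]] = None

    def __len__(self) -> int:
        # Answering the length does not require reading the data.
        return self._length

    @property
    def is_loaded(self) -> bool:
        return self._values is not None

    def __getitem__(self, index: int) -> int:
        # The first touch triggers the read; later accesses reuse the result.
        if self._values is None:
            values = self._loader()
            if len(values) != self._length:
                raise ValueError("loaded data does not match declared length")
            self._values = values
        return self._values[index]


reads: List[str] = []  # records which columns were actually read


def make_loader(name: str, data: List[int]) -> Callable[[], List[int]]:
    """Stand-in for reading one column chunk from a file."""
    def load() -> List[int]:
        reads.append(name)
        return list(data)
    return load


# A "table" here is just a mapping from column name to a virtual column.
table: Dict[str, VirtualArray] = {
    "a": VirtualArray(3, make_loader("a", [1, 2, 3])),
    "b": VirtualArray(3, make_loader("b", [10, 20, 30])),
}

value = table["b"][1]  # touches column "b" only; column "a" stays unread
```

In this sketch the length comes from metadata, so row counts are available
without any read, and touching column "b" leaves column "a" unloaded. A real
design would also have to address thread safety and eviction, which this
sketch ignores.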
