Here it is as a pull request: https://github.com/apache/arrow/pull/7548
I hope this can be a starting point for an active conversation that dives into specifics, and I look forward to contributing more design and algorithm ideas as well as concrete code.

> On Jun 17, 2020, at 6:11 PM, Neal Richardson <neal.p.richard...@gmail.com> wrote:
>
> Maybe a draft pull request? If you put "WIP" in the pull request title, CI
> won't run builds on it, so it's suitable for rough outlines and collecting
> feedback.
>
> Neal
>
> On Wed, Jun 17, 2020 at 2:57 PM Radu Teodorescu
> <radukay...@yahoo.com.invalid> wrote:
>
>> Thank you Wes!
>> Yes, both proposals fit very nicely in your Data Frames vision, I see them
>> as deep dives on some specifics:
>> - the virtual array doc is more fluffy and, if you agree with the
>> general concept, the next logical move is probably to put out some interfaces
>> - the random access doc goes into more detail and I am curious what you
>> think about some of the concepts
>>
>> I will follow up shortly with some interfaces - do you prefer references
>> to a repo, inlining them in an email, or adding them as comments to your doc?
>>
>>
>>> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>>>
>>> hi Radu,
>>>
>>> I'll read the proposals in more detail when I can and make comments,
>>> but this has always been something of interest (see, e.g. [1]). The
>>> intent with the "C++ data frames" project that we've discussed (and I
>>> continue to labor towards, e.g. recent compute engine work is directly
>>> in service of this) has always been to be able to express computations
>>> on non-RAM-resident datasets [2]
>>>
>>> As one initial high level point of discussion, I think what you're
>>> describing in these documents should probably be _new_ C++ classes and
>>> _new_ virtual interfaces, not an evolution of the current arrow::Table
>>> or arrow::Array/ChunkedArray classes. One practical path forward in
>>> terms of discussing implementation issues would be to draft header
>>> files proposing what these new class interfaces look like.
>>>
>>> - Wes
>>>
>>> [1]: https://issues.apache.org/jira/browse/ARROW-1329
>>> [2]: https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>>>
>>> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
>>> <radukay...@yahoo.com.invalid> wrote:
>>>>
>>>> Hi folks,
>>>> While I’ve been communicating with some members of this group in the
>>>> past, this is my first official post so please excuse/correct/guide me as
>>>> needed.
>>>>
>>>> Logistics first:
>>>> I put most of the content of my proposals in Google Docs, but if more
>>>> appropriate, we can keep the conversation going by email.
>>>> Also the two proposals are pretty independent, so if needed we can
>>>> break this into two separate email threads, but for now I wanted to keep the
>>>> spam low.
>>>>
>>>> Actual proposals:
>>>> Virtual Array - The idea is to be able to handle arrow Tables where
>>>> some of the column data is not (yet) available in memory. For example a
>>>> Table can map to a parquet file, create VirtualArrays for each column chunk
>>>> and only read the actual content if and when the Array is touched.
>>>> Virtualize arrow Table <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>>>>
>>>> Random Access - I find that “application state” for most large scale
>>>> systems is compatible with low level vectorized arrow representation and I
>>>> propose a number of API expansions that would enable thread safe data
>>>> mutation and efficient random access.
>>>> Arrow arrays random access <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>>>>
>>>> Please let me know what you think and what is the best course of action
>>>> moving forward.
>>>> Thank you
>>>> Radu
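
To make the Virtual Array idea from the original proposal a bit more concrete, here is a minimal C++ sketch of a column chunk that defers loading until it is first touched. It is only an illustration under assumptions: LazyInt64Chunk, its Loader callback, and is_loaded() are hypothetical names that do not come from Arrow, the proposal documents, or the pull request, and a plain std::vector<int64_t> stands in for an Arrow buffer.

```cpp
#include <cstdint>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for one column chunk whose values are read lazily.
// Not an Arrow class; the names here are invented for illustration only.
class LazyInt64Chunk {
 public:
  // Callback that produces the chunk's values, e.g. by reading a file.
  using Loader = std::function<std::vector<int64_t>()>;

  explicit LazyInt64Chunk(Loader loader) : loader_(std::move(loader)) {}

  // Runs the loader on the first call only; later calls return the cached
  // values. The mutex makes the first-touch load safe across threads.
  const std::vector<int64_t>& values() const {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!loaded_) {
      data_ = loader_();
      loaded_ = true;
    }
    return data_;
  }

  // Reports whether the values have been materialized yet.
  bool is_loaded() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return loaded_;
  }

 private:
  Loader loader_;
  mutable std::mutex mutex_;
  mutable std::vector<int64_t> data_;
  mutable bool loaded_ = false;
};
```

In this sketch a table backed by a file would hold one such object per column chunk, and only the chunks whose values() is actually called would pay the read cost. Whether the real interfaces should look anything like this is exactly the kind of question the draft header files are meant to settle.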