Thank you Wes!
Yes, both proposals fit very nicely in your Data Frames vision; I see them as deep dives on some specifics:
- the virtual array doc is more fluffy, and if you agree with the general concept, the next logical move is probably to put out some interfaces indeed
- the random access doc goes into more detail, and I am curious what you think about some of the concepts
I will follow up shortly with some interfaces - do you prefer references to a repo, inlining them in an email, or adding them as comments to your doc?

> On Jun 17, 2020, at 4:26 PM, Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Radu,
>
> I'll read the proposals in more detail when I can and make comments,
> but this has always been something of interest (see, e.g. [1]). The
> intent with the "C++ data frames" project that we've discussed (and I
> continue to labor towards, e.g. recent compute engine work is directly
> in service of this) has always been to be able to express computations
> on non-RAM-resident datasets [2]
>
> As one initial high level point of discussion, I think what you're
> describing in these documents should probably be _new_ C++ classes and
> _new_ virtual interfaces, not an evolution of the current arrow::Table
> or arrow::Array/ChunkedArray classes. One practical path forward in
> terms of discussing implementation issues would be to draft header
> files proposing what these new class interfaces look like.
>
> - Wes
>
> [1]: https://issues.apache.org/jira/browse/ARROW-1329
> [2]:
> https://docs.google.com/document/d/1XHe_j87n2VHGzEbnLe786GHbbcbrzbjgG8D0IXWAeHg/edit#heading=h.g70gstc7jq4h
>
> On Wed, Jun 17, 2020 at 2:48 PM Radu Teodorescu
> <radukay...@yahoo.com.invalid> wrote:
>>
>> Hi folks,
>> While I’ve been communicating with some members of this group in the past,
>> this is my first official post so please excuse/correct/guide me as needed.
>>
>> Logistics first:
>> I put most of the content of my proposals in google doc, but if more
>> appropriate, we can keep the conversation going by email.
>> Also the two proposals are pretty independent, so if needed we can break it
>> into two separate email threads, but for now I wanted to keep the spam low
>>
>> Actual proposals:
>> Virtual Array - The idea is to be able to handle arrow Tables where some of
>> the column data is not (yet) available in memory. For example a Table can
>> map to a parquet file, create VirtualArrays for each column chunk and only
>> read the actual content if and when the Array is touched.
>> Virtualize arrow Table
>> <https://docs.google.com/document/d/1qXSHSgMZtjNGzWrqDxoBisSoR6gbnRiEztnYihNGLsI/edit?usp=sharing>
>> Random Access - I find that “application state” for most large scale systems
>> is compatible with low level vectorized arrow representation and I propose a
>> number of API expansions that would enable thread safe data mutation and
>> efficient random access.
>> Arrow arrays random access
>> <https://docs.google.com/document/d/1tIsOhN6mfIAy6F8XRxeKRIqPBN0gKbcmrp2QJ4L3hJ8/edit?usp=sharing>
>> Please let me know what you think and what is the best course of action
>> moving forward.
>> Thank you
>> Radu
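
To make the Virtual Array idea above more concrete, here is a minimal C++ sketch of a column chunk that defers reading its values until it is first touched. Everything in it is hypothetical: the class name VirtualInt64Array, the Loader callback and the method names are illustrative assumptions, not part of the Arrow C++ API and not taken from the linked proposal. A callback returning a std::vector stands in for an actual Parquet column chunk read, and the sketch is deliberately single-threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch, NOT an Arrow class: a column chunk whose values are
// produced by a loader callback the first time any value is requested.
class VirtualInt64Array {
 public:
  // Assumption: the loader stands in for "read this column chunk from a file".
  using Loader = std::function<std::vector<int64_t>()>;

  VirtualInt64Array(std::size_t length, Loader loader)
      : length_(length), loader_(std::move(loader)) {}

  // The length is assumed to come from file metadata, so it never loads data.
  std::size_t length() const { return length_; }

  // True once the underlying values have been materialized in memory.
  bool is_loaded() const { return loaded_; }

  // Touching a value materializes the whole chunk on first use.
  int64_t Value(std::size_t i) const {
    assert(i < length_);
    EnsureLoaded();
    return data_[i];
  }

 private:
  // Not thread safe: a real design would need synchronization here.
  void EnsureLoaded() const {
    if (loaded_) return;
    data_ = loader_();
    assert(data_.size() == length_);
    loaded_ = true;
  }

  std::size_t length_;
  Loader loader_;
  mutable std::vector<int64_t> data_;
  mutable bool loaded_ = false;
};
```

A table built from such chunks could report its shape without reading anything, and the loader would run at most once per chunk, on the first call to Value(). Whether laziness belongs at the chunk, column or table level is a design question this sketch does not try to settle.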