Re: [DISCUSS] Proposal to turn ValueVectors into separate reusable library & project

Julian Hyde Mon, 26 Oct 2015 16:40:46 -0700

+100

Thanks for spearheading this, Jacques.


They say memory is the new disk. So, it’s no longer sufficient to use the same 
on-disk data format if we want our tools to interoperate. The idea of engines 
interoperating by reading the same in-memory temporary tables, and passing data 
from one engine to another, is very exciting.

Also exciting is the idea that, by pooling our resources, we can spend less 
time maintaining all of this tricky code. :)

I know that the Hive and Storm teams have done a lot of work in this area 
already, and have their own technology, but I will encourage them to be part of 
this initiative.

Julian


> On Oct 26, 2015, at 3:35 PM, Ted Dunning <[email protected]> wrote:
> 
> This sounds like a really good idea to me.
> 
> 
> 
> On Mon, Oct 26, 2015 at 2:50 PM, Julien Le Dem <[email protected]> wrote:
> 
>> +1, looking forward to vectorized Parquet Readers/Writers in Drill.
>> Making VV a standalone standard sounds great to me.
>> 
>> On Mon, Oct 26, 2015 at 2:46 PM, Parth Chandra <[email protected]> wrote:
>> 
>>> +1. Agree with Hanifi that we probably should have done this sooner :).
>>> Jason and I faced this need when trying to get a stand alone vectorized
>>> parquet reader out of the Drill code last year.
>>> 
>>> 
>>> 
>>> On Mon, Oct 26, 2015 at 2:37 PM, Hanifi Gunes <[email protected]> wrote:
>>> 
>>>> I was hoping to see this discussion happen sooner :) VVs have helped
>>>> Drill represent and move data around so flexibly that it would not be
>>>> hard to prove their usefulness to the community as a standalone library.
>>>> I am in support of this proposal.
>>>> 
>>>> 
>>>> -Hanifi
>>>> 
>>>> On Mon, Oct 26, 2015 at 2:19 PM, Jacques Nadeau <[email protected]>
>>>> wrote:
>>>> 
>>>>> Drillers,
>>>>> 
>>>>> 
>>>>> 
>>>>> A number of people have approached me recently about the possibility of
>>>>> collaborating on a shared columnar in-memory representation of data. This
>>>>> shared representation of data could be operated on efficiently with modern
>>>>> cpus as well as shared efficiently via shared memory, IPC and RPC. This
>>>>> would allow multiple applications to work together at high speed. Examples
>>>>> include moving back and forth between a library.
>>>>> 
>>>>> 
>>>>> 
>>>>> As I was discussing these ideas with people working on projects including
>>>>> Calcite, Ibis, Kudu, Storm, Heron, Parquet and products from companies
>>>>> like MapR and Trifacta, it became clear that much of what the Drill
>>>>> community has already constructed is very relevant to the goals of a new
>>>>> broader interchange and execution format. (In fact, Ted and I actually
>>>>> informally discussed extracting this functionality as a library more than
>>>>> two years ago.)
>>>>> 
>>>>> 
>>>>> 
>>>>> A standard will emerge around this need and it is in the best interest of
>>>>> the Drill community and the broader ecosystem if Drill’s ValueVectors
>>>>> concepts and code form the basis of a new library/collaboration/project.
>>>>> This means better interoperability, shared responsibility around
>>>>> maintenance and development and the avoidance of further division of the
>>>>> ecosystem.
>>>>> 
>>>>> 
>>>>> 
>>>>> A little background for some: Drill is the first project to create a
>>>>> powerful language agnostic in-memory representation of complex columnar
>>>>> data. We've learned a lot over the last three years about how to interface
>>>>> with these structures, manage memory associated with them, adjust their
>>>>> sizes, expose them in builder patterns, etc. That work is useful for a
>>>>> number of systems and it would be great if we could share the learning. By
>>>>> creating a new, well documented and collaborative library, people could
>>>>> leverage this functionality in a wider range of applications and systems.
>>>>> 
>>>>> 
>>>>> 
>>>>> I’ve seen the great success that libraries like Parquet and Calcite have
>>>>> been able to achieve due to their focus on APIs, extensibility and
>>>>> reusability and I think we could do the same with the Drill ValueVector
>>>>> codebase. The fact that this would allow higher speed interchange among
>>>>> many other systems and become the standard for in-memory columnar
>>>>> exchange (as opposed to having to adopt an external standard) makes this a
>>>>> great opportunity to both benefit the Drill community and give back to the
>>>>> broader Apache community.
>>>>> 
>>>>> 
>>>>> 
>>>>> As such, I’d like to open a discussion about taking this path. I think
>>>>> there would be various avenues of how to do this but my initial proposal
>>>>> would be to propose this as a new project that goes straight to a
>>>>> provisional TLP. We then would work to clean up layer responsibilities and
>>>>> extract pieces of the code into this new project where we collaborate with
>>>>> a wider group on a broader implementation (and more formal specification).
>>>>> 
>>>>> 
>>>>> Given the conversations I have had and the excitement and need for this, I
>>>>> think we should do this. If the community is supportive, we could probably
>>>>> see some really cool integrations around things like high-speed Python
>>>>> machine learning inside Drill operators before the end of the year.
>>>>> 
>>>>> 
>>>>> 
>>>>> I’ll open a new JIRA and attach it here where we can start a POC &
>>>>> discussion of how we could extract this code.
>>>>> 
>>>>> 
>>>>> Looking forward to feedback!
>>>>> 
>>>>> 
>>>>> Jacques
>>>>> 
>>>>> 
>>>>> --
>>>>> Jacques Nadeau
>>>>> CTO and Co-Founder, Dremio
>>>>> 
>>>> 
>>> 
>> 
>> 
>> 
>> --
>> Julien
>>
