Great to meet you all! I've recently been collaborating with the Apache Drill team to spin out the ValueVector columnar in-memory data structure into a new standalone project called Arrow [1] [2]. In brief, Arrow/ValueVectors permits O(1) random access into nested columnar structures and supports efficient projections and scans in a columnar SQL setting.
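To illustrate the idea with a simplified sketch (this is not Arrow's actual memory layout, just the core principle): a nested "list of int" column can be stored as a flat values buffer plus an offsets buffer, which gives O(1) random access to any row without scanning.

```python
# Simplified sketch of a nested ("list of int") column stored columnar-style.
# NOT the actual Arrow/ValueVector layout -- just the core idea: a flat
# values buffer plus an offsets buffer gives O(1) access to any row.

values = [1, 2, 3, 4, 5, 6]   # all leaf values, concatenated
offsets = [0, 2, 2, 5, 6]     # row i spans values[offsets[i]:offsets[i+1]]

def get_row(i):
    """O(1) random access to the i-th list in the column."""
    return values[offsets[i]:offsets[i + 1]]

print(get_row(0))  # [1, 2]
print(get_row(1))  # []  (an empty list occupies no space in `values`)
print(get_row(2))  # [3, 4, 5]
```

Because both buffers are contiguous, projections and scans touch only the columns they need, which is where the efficiency in a columnar SQL setting comes from.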
I'm very interested in making Parquet read/write support available to Python programmers via C/C++ extensions, so I'm going to be working over the next few months on a Parquet->Arrow->Python toolchain, along with some tools for manipulating in-memory columnar table data in the style of Python's pandas library. I will propose patches as needed to parquet-cpp to improve its performance and to add functionality for writing Parquet files. The details of converting to/from Parquet's repetition/definition level representation of nested data will stay separate in the arrow-parquet adapter code.

cheers,
Wes

[1]: http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
[2]: http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490

On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <[email protected]> wrote:
> Hi,
>
> I'm very interested in this subject because I would like to export parquet
> data from HDFS to Vertica (using VSQL).
> I'm planning to work on it next quarter, but I will be very happy to help
> you on this subject (review, testing).
>
> Have a nice day,
> --
> Mickaël Lacour
> Senior Software Engineer
> Analytics Infrastructure team @Scalability
>
> ________________________________________
> From: Walkauskas, Stephen Gregory (Vertica) <[email protected]>
> Sent: Thursday, January 14, 2016 3:23 PM
> To: Sandryhaila, Aliaksei; [email protected]; Majeti, Deepak;
> [email protected]; Wes McKinney
> Subject: Re: Parquet-cpp
>
> Yes, thanks for the introduction, Julien.
>
> Nong and Wes,
>
> It'd be interesting to know your goals for parquet-cpp.
>
> The Vertica database already supports optimized reads of ORC files (fast
> C++ parser, predicate pushdown, column selection, etc.). We'd like to do
> the same for Parquet.
>
> Cheers,
> Stephen
>
> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
> > Thank you for the introduction, Julien!
> >
> > Hello Nong and Wes,
> >
> > Stephen, Deepak and I are developing a C++ library to support Parquet in
> > the Vertica RDBMS. We are using parquet-cpp as a starting point and are
> > expanding its functionality as well as improving it and fixing bugs. We
> > would like to contribute these improvements back to the open-source
> > community. We plan to do this through the usual process of creating
> > JIRAs that justify and explain a code change, and then submitting pull
> > requests. We look forward to working with you on parquet-cpp and to your
> > feedback and suggestions.
> >
> > Best regards,
> > Aliaksei.
> >
> >
> > On 01/13/2016 02:54 PM, Julien Le Dem wrote:
> >> Hello Nong, Wes, Stephen, Deepak and Aliaksei,
> >> I wanted to introduce you to each other, as you are all looking at
> >> parquet-cpp.
> >>
> >> I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
> >> see you already doing this):
> >> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
> >>
> >> Nong is a committer and can merge pull requests (he also understands that
> >> code base very well).
> >> Other committers can too; feel free to ping us if you need help.
> >> Obviously, you don't need to be a committer to give others reviews (you
> >> just need one to approve and merge).
> >>
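As a footnote on the repetition/definition level representation mentioned above: the sketch below shows the Dremel-style idea in simplified form, for an optional list<int> column only. This is an illustration of the encoding concept, not parquet-cpp's actual API, and the level numbering assumes the usual two-level case (def 0 = null list, def 1 = empty list, def 2 = value present; rep 0 starts a new record).

```python
def encode(rows):
    """Encode an optional list<int> column (each row is None, [], or a
    list of ints) into Dremel-style (rep_level, def_level, value) triples.
    Simplified sketch -- not parquet-cpp's actual API."""
    out = []
    for row in rows:
        if row is None:
            out.append((0, 0, None))       # the list itself is null
        elif len(row) == 0:
            out.append((0, 1, None))       # list present but empty
        else:
            for j, v in enumerate(row):
                rep = 0 if j == 0 else 1   # rep 0 starts a new record
                out.append((rep, 2, v))    # value fully defined
    return out

def decode(triples):
    """Reconstruct the rows from the level-encoded triples."""
    rows = []
    for rep, dfn, v in triples:
        if rep == 0:                       # new record begins
            rows.append(None if dfn == 0 else [])
        if dfn == 2:                       # a real value to append
            rows[-1].append(v)
    return rows

rows = [[1, 2], None, [], [3]]
encoded = encode(rows)
print(encoded)
# [(0, 2, 1), (1, 2, 2), (0, 0, None), (0, 1, None), (0, 2, 3)]
print(decode(encoded) == rows)  # True: the encoding round-trips
```

Keeping this level shredding/assembly logic in the arrow-parquet adapter, as described above, lets parquet-cpp stay focused on the file format itself.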
