Great to meet you all! I've recently been collaborating with the Apache Drill team to spin out the ValueVector columnar in-memory data structure into a new standalone project called Arrow [1] [2]. In brief, Arrow/ValueVectors permits O(1) random access into nested columnar structures and supports efficient projections and scans in a columnar SQL setting.
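To illustrate the idea with a simplified sketch (this is not Arrow's actual memory layout, just the core principle): a nested "list of int" column can be stored as a flat values buffer plus an offsets buffer, which gives O(1) random access to any row without scanning.

```python
# Simplified sketch of a nested ("list of int") column stored columnar-style.
# NOT the actual Arrow/ValueVector layout -- just the core idea: a flat
# values buffer plus an offsets buffer gives O(1) access to any row.

values = [1, 2, 3, 4, 5, 6]   # all leaf values, concatenated
offsets = [0, 2, 2, 5, 6]     # row i spans values[offsets[i]:offsets[i+1]]

def get_row(i):
    """O(1) random access to the i-th list in the column."""
    return values[offsets[i]:offsets[i + 1]]

print(get_row(0))  # [1, 2]
print(get_row(1))  # []  (an empty list occupies no space in `values`)
print(get_row(2))  # [3, 4, 5]
```

Because both buffers are contiguous, projections and scans touch only the columns they need, which is where the efficiency in a columnar SQL setting comes from.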
I'm very interested in making Parquet read/write support available to Python programmers via C/C++ extensions, so I'm going to be working over the next few months on a Parquet->Arrow->Python toolchain, along with some tools for manipulating in-memory columnar table data in the style of Python's pandas library. I will propose patches as needed to parquet-cpp to improve its performance and to add functionality for writing Parquet files. The details of converting to/from Parquet's repetition/definition level representation of nested data will stay separate in the arrow-parquet adapter code.

cheers,
Wes

[1]: http://mail-archives.apache.org/mod_mbox/drill-dev/201510.mbox/%3CCAJrw0OSVoirU_EUrBBqKY12uDi_f8U9MP7J_6Puuh_DmcyzS9g%40mail.gmail.com%3E
[2]: http://permalink.gmane.org/gmane.comp.apache.incubator.drill.devel/16490

On Fri, Jan 15, 2016 at 1:22 AM, Mickaël Lacour <[email protected]> wrote:
> Hi,
>
> I'm very interested in this subject because I would like to export parquet
> data from HDFS to Vertica (using VSQL).
> I'm planning to work on it next quarter, but I will be very happy to help
> you on this subject (review, testing).
>
> Have a nice day,
> --
> Mickaël Lacour
> Senior Software Engineer
> Analytics Infrastructure team @Scalability
>
> ________________________________________
> From: Walkauskas, Stephen Gregory (Vertica) <[email protected]>
> Sent: Thursday, January 14, 2016 3:23 PM
> To: Sandryhaila, Aliaksei; [email protected]; Majeti, Deepak;
> [email protected]; Wes McKinney
> Subject: Re: Parquet-cpp
>
> Yes, thanks for the introduction, Julien.
>
> Nong and Wes,
>
> It'd be interesting to know your goals for parquet-cpp.
>
> The Vertica database already supports optimized reads of ORC files (fast
> C++ parser, predicate pushdown, column selection, etc.). We'd like to do
> the same for Parquet.
>
> Cheers,
> Stephen
>
> On 01/13/2016 05:53 PM, Sandryhaila, Aliaksei wrote:
> > Thank you for the introduction, Julien!
> >
> > Hello Nong and Wes,
> >
> > Stephen, Deepak and I are developing a C++ library to support Parquet in
> > the Vertica RDBMS. We are using parquet-cpp as a starting point and are
> > expanding its functionality as well as improving it and fixing bugs. We
> > would like to contribute these improvements back to the open-source
> > community. We plan to do this through the usual process of creating
> > JIRAs that justify and explain a code change, and then submitting pull
> > requests. We look forward to working with you on parquet-cpp and to your
> > feedback and suggestions.
> >
> > Best regards,
> > Aliaksei.
> >
> >
> > On 01/13/2016 02:54 PM, Julien Le Dem wrote:
> >> Hello Nong, Wes, Stephen, Deepak and Aliaksei,
> >> I wanted to introduce you to each other, as you are all looking at
> >> parquet-cpp.
> >>
> >> I'd recommend opening JIRAs in the parquet-cpp component to collaborate (I
> >> see you already doing this):
> >> https://issues.apache.org/jira/browse/PARQUET-418?jql=project%20%3D%20PARQUET%20AND%20component%20%3D%20parquet-cpp
> >>
> >> Nong is a committer and can merge pull requests (he also understands that
> >> code base very well).
> >> Other committers can too; feel free to ping us if you need help.
> >> Obviously, you don't need to be a committer to give others reviews (you
> >> just need one to approve and merge).
> >>
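As a footnote on the repetition/definition level representation mentioned above: the sketch below shows the Dremel-style idea in simplified form, for an optional list<int> column only. This is an illustration of the encoding concept, not parquet-cpp's actual API, and the level numbering assumes the usual two-level case (def 0 = null list, def 1 = empty list, def 2 = value present; rep 0 starts a new record).

```python
def encode(rows):
    """Encode an optional list<int> column (each row is None, [], or a
    list of ints) into Dremel-style (rep_level, def_level, value) triples.
    Simplified sketch -- not parquet-cpp's actual API."""
    out = []
    for row in rows:
        if row is None:
            out.append((0, 0, None))       # the list itself is null
        elif len(row) == 0:
            out.append((0, 1, None))       # list present but empty
        else:
            for j, v in enumerate(row):
                rep = 0 if j == 0 else 1   # rep 0 starts a new record
                out.append((rep, 2, v))    # value fully defined
    return out

def decode(triples):
    """Reconstruct the rows from the level-encoded triples."""
    rows = []
    for rep, dfn, v in triples:
        if rep == 0:                       # new record begins
            rows.append(None if dfn == 0 else [])
        if dfn == 2:                       # a real value to append
            rows[-1].append(v)
    return rows

rows = [[1, 2], None, [], [3]]
encoded = encode(rows)
print(encoded)
# [(0, 2, 1), (1, 2, 2), (0, 0, None), (0, 1, None), (0, 2, 3)]
print(decode(encoded) == rows)  # True: the encoding round-trips
```

Keeping this level shredding/assembly logic in the arrow-parquet adapter, as described above, lets parquet-cpp stay focused on the file format itself.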
