Hi all,

I'd like to integrate Parquet with pandas, a popular Python library for
in-memory data analysis.
My plan is to build an efficient connector based on the parquet-cpp project
-- is that the recommended way to do this?
Somebody told me that Impala's Parquet reader is considerably more performant,
but also tightly integrated into Impala and hard to extract (I haven't checked
the licensing, or whether that would be allowed at all). Is this still correct?

According to the README, parquet-cpp currently supports reading Parquet
files but not writing them. Do you plan to support write access through this
C++ API in the future?
Furthermore, predicate pushdown would be very important for my use case --
does parquet-cpp support it? I haven't seen anything in the codebase yet.

My current plan is to start from this example [1] and write a thin wrapper
in Cython to expose some of the column reader functionality.
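
To illustrate the interface I have in mind, here is a pure-Python mock (no parquet-cpp involved; all names such as MockColumnReader and read_column_to_series are hypothetical stand-ins for what the Cython wrapper might expose):

```python
import numpy as np
import pandas as pd

class MockColumnReader:
    """Stands in for a Cython-wrapped parquet-cpp column reader.

    The real wrapper would hold a C++ column reader and pull decoded
    values out of a Parquet column chunk batch by batch.
    """
    def __init__(self, values):
        self._values = list(values)

    def read_batch(self, n):
        # Return up to n values and advance the reader.
        batch, self._values = self._values[:n], self._values[n:]
        return batch

def read_column_to_series(reader, batch_size=1024):
    # Loop over read_batch until the column is exhausted, accumulate
    # the values into NumPy arrays, and hand the result to pandas.
    chunks = []
    while True:
        batch = reader.read_batch(batch_size)
        if not batch:
            break
        chunks.append(np.asarray(batch))
    return pd.Series(np.concatenate(chunks))

reader = MockColumnReader(range(10))
s = read_column_to_series(reader, batch_size=4)
```

This is only a sketch of the Python-facing shape, not a claim about the actual parquet-cpp API.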

Any thoughts/remarks/concerns are highly appreciated.

Thanks,
 Peter

[1]
https://github.com/apache/incubator-parquet-cpp/blob/master/example/compute_stats.cc

-- 
Peter Prettenhofer