Is there a dumbed-down summary of how and why the in-memory and on-disk formats differ? Is it mostly about aligning things for SIMD/vectorization?
There is probably some ignorance in my question, but I'm comfortable with that. :-)

-----Original Message-----
From: Wes McKinney [mailto:w...@cloudera.com]
Sent: Thursday, February 25, 2016 12:12 PM
To: dev@arrow.apache.org
Subject: Re: Comparing with Parquet

We wrote about this in a recent blog post:

http://blog.cloudera.com/blog/2016/02/introducing-apache-arrow-a-fast-interoperable-in-memory-columnar-data-structure-standard/

"Apache Parquet is a compact, efficient columnar data storage designed for storing large amounts of data stored in HDFS. Arrow is an ideal in-memory “container” for data that has been deserialized from a Parquet file, and similarly in-memory Arrow data can be serialized to Parquet and written out to a filesystem like HDFS or Amazon S3. Arrow and Parquet are thus companion projects."

For example, one of my personal motivations for being involved in both Arrow and Parquet is to use Arrow as the in-memory container for data deserialized from Parquet for use in Python and R.

- Wes

On Thu, Feb 25, 2016 at 8:20 AM, Henry Robinson <he...@cloudera.com> wrote:
> Think of Parquet as a format well-suited to writing very large datasets to
> disk, whereas Arrow is a format best suited to efficient storage in memory.
> You might read Parquet files from disk, and then materialize them in memory
> in Arrow's format.
>
> Both formats are designed around the idiosyncrasies of the target medium:
> Parquet is not designed to support efficient random access because disks
> aren't good at that, but Arrow has fast random access as a core design
> principle, to give just one example.
>
> Henry
>
>> On Feb 25, 2016, at 8:10 AM, Sourav Mazumder <sourav.mazumde...@gmail.com>
>> wrote:
>>
>> Hi All,
>>
>> New to this, and still trying to figure out where exactly Arrow fits
>> in the ecosystem of various Big Data technologies.
>>
>> In that respect, the first thing that came to my mind is how Arrow
>> compares with Parquet.
>>
>> In my understanding Parquet also supports a very efficient columnar
>> format (with support for nested structures). It is already embraced
>> (supported) by various technologies like Impala (its origin), Spark, Drill, etc.
>>
>> The only thing I see missing in Parquet is support for SIMD-based
>> vectorized operations.
>>
>> Am I right, or am I missing many other differences between Arrow and
>> Parquet?
>>
>> Regards,
>> Sourav
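[Editor's note: for readers who want to see the Parquet-on-disk / Arrow-in-memory relationship the thread describes, here is a minimal sketch using today's pyarrow package, which did not yet exist when this thread was written. The file name and column names are purely illustrative, not anything from the thread.]

```python
# A sketch of the round trip described above: Parquet as the on-disk
# storage format, Arrow as the in-memory container for the same data.
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Build an Arrow table in memory and serialize it to a Parquet file.
table = pa.table({"id": [1, 2, 3, 4], "value": [10.0, 20.5, 31.0, 42.25]})
pq.write_table(table, "example.parquet")

# Deserialize the Parquet file back into an Arrow table in memory.
loaded = pq.read_table("example.parquet")

# Arrow gives cheap random access to in-memory values...
third_value = loaded.column("value")[2].as_py()   # -> 31.0

# ...and vectorized (SIMD-friendly) kernels over whole columns.
doubled = pc.multiply(loaded.column("value"), 2)
print(third_value, doubled)
```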