On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
Thanks Jacques and Alex.
I have been successfully using the Avro model to write Parquet files and
found it quite logical, because Avro is quite rich.
Are there any functional or performance impacts of using Avro-model-based
Parquet files, specifically w.r.t. accessing the generated Parquet files
through other tools like Drill, SparkSQL, etc.?
Thanks & Regards
MK

Hi MK,

If Avro is the data model you're interested in using in your application, then parquet-avro is a good choice.
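For reference, here's a minimal sketch of writing Avro GenericRecords to Parquet with AvroParquetWriter's builder (this assumes a recent parquet-avro release with the org.apache.parquet coordinates; the User schema and file name are made up for illustration):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class WriteExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical two-field record schema, just for illustration
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
          GenericRecord user = new GenericData.Record(schema);
          user.put("name", "mk");
          user.put("age", 42);
          writer.write(user);  // parquet-avro converts the record to Parquet's format
        }
      }
    }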

For an application, it is perfectly reasonable to use Avro objects. There are a few reasons you might:
1. You have existing code based on the Avro format and object model
2. You want to use Avro-generated classes (avro-specific)
3. You want to use your own Java classes via reflection (avro-reflect; see the sketch after this list)
4. You want compatibility with both storage formats (Avro and Parquet)
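To illustrate point 3, here's a rough sketch of avro-reflect with a plain Java class; the Event class and file name are hypothetical, and passing ReflectData as the data model is what tells parquet-avro to map fields reflectively:

    import org.apache.avro.Schema;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ReflectExample {
      // Plain Java class, no Avro code generation involved
      public static class Event {
        public long timestamp;
        public String message;
        public Event() {}  // reflection needs a no-arg constructor
        public Event(long ts, String msg) { this.timestamp = ts; this.message = msg; }
      }

      public static void main(String[] args) throws Exception {
        // Derive the Avro schema from the class via reflection
        Schema schema = ReflectData.get().getSchema(Event.class);

        try (ParquetWriter<Event> writer =
            AvroParquetWriter.<Event>builder(new Path("events.parquet"))
                .withSchema(schema)
                .withDataModel(ReflectData.get())  // reflect instead of generic/specific
                .build()) {
          writer.write(new Event(System.currentTimeMillis(), "hello"));
        }
      }
    }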

Similarly, you could use parquet-thrift if you preferred using Thrift objects or had existing Thrift code. (Or Scrooge, or Protobuf, etc.)

The only reason you would want to build your own object model is if you would otherwise be doing a translation step later. For example, Hive can translate Avro objects to the form it expects, but instead we implemented a Hive object model to go directly from Parquet to Hive's representation. That's faster and doesn't require copying the data. This is why Drill, SparkSQL, Hive, and others have their own data models.
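As a concrete example of that cross-tool story, a sketch of reading the Avro-written file from the first example through Spark SQL (this uses the later SparkSession API rather than the SQLContext of this era; "users.parquet" is the hypothetical file from above). Spark reads the Parquet schema directly into its own Row representation, so the write-side Avro model doesn't matter here:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadWithSpark {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-avro-written-parquet")
            .master("local[*]")  // local mode, just for the example
            .getOrCreate();

        // Spark uses its own Parquet object model; no Avro on the read path
        Dataset<Row> users = spark.read().parquet("users.parquet");
        users.printSchema();
        users.show();

        spark.stop();
      }
    }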

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
