On 04/08/2015 09:49 AM, Karthikeyan Muthukumar wrote:
Thanks Jacques and Alex.
I have been successfully using the Avro model to write Parquet files and
found it quite logical, because Avro is quite rich.
Are there any functional or performance impacts of using Avro-model-based
Parquet files, specifically w.r.t. accessing the generated Parquet files
through other tools like Drill, SparkSQL, etc.?
Thanks & Regards
MK

Hi MK,

If Avro is the data model you're interested in using in your application, then parquet-avro is a good choice.
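For reference, here's a minimal sketch of writing Avro GenericRecords to Parquet with AvroParquetWriter's builder (this assumes a recent parquet-avro release with the org.apache.parquet coordinates; the User schema and file name are made up for illustration):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    public class WriteExample {
      public static void main(String[] args) throws Exception {
        // Hypothetical two-field record schema, just for illustration
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");

        try (ParquetWriter<GenericRecord> writer =
            AvroParquetWriter.<GenericRecord>builder(new Path("users.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
          GenericRecord user = new GenericData.Record(schema);
          user.put("name", "mk");
          user.put("age", 42);
          writer.write(user);  // parquet-avro converts the record to Parquet's format
        }
      }
    }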

For an application, it is perfectly reasonable to use Avro objects. There are a few reasons you might:
1. You have existing code based on the Avro format and object model
2. You want to use Avro-generated classes (avro-specific)
3. You want to use your own Java classes via reflection (avro-reflect; see the sketch after this list)
4. You want compatibility with both storage formats (Avro and Parquet)
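To illustrate point 3, here's a rough sketch of avro-reflect with a plain Java class; the Event class and file name are hypothetical, and passing ReflectData as the data model is what tells parquet-avro to map fields reflectively:

    import org.apache.avro.Schema;
    import org.apache.avro.reflect.ReflectData;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ReflectExample {
      // Plain Java class, no Avro code generation involved
      public static class Event {
        public long timestamp;
        public String message;
        public Event() {}  // reflection needs a no-arg constructor
        public Event(long ts, String msg) { this.timestamp = ts; this.message = msg; }
      }

      public static void main(String[] args) throws Exception {
        // Derive the Avro schema from the class via reflection
        Schema schema = ReflectData.get().getSchema(Event.class);

        try (ParquetWriter<Event> writer =
            AvroParquetWriter.<Event>builder(new Path("events.parquet"))
                .withSchema(schema)
                .withDataModel(ReflectData.get())  // reflect instead of generic/specific
                .build()) {
          writer.write(new Event(System.currentTimeMillis(), "hello"));
        }
      }
    }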

Similarly, you could use parquet-thrift if you preferred using Thrift objects or had existing Thrift code. (Or Scrooge, or Protobuf, etc.)

The only reason you would want to build your own object model is if you would otherwise be doing a translation step later. For example, Hive can translate Avro objects to the form it expects, but instead we implemented a Hive object model to go directly from Parquet to Hive's representation. That's faster and doesn't require copying the data. This is why Drill, SparkSQL, Hive, and others have their own data models.
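As a concrete example of that cross-tool story, a sketch of reading the Avro-written file from the first example through Spark SQL (this uses the later SparkSession API rather than the SQLContext of this era; "users.parquet" is the hypothetical file from above). Spark reads the Parquet schema directly into its own Row representation, so the write-side Avro model doesn't matter here:

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;

    public class ReadWithSpark {
      public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .appName("read-avro-written-parquet")
            .master("local[*]")  // local mode, just for the example
            .getOrCreate();

        // Spark uses its own Parquet object model; no Avro on the read path
        Dataset<Row> users = spark.read().parquet("users.parquet");
        users.printSchema();
        users.show();

        spark.stop();
      }
    }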

rb

--
Ryan Blue
Software Engineer
Cloudera, Inc.
