The Crunch-specific argument in favor of it, at least from my perspective,
is that a) Crunch is the most Avro-friendly pipeline library around (almost
to a fault, you might argue), and b) a Hive/HCat dependency in Avro itself
would make even less sense, and if you're using Hive SerDe/HQL/UDFs, you're
already bought into Hive's data model anyway, so an HCat -> Avro translator
wouldn't make much sense.

Does the conversion buy me anything from a MapReduce perspective? One of my
favorite aspects of Avro has always been the super-fast serialization of
data during the shuffle of MR jobs, even compared to Thrift/Protobuf, because
the schema lets you skip so much field-id overhead. But since HCat has the
same schema benefits as Avro, the performance argument doesn't seem as
obviously compelling to me at first blush.
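
To make the id-overhead point concrete, here is a minimal sketch in plain
Java. It is not the real Avro, Thrift, or Protobuf wire format: the one-byte
field ids, the fixed-width ints, and the class and method names are all
assumptions made for illustration. It only shows why writing values in an
agreed schema order produces fewer bytes than tagging every value.

```java
import java.nio.ByteBuffer;

public class IdOverheadSketch {

    // Tagged style (hypothetical): each value is preceded by a one-byte field id,
    // so the reader can identify fields without knowing the schema up front.
    static byte[] encodeTagged(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * (1 + Integer.BYTES));
        for (int i = 0; i < values.length; i++) {
            buf.put((byte) (i + 1));   // made-up field id
            buf.putInt(values[i]);     // field value
        }
        return buf.array();
    }

    // Schema style (hypothetical): writer and reader share the field order,
    // so only the values themselves are written.
    static byte[] encodeWithSchema(int[] values) {
        ByteBuffer buf = ByteBuffer.allocate(values.length * Integer.BYTES);
        for (int value : values) {
            buf.putInt(value);
        }
        return buf.array();
    }

    public static void main(String[] args) {
        int[] record = {7, 42, 1000};
        System.out.println("tagged: " + encodeTagged(record).length + " bytes");
        System.out.println("schema: " + encodeWithSchema(record).length + " bytes");
    }
}
```

For three int fields the tagged form spends one extra byte per field. Any
format that resolves fields through a shared schema avoids that per-field
cost, which is the sense in which HCat and Avro are on equal footing here.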

Josh

P.S. This is exactly the place to discuss this sort of thing!

On Tue, Mar 13, 2018 at 12:10 PM, Stephen Durfey <sjdur...@gmail.com> wrote:

> I wasn't sure if this was the place to discuss this or on a JIRA. To
> follow up on the work for the HCatSource, I wrote some code to convert an
> HCatRecord into a specific Avro model. This way you can read from HCat but
> still deal with Avro models instead of HCatRecord. It really isn't the
> ideal path, as HCat already does SerDe operations when it does schema
> resolution, so it's a rather inefficient path. At least with this
> code it's all in-memory pointer moving, rather than additional byte
> serialization.
>
> HCatalog could be used to just find the desired partitions, and then a
> regular pipeline could be used to read in the Avro models. However, you
> would then need to know where the data is located and its file format, and
> not having to know that is one of the biggest benefits of HCatalog.
>
> I'm not really sure Crunch is the right place for this conversion code to
> live long term, as there isn't anything about it that is Crunch-specific.
> Hive seems like the better place for it to live. However, I'm not sure
> when they would get around to committing it, and it would likely land in
> a version beyond what we support today in Crunch (Hive 2.1). So, maybe it
> lives in Crunch short term until it is accepted by Hive?
>
> If this code is valuable, I can open a JIRA and we can go from there.
>
> Code:
> https://github.com/sjdurfey/crunch/commit/f659dfe06f50862b9f674c1e2dd699a5c53b2b1f
>
> The code to look at would be the HCatDecoder, HCatParser, and the various
> HCatParse contexts.
>
