The problem I was trying to solve was not wanting to deal with HCatRecords
(which are basically GenericRecords) in the M/R code, particularly given all
the code we (my team/org) have written against the specific record models.
This way, HCat can be used for its benefits (abstracting away the on-disk
file format, data discoverability, a single schema for multiple tools to
plug into) without having to change existing code to work with HCatRecords.
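
To sketch the general idea (this is just an illustration, not the actual
conversion code, and it assumes the generated Avro class, here called
MyEvent, is on the classpath): once the data is in Avro's generic
representation, Avro's SpecificData can deep-copy it into the generated
class entirely in memory, roughly:

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.specific.SpecificData;

    // Rough sketch: copy a GenericRecord whose schema matches the generated
    // class MyEvent into an instance of MyEvent, without another round of
    // byte serialization. String/logical-type handling may need extra care
    // depending on how the classes were generated.
    public static MyEvent toSpecific(GenericRecord generic) {
      Schema schema = MyEvent.getClassSchema();
      Object copy = SpecificData.get().deepCopy(schema, (Object) generic);
      return (MyEvent) copy;
    }

The real code additionally has to get from the HCatRecord fields to that
Avro-shaped representation first, which is where the storage handler
differences below come in.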

Now, this does require the data on disk to be Avro, so it's a bit of an
abstraction leak to have to know that. It is also rather difficult to solve
generically, as the storage handlers return the data differently (e.g. the
HDFS handler returns Avro as-is, but the HBase handler returns maps of
structs). There is already a bit of a leak when dealing with HCatRecords,
though, since you need to know where the data is coming from in order to
know how to process it, so it's probably OK. My current solution only works
for Avro files on disk.

An example of it working with the existing HCatSource:
https://github.com/sjdurfey/crunch/blob/from-hcat-avro/crunch-hcatalog/src/main/java/org/apache/crunch/io/hcatalog/FromHCat.java#L162-L193
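
For a rough picture of what that looks like from the pipeline side (the
method name and signature below are only a hypothetical stand-in; the link
above has the actual entry point, and MyEvent is a placeholder for a
generated Avro class):

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.io.hcatalog.FromHCat;
    import org.apache.hadoop.conf.Configuration;

    public class ReadEventsFromHCat {
      public static void main(String[] args) {
        Pipeline pipeline =
            new MRPipeline(ReadEventsFromHCat.class, new Configuration());

        // Hypothetical stand-in for the real FromHCat entry point: read the
        // "events" table through HCatalog, but hand back the generated Avro
        // class instead of HCatRecords.
        PCollection<MyEvent> events =
            pipeline.read(FromHCat.table("default", "events", MyEvent.class));

        // Existing DoFns written against MyEvent can be reused unchanged here.
        pipeline.done();
      }
    }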

On Tue, Mar 13, 2018 at 11:43 PM, Josh Wills <josh.wi...@gmail.com> wrote:

> The Crunch-specific argument in favor of it, at least from my perspective,
> is that a) Crunch is the most Avro-friendly pipeline library around (almost
> to a fault, you might argue), and b) a Hive/HCat dependency in Avro itself
> would make even less sense, and if you're using Hive SerDe/HQL/UDF, then
> you're already bought into Hive's data model anyway and an HCat -> Avro
> translator wouldn't make much sense.
>
> Does the conversion buy me anything from a MapReduce perspective? One of my
> favorite aspects of Avro has always been the super-fast serialization of
> data during the shuffle of MR jobs, even compared to Thrift/Protobuf b/c
> the schema allows you to skip so much id overhead, but since HCat has the
> same schema benefits as Avro, the performance argument doesn't seem as
> obviously compelling to me at first blush.
>
> Josh
>
> P.S. This is exactly the place to discuss this sort of thing!
>
> On Tue, Mar 13, 2018 at 12:10 PM, Stephen Durfey <sjdur...@gmail.com>
> wrote:
>
> > I wasn't sure if this was the place to discuss this or on a JIRA. To
> > follow up on the work for the HCatSource, I wrote some code to convert an
> > HCatRecord into a specific Avro model. This way you can read from HCat,
> > but still deal with Avro models instead of HCatRecords. It really isn't
> > the ideal path, as HCat already does SerDe operations when it does schema
> > resolution, so it's a rather inefficient path. At least with this code
> > it's all in-memory pointer moving, rather than additional byte
> > serialization.
> >
> > HCatalog could be used to just find the desired partitions, and then a
> > regular pipeline could be used to read in the Avro models. However, you
> > would then need to know where the data is located and its file format,
> > which defeats one of the biggest benefits of HCatalog.
> >
> > I'm not really sure that Crunch is the right place for this conversion
> > code to live long term, as there isn't anything about it that is Crunch
> > specific. Hive seems like the better place for it to live. However, I'm
> > not sure when they would get around to committing it, and it would likely
> > land in a version beyond what we support today in Crunch (Hive 2.1). So,
> > maybe it lives in Crunch short term until it is accepted by Hive?
> >
> > If this code is valuable, I can open a JIRA and we can go from there.
> >
> > Code:
> > https://github.com/sjdurfey/crunch/commit/f659dfe06f50862b9f674c1e2dd699a5c53b2b1f
> >
> > The code to look at would be the HCatDecoder, HCatParser, and the various
> > HCatParse contexts.
> >
>
