Yeah, that makes a ton of sense; I don't know how much a non-generic,
Avro-only solution would be used outside of this specific context, where a
team has all of their data in Avro already. contrib/, perhaps?

J

On Tue, Mar 20, 2018 at 1:35 PM, Stephen Durfey <sjdur...@gmail.com> wrote:

> The problem I was trying to solve was not wanting to deal with HCatRecords
> (which are basically GenericRecords) in the M/R code. Particularly with all
> the code we (my team/org) have around today written against the specific
> record models. So, HCat can be used for its benefits (abstracting away on
> disk file format, data discoverability, a single schema for multiple tools
> to plug into), without having to change existing code to work with
> HCatRecords.
>
> Now, this does require the data on disk to be avro, so it's a bit of an
> abstraction leak knowing that. It is also rather difficult to solve
> generically, as the storage handlers return the data differently (e.g. hdfs
> handler returns avro as is, but hbase handler returns maps of structs). So,
> there is already a bit of a leak when dealing with HCatRecords since you
> need to know where the data is coming from in order to know how to process
> it. So, it's probably ok. My current solution only works for avro files on
> disk.
>
> An example of it working with the existing HCatSource:
> https://github.com/sjdurfey/crunch/blob/from-hcat-avro/crunch-hcatalog/src/main/java/org/apache/crunch/io/hcatalog/FromHCat.java#L162-L193
>
> On Tue, Mar 13, 2018 at 11:43 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>
> > The Crunch-specific argument in favor of it, at least from my perspective,
> > is that a) Crunch is the most Avro-friendly pipeline library around (almost
> > to a fault, you might argue), and b) a Hive/HCat dependency in Avro itself
> > would make even less sense, and if you're using Hive SerDe/HQL/UDF, then
> > you're already bought into Hive's data model anyway and an HCat -> Avro
> > translator wouldn't make much sense.
> >
> > Does the conversion buy me anything from a MapReduce perspective? One of my
> > favorite aspects of Avro has always been the super-fast serialization of
> > data during the shuffle of MR jobs, even compared to Thrift/Protobuf b/c
> > the schema allows you to skip so much id overhead, but since HCat has the
> > same schema benefits as Avro, the performance argument doesn't seem as
> > obviously compelling to me at first blush.
> >
> > Josh
> >
> > P.S. This is exactly the place to discuss this sort of thing!
> >
> > On Tue, Mar 13, 2018 at 12:10 PM, Stephen Durfey <sjdur...@gmail.com>
> > wrote:
> >
> > > I wasn't sure if this was the place to discuss this or on a JIRA. To
> > > follow up on the work for the HCatSource, I wrote some code to convert an
> > > HCatRecord into a specific avro model. This way you can read from HCat, but
> > > still deal with avro models instead of HCatRecord. It really isn't the
> > > ideal path as HCat already does SerDe operations when it does schema
> > > resolution. So, it's a rather inefficient path. At least with this
> > > code it's all in-memory pointer moving, rather than additional byte
> > > serialization.
> > >
> > > HCatalog could be used to just find the desired partitions, and then a
> > > regular pipeline can be used to read in the avro models. However, you then
> > > need to know where the data is located and its file format, and
> > > abstracting those away is one of the biggest benefits of HCatalog.
> > >
> > > I'm not really sure if crunch is the right place for this conversion code
> > > to live long term, as there isn't anything about it that is crunch
> > > specific. Hive seems like the better place for it to live. However, I'm not
> > > sure when they would get around to committing it, and it would likely be in
> > > a version that is beyond what we support today in Crunch (hive 2.1). So,
> > > maybe in crunch short term until it is accepted by hive?
> > >
> > > If this code is valuable, I can open a JIRA and we can go from there.
> > >
> > > Code:
> > > https://github.com/sjdurfey/crunch/commit/f659dfe06f50862b9f674c1e2dd699a5c53b2b1f
> > >
> > > The code to look at would be the HCatDecoder, HCatParser, and the various
> > > HCatParse contexts.
> > >
> >
>
