Yeah, that makes a ton of sense; I don't know how much a non-generic, Avro-only solution would be used outside of this specific context, where a team has all of their data in Avro already. contrib/, perhaps?
J

On Tue, Mar 20, 2018 at 1:35 PM, Stephen Durfey <sjdur...@gmail.com> wrote:

> The problem I was trying to solve was not wanting to deal with HCatRecords
> (which are basically GenericRecords) in the M/R code, particularly with all
> the code we (my team/org) have around today written against the specific
> record models. So, HCat can be used for its benefits (abstracting away the
> on-disk file format, data discoverability, a single schema for multiple
> tools to plug into) without having to change existing code to work with
> HCatRecords.
>
> Now, this does require the data on disk to be Avro, so it's a bit of an
> abstraction leak knowing that. It is also rather difficult to solve
> generically, as the storage handlers return the data differently (e.g. the
> HDFS handler returns Avro as is, but the HBase handler returns maps of
> structs). So, there is already a bit of a leak when dealing with
> HCatRecords, since you need to know where the data is coming from in order
> to know how to process it. So, it's probably OK. My current solution only
> works for Avro files on disk.
>
> An example of it working with the existing HCatSource:
> https://github.com/sjdurfey/crunch/blob/from-hcat-avro/crunch-hcatalog/src/main/java/org/apache/crunch/io/hcatalog/FromHCat.java#L162-L193
>
> On Tue, Mar 13, 2018 at 11:43 PM, Josh Wills <josh.wi...@gmail.com> wrote:
>
> > The Crunch-specific argument in favor of it, at least from my
> > perspective, is that a) Crunch is the most Avro-friendly pipeline
> > library around (almost to a fault, you might argue), and b) a Hive/HCat
> > dependency in Avro itself would make even less sense, and if you're
> > using Hive SerDe/HQL/UDF, then you're already bought into Hive's data
> > model anyway and an HCat -> Avro translator wouldn't make much sense.
> >
> > Does the conversion buy me anything from a MapReduce perspective?
> > One of my favorite aspects of Avro has always been the super-fast
> > serialization of data during the shuffle of MR jobs, even compared to
> > Thrift/Protobuf, b/c the schema allows you to skip so much id overhead,
> > but since HCat has the same schema benefits as Avro, the performance
> > argument doesn't seem as obviously compelling to me at first blush.
> >
> > Josh
> >
> > P.S. This is exactly the place to discuss this sort of thing!
> >
> > On Tue, Mar 13, 2018 at 12:10 PM, Stephen Durfey <sjdur...@gmail.com>
> > wrote:
> >
> > > I wasn't sure if this was the place to discuss this or on a JIRA. To
> > > follow up on the work for the HCatSource, I wrote some code to convert
> > > an HCatRecord into a specific Avro model. This way you can read from
> > > HCat but still deal with Avro models instead of HCatRecord. It really
> > > isn't the ideal path, as HCat already does SerDe operations when it
> > > does schema resolution. So, it's a rather inefficient path. At least
> > > with this code it's all in-memory pointer moving, rather than
> > > additional byte serialization.
> > >
> > > HCatalog could be used to just find the desired partitions, and then a
> > > regular pipeline can be used to read in the Avro models. However, you
> > > then need to know where the data is located and its file format, and
> > > not having to know that is one of the biggest benefits of HCatalog.
> > >
> > > I'm not really sure if Crunch is the right place for this conversion
> > > code to live long term, as there isn't anything about it that is
> > > Crunch specific. Hive seems like the better place for it to live.
> > > However, I'm not sure when they would get around to committing it, and
> > > it would likely be in a version that is beyond what we support today
> > > in Crunch (Hive 2.1). So, maybe in Crunch short term until it is
> > > accepted by Hive?
> > >
> > > If this code is valuable, I can open a JIRA and we can go from there.
> > >
> > > Code:
> > > https://github.com/sjdurfey/crunch/commit/f659dfe06f50862b9f674c1e2dd699a5c53b2b1f
> > >
> > > The code to look at would be the HCatDecoder, HCatParser, and the
> > > various HCatParse contexts.