I wasn't sure whether to discuss this here or in a JIRA. To follow up on
the work for the HCatSource, I wrote some code to convert an HCatRecord
into a specific Avro model. This way you can read from HCat but still work
with Avro models instead of HCatRecords. It isn't the ideal path, since
HCat already performs SerDe operations when it does schema resolution, so
this conversion is a somewhat inefficient extra step. At least with this
code it's all in-memory pointer moving rather than additional byte
serialization.
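
For illustration, here's a minimal sketch of what that conversion looks
like, assuming a flat schema of primitive fields. The class and method
names here are hypothetical, not the ones in the commit below, and the
real code also handles the full HCat type system:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hive.hcatalog.common.HCatException;
import org.apache.hive.hcatalog.data.HCatRecord;
import org.apache.hive.hcatalog.data.schema.HCatSchema;

// Hypothetical sketch: copy field references from an HCatRecord into an
// Avro record by name. No bytes are re-serialized; we only move in-memory
// object references. Nested/complex types are not handled here.
public final class HCatToAvro {

  public static GenericRecord toAvro(HCatRecord record,
                                     HCatSchema hcatSchema,
                                     Schema avroSchema) throws HCatException {
    GenericRecord avro = new GenericData.Record(avroSchema);
    for (Schema.Field field : avroSchema.getFields()) {
      // get(...) hands back the existing in-memory value for the field.
      avro.put(field.name(), record.get(field.name(), hcatSchema));
    }
    return avro;
  }
}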

Alternatively, HCatalog could be used just to find the desired partitions,
and then a regular pipeline could read the Avro models directly (see the
sketch below). However, you would then need to know where the data is
located and what file format it's in, and hiding that is one of the
biggest benefits of HCatalog.
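
A rough sketch of that alternative, assuming the table is actually
Avro-backed. The database/table names, the partition filter, and the
MyModel specific record class are all placeholders:

import java.util.ArrayList;
import java.util.List;

import org.apache.crunch.PCollection;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.io.From;
import org.apache.crunch.types.avro.Avros;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hive.hcatalog.api.HCatClient;
import org.apache.hive.hcatalog.api.HCatPartition;

// Sketch: use HCatalog only for partition discovery, then read the
// underlying Avro files with a regular Crunch pipeline. The trade-off is
// that the pipeline now hard-codes knowledge of the storage format.
public class PartitionDiscoveryExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    HCatClient client = HCatClient.create(conf);

    // Ask the metastore which partitions match the filter, and collect
    // their HDFS locations.
    List<Path> paths = new ArrayList<>();
    for (HCatPartition p :
        client.listPartitionsByFilter("my_db", "my_table", "dt='2017-01-01'")) {
      paths.add(new Path(p.getLocation()));
    }
    client.close();

    // Regular Crunch read over the partition directories; MyModel is a
    // placeholder for the Avro specific record class matching the table.
    Pipeline pipeline = new MRPipeline(PartitionDiscoveryExample.class, conf);
    PCollection<MyModel> records =
        pipeline.read(From.avroFile(paths, Avros.specifics(MyModel.class)));
    // ... process records ...
    pipeline.done();
  }
}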

I'm not really sure that Crunch is the right place for this conversion
code to live long term, as there isn't anything Crunch-specific about it;
Hive seems like the better home. However, I'm not sure when the Hive
project would get around to committing it, and it would likely land in a
version beyond what Crunch supports today (Hive 2.1). So maybe it lives in
Crunch short term until it is accepted by Hive?

If this code is valuable, I can open a JIRA and we can go from there.

Code:
https://github.com/sjdurfey/crunch/commit/f659dfe06f50862b9f674c1e2dd699a5c53b2b1f

The code to look at would be the HCatDecoder, HCatParser, and the various
HCatParse contexts.
