[
https://issues.apache.org/jira/browse/HCATALOG-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13437042#comment-13437042
]
Travis Crawford commented on HCATALOG-443:
------------------------------------------
Hey Sushanth -
Thanks for taking a look. We kept working on this branch, and most of our
changes now live there. I'm planning to extract the features and submit them
as separate JIRA issues. Basically, we built up a chain of dependent patches
and had to stay on the branch for a while.
https://github.com/traviscrawford/hcatalog/compare/HCATALOG-443_api_to_metadata_deserializer
Noteworthy features:
* Use "o.a.h.hive.ql.metadata" classes pretty much everywhere (instead of
"o.a.h.hive.metastore.api" classes). The ql.metadata versions add some
additional business logic, most importantly serde-reported columns.
* Handle a user-defined amount of bad records. Instead of relying on MR bad
record skipping (which causes lots of task failures), we simply skip the
record. The user has some knobs to define what % of their data can be skipped,
or none at all.
* Binary support. This is really a Hive issue, but we handle it anyway. Thrift
stores binary data as a HeapByteBuffer, and the binary object inspector thinks
it's a struct with four fields, one of which is the actual binary data. I've
been having some issues getting patches into Hive, so I worked around the issue
in HCat.
* Initialize SerDe with partition properties, instead of the table properties.
* Lots of other misc improvements.
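The bad-record handling in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from the branch: the class name BadRecordPolicy, its methods, and the minRecordsBeforeCheck parameter are all hypothetical, and the real knobs may be named and scoped differently.

```java
/**
 * Hypothetical helper (not part of HCatalog): tracks good and bad records and
 * fails the task once the bad-record fraction exceeds a user-defined limit.
 */
final class BadRecordPolicy {
    // Assumed knob: fraction of records that may be skipped; 0.0 means none at all.
    private final double maxBadFraction;
    // Assumed knob: do not apply the fraction check until this many records were seen.
    private final long minRecordsBeforeCheck;
    private long total;
    private long bad;

    BadRecordPolicy(double maxBadFraction, long minRecordsBeforeCheck) {
        if (maxBadFraction < 0.0 || maxBadFraction > 1.0) {
            throw new IllegalArgumentException("maxBadFraction must be in [0, 1]");
        }
        this.maxBadFraction = maxBadFraction;
        this.minRecordsBeforeCheck = minRecordsBeforeCheck;
    }

    /** Call once for every record that deserialized cleanly. */
    void recordGood() {
        total++;
    }

    /**
     * Call when a record fails to deserialize. Returns normally if the record
     * may be skipped; throws if the configured limit has been exceeded.
     */
    void recordBad(Exception cause) {
        total++;
        bad++;
        boolean noneAllowed = maxBadFraction == 0.0;
        boolean overLimit = total >= minRecordsBeforeCheck
                && (double) bad / total > maxBadFraction;
        if (noneAllowed || overLimit) {
            throw new IllegalStateException(
                    String.format("%d of %d records were bad", bad, total), cause);
        }
    }

    long badCount() {
        return bad;
    }
}
```

A record reader would call recordGood() after each successful deserialization and recordBad(e) from its catch block, skipping the record whenever recordBad returns normally. Keeping the decision inside the reader means a few bad rows do not surface as task failures.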
Stuff that can't go in trunk yet because of the Pig 0.8.0 dependency:
* Support for maps with schemas. We need to update the HCatalog dependency
to Pig >= 0.9.0 for this to work in unit tests, but the patch works at runtime
if you use a newer Pig with that support.
* Boolean support (by removing the checks) since this works in our environment.
> Serde-reported schema support, enums as strings, misc fixes
> -----------------------------------------------------------
>
> Key: HCATALOG-443
> URL: https://issues.apache.org/jira/browse/HCATALOG-443
> Project: HCatalog
> Issue Type: Bug
> Reporter: Travis Crawford
> Assignee: Travis Crawford
> Attachments: HCATALOG-443_api_to_metadata_deserializer.1.patch,
> HCATALOG-443_api_to_metadata_deserializer.2.patch,
> HCATALOG-443_api_to_metadata_deserializer.3.patch,
> HCATALOG-443_api_to_metadata_deserializer.4.patch
>
>
> This issue is related to HIVE-2950.
> When HCatalog queries the HiveMetaStore it gets back classes in the
> "org.apache.hadoop.hive.metastore.api" package. This represents exactly what
> is stored in the metastore database.
> Hive has companion classes in "org.apache.hadoop.hive.ql.metadata" that
> provide some logic on top of what's stored in the actual database. For
> example:
> * org.apache.hadoop.hive.metastore.api.Table.getCols shows columns explicitly
> stored in the database
> * org.apache.hadoop.hive.ql.metadata.Table.getCols shows columns reported by
> the serde if there are any.
> Except when serializing stuff into the job configuration, HCatalog should use
> the "metadata" version of these classes so that the additional logic is
> called.