[
https://issues.apache.org/jira/browse/SQOOP-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641915#comment-13641915
]
Venkat Ranganathan commented on SQOOP-931:
------------------------------------------
Thanks [~jarcec] for your review and comments.
I initially had the database specified separately, but the HCatalog team thought it made
more sense to have them together (as they process tables in that format). But adding a
separate database option should not be difficult if we need to make it more compliant.
It is true that schema inference is a great feature. I thought of adding it in a
follow-on JIRA with some additional constructs so that we still give the user
storage-type independence if they so desire: for example, letting them choose any
HCatalog-supported format as the default, and pre-creating tables when a specific
output file type is desired. I will create a sub-task and get it in as part of this
task itself.
The main issue with not supporting hive-drop-delims is that string columns with
embedded delimiter characters, combined with the delimited text format, will have the
same fidelity issues that current users have. I considered that, but was not sure it
was worth the extra processing for all output types. I wanted the Sqoop code to be
agnostic of the storage format (so it does not have to query the metadata for storage
information), and users still have the option of using the current Hive import, which
is well understood, to deal with that case if desired.
The direct option does not deal with the Sqoop record type, so we would have to come up
with an HCatalog implementation for each connection manager based on its input/output
row formats after parsing them. For example, in the case of Netezza direct mode, the
Sqoop ORM scheme is not involved, so we do not even generate the jar files. I think the
existing Hive import mechanism can be used where applicable (I am not sure it works
with all connection managers, but since the output is text format, the existing Hive
import support should help there). As you know, HCatalog uses the Hive metastore, so
such tables are also available to HCatalog users.
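As a purely hypothetical sketch of what such a per-connection-manager HCatalog path
might involve (the class and method names below are illustrative, not from the patch):
a direct-mode text row would have to be parsed according to that manager's output
format and mapped by position into an HCatRecord, since no generated Sqoop record class
is available:
{code:java}
import java.util.List;
import java.util.regex.Pattern;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.HCatRecord;
import org.apache.hcatalog.data.schema.HCatSchema;

// Hypothetical helper: converts one delimited text row produced by a direct-mode
// connector (e.g. Netezza external tables) into an HCatRecord, field by field.
public class DirectRowToHCatRecord {
    public static HCatRecord convert(String rawRow, char fieldDelim, HCatSchema schema)
            throws Exception {
        String[] fields = rawRow.split(Pattern.quote(String.valueOf(fieldDelim)), -1);
        List<String> names = schema.getFieldNames();
        DefaultHCatRecord record = new DefaultHCatRecord(names.size());
        for (int i = 0; i < names.size(); i++) {
            String value = i < fields.length ? fields[i] : null;
            // A real implementation would dispatch on every HCatFieldSchema type;
            // only INT and string-like fields are handled here for brevity.
            switch (schema.get(i).getType()) {
                case INT:
                    record.set(i, value == null ? null : Integer.valueOf(value));
                    break;
                default:
                    record.set(i, value);
            }
        }
        return record;
    }
}
{code}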
Regarding running as part of the normal test suite, I totally understand; I also did
not want it to be a manual test. If you look at the test utils, I use a mini-cluster
style MR job to do the loading into and reading from HCatalog, since HCatalog does not
provide an HCatalogMiniCluster for unit testing. When I first tried to run everything
in local mode (which is supported), the Hive tests failed, because we depend on the
absence of certain classes to distinguish between the external Hive CLI and in-process
invocation. That is why I had to exclude some of the Hive classes from the dependencies
to make all unit tests run. Let me see if there is a way to accommodate both use cases
(by introducing additional test parameters to force external Hive CLI usage with the
mock Hive utils we have in the unit test framework) and still get the HCatalog tests to
run as part of the unit tests.
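For context on that test-utility approach, the loading side can be expressed as an
ordinary map-only MR job whose output format is HCatOutputFormat, which is what lets it
run under a mini cluster without any dedicated HCatalog mini-cluster. The sketch below
assumes the old org.apache.hcatalog API and an assumed pre-created table
default.hcat_test with columns (id int, msg string); it is an illustration, not the
actual test code in the patch:
{code:java}
import java.io.IOException;
import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hcatalog.data.DefaultHCatRecord;
import org.apache.hcatalog.data.schema.HCatFieldSchema;
import org.apache.hcatalog.data.schema.HCatSchema;
import org.apache.hcatalog.mapreduce.HCatOutputFormat;
import org.apache.hcatalog.mapreduce.OutputJobInfo;

// Sketch of a loader job: reads "id,msg" text lines and writes HCatRecords
// into an existing HCatalog table.
public class HCatLoaderSketch {

  public static class LoadMapper
      extends Mapper<LongWritable, Text, NullWritable, DefaultHCatRecord> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] parts = value.toString().split(",", -1);
      DefaultHCatRecord rec = new DefaultHCatRecord(2);
      rec.set(0, Integer.valueOf(parts[0]));
      rec.set(1, parts[1]);
      ctx.write(NullWritable.get(), rec);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "hcat-loader-sketch");
    job.setJarByClass(HCatLoaderSketch.class);
    job.setMapperClass(LoadMapper.class);
    job.setNumReduceTasks(0);
    FileInputFormat.addInputPath(job, new Path(args[0]));

    // Target table and schema; no partition values in this sketch.
    HCatOutputFormat.setOutput(job, OutputJobInfo.create("default", "hcat_test", null));
    HCatSchema schema = new HCatSchema(Arrays.asList(
        new HCatFieldSchema("id", HCatFieldSchema.Type.INT, null),
        new HCatFieldSchema("msg", HCatFieldSchema.Type.STRING, null)));
    HCatOutputFormat.setSchema(job, schema);
    job.setOutputFormatClass(HCatOutputFormat.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(DefaultHCatRecord.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
{code}
The reading side of the test utility is the mirror image of this (HCatInputFormat
instead of HCatOutputFormat), which is why no HCatalog-specific mini-cluster is needed.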
Thanks
Venkat
> Integrate HCatalog with Sqoop
> -----------------------------
>
> Key: SQOOP-931
> URL: https://issues.apache.org/jira/browse/SQOOP-931
> Project: Sqoop
> Issue Type: New Feature
> Affects Versions: 1.4.2, 1.4.3
> Environment: All 1.x Sqoop versions
> Reporter: Venkat Ranganathan
> Assignee: Venkat Ranganathan
> Attachments: SQOOP-931.patch, SQOOP HCatalog Integration.pdf
>
>
> Apache HCatalog is a table and storage management service that provides a
> shared schema, data types and a table abstraction, freeing users from being
> concerned about where or how their data is stored. It provides
> interoperability across Pig, MapReduce, and Hive.
> A Sqoop HCatalog connector will help support storage formats that are
> abstracted by HCatalog.