[ https://issues.apache.org/jira/browse/SQOOP-931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641915#comment-13641915 ]

Venkat Ranganathan commented on SQOOP-931:
------------------------------------------

Thanks [~jarcec] for your review and comments.

I initially had the database separately, but the HCatalog team thought it made 
more sense to have them together (as they process tables in that format).  
But adding a database column should not be difficult if we need to make it more 
compliant.

It is true that schema inference is a great feature.  I thought of adding it in 
a follow-on JIRA along with some additional constructs, so that we still give 
the user storage-type independence if they so desire.  For example, they could 
pick whatever format they choose (and HCatalog supports) as the default, and 
pre-create tables when a specific output file type is desired.   I will 
create a sub-task and get it into this task itself.
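For example, a user wanting a specific file format could pre-create the table in Hive/HCatalog before running the import and let the import write into it; the database, table, and column names below are hypothetical:

```sql
-- Hypothetical pre-created HCatalog table with an explicit storage format.
-- The import would then target this table instead of creating one
-- with the default format.
CREATE TABLE demo_db.employees (
  id     INT,
  name   STRING,
  salary DOUBLE
)
STORED AS RCFILE;
```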

The main issue with not supporting hive-drop-delims is that string columns with 
embedded delimiter characters, when using the delimited text format, will have 
the same fidelity issues that current users face.  I considered that, but was 
not sure it was worth the extra processing for all output types.  I wanted the 
Sqoop code to be agnostic of the storage format (so it does not have to query 
the metadata for storage info), and users still have the option of using the 
current Hive import, which is well understood, to deal with that case if so 
desired.
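To illustrate the fidelity issue, here is a minimal Python sketch (not Sqoop's actual Java implementation) of what delimiter dropping amounts to: removing Hive's default field/record delimiter characters from string values before they are written out as delimited text, so an embedded newline cannot split one row into two:

```python
# Hive's default problem characters in delimited text:
# newline and carriage return (record delimiters) and \x01 (field delimiter).
HIVE_DELIMS = "\n\r\x01"

def drop_hive_delims(field: str) -> str:
    """Remove characters that would corrupt a Hive delimited-text row."""
    return "".join(c for c in field if c not in HIVE_DELIMS)

# A string column with an embedded newline would otherwise be read back
# as two truncated rows instead of one.
row = ["id-1", "a string with an\nembedded newline", "ok"]
print("\x01".join(drop_hive_delims(f) for f in row))
```

Without such scrubbing (or a binary storage format that does not rely on delimiters), the round trip is lossy, which is the trade-off described above.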

The direct option does not deal with the Sqoop record type, so we would have to 
come up with an HCat implementation for each connection manager based on its 
input/output row formats after parsing them.   For example, in the case of 
Netezza direct mode, the Sqoop ORM scheme is not involved, so we do not even 
generate the jar files.  I think the existing Hive import mechanism can be used 
where applicable (I am not sure it works with all connection managers, but 
since the output is text format, the existing Hive import support should help 
with that).  As you know, HCatalog uses the Hive metastore, so such tables are 
also available to HCatalog users.

Regarding running as part of the normal test suite, I totally understand.  I 
also did not want it to be a manual test.  If you look at the test utils, I use 
a mini-cluster-like MR job to load data into HCatalog and read it back.   
HCatalog does not have an HCatalogMiniCluster (for unit testing).  When I 
first tried to run everything in local mode (which is supported), the Hive 
tests failed (because we depend on the absence of some classes to distinguish 
between the external Hive CLI and in-process invocation).  That is why I had to 
exclude some of the Hive classes from the dependencies to make all unit tests 
run.  Let me see if there is a way to accommodate both use cases (by 
introducing additional test parameters to force external Hive CLI usage, which 
uses the mock Hive utils we have in the unit test framework) and still get 
HCatalog to run as part of the unit tests.

Thanks
Venkat
                
> Integrate HCatalog with Sqoop
> -----------------------------
>
>                 Key: SQOOP-931
>                 URL: https://issues.apache.org/jira/browse/SQOOP-931
>             Project: Sqoop
>          Issue Type: New Feature
>    Affects Versions: 1.4.2, 1.4.3
>         Environment: All 1.x sqoop version
>            Reporter: Venkat Ranganathan
>            Assignee: Venkat Ranganathan
>         Attachments: SQOOP-931.patch, SQOOP HCatalog Integration.pdf
>
>
>  Apache HCatalog is a table and storage management service that provides a 
> shared schema, data types, and a table abstraction, freeing users from being 
> concerned about where or how their data is stored.  It provides 
> interoperability across Pig, MapReduce, and Hive.
> A Sqoop HCatalog connector will help support storage formats that are 
> abstracted by HCatalog.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
