> On July 10, 2014, 8:22 a.m., Venkat Ranganathan wrote:
> > src/java/org/apache/sqoop/manager/MainframeManager.java, line 75
> > <https://reviews.apache.org/r/22516/diff/1/?file=608148#file608148line75>
> >
> >     Is import into HBase and Accumulo supported by this tool?  From the 
> > command help, it looks like the only supported target is HDFS text files.
> 
> Mariappan Asokan wrote:
>     Each record in a mainframe dataset is treated as a single field (or 
> column).  So, theoretically, HBase, Accumulo, and Hive are supported, but 
> with limited usability; that is why I did not add them to the documentation.  
> If you feel strongly that they should be documented, I can work on that in 
> the next version of the patch.
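> 
>     To illustrate the single-field treatment: the import mapper essentially 
> passes each record through as one text value, along these lines (a 
> simplified sketch, modeled on Sqoop's TextImportMapper):
> 
>         // Each mainframe record becomes a single-column text row; no
>         // attempt is made to parse it into separate fields.
>         public void map(LongWritable key, SqoopRecord val, Context context)
>             throws IOException, InterruptedException {
>           context.write(new Text(val.toString()), NullWritable.get());
>         }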
> 
> Venkat Ranganathan wrote:
>     I feel it would be good to say that we import only as text files and 
> leave further processing, such as loading into Hive/HBase, up to the user, 
> since the composition of the records and the processing they need differ, 
> and the schema can't be inferred.
> 
> Mariappan Asokan wrote:
>     I agree with you.  To avoid confusion, I plan to remove support for 
> parsing the input format, output format, Hive, HBase, HCatalog, and codegen 
> options.  This will keep the documentation in sync with the code.  What do 
> you think?
>
> 
> Venkat Ranganathan wrote:
>     Sorry for the delay.  I was wondering whether the mainframe connector 
> could just define connector-specific extra args instead of creating another 
> tool.  Please see NetezzaManager or DirectNetezzaManager as examples.  Maybe 
> you have to invent a new synthetic URI format, say jdbc:mfftp:<host 
> address>:<port>/dataset, and choose your connection manager when the 
> --connect option is given with that URI format.  That should simplify a 
> whole lot, in my opinion.  What do you think?
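> 
>     To sketch what I mean (I am writing the class and method names from 
> memory, so they may differ):
> 
>         // Hypothetical factory: claims the job when the synthetic mfftp
>         // scheme is used, and defers to other factories otherwise.
>         public class MainframeManagerFactory extends ManagerFactory {
>           @Override
>           public ConnManager accept(JobData data) {
>             String url = data.getSqoopOptions().getConnectString();
>             if (url != null && url.startsWith("jdbc:mfftp:")) {
>               return new MainframeManager(data.getSqoopOptions());
>             }
>             return null;  // not ours; let another factory handle it
>           }
>         }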
> 
> Mariappan Asokan wrote:
>     Thanks for your suggestions.  Sorry I did not get back sooner.  In Sqoop 
> 1.x, there is a strong assumption that the input source is always a database 
> table.  Because of this, the Sqoop import tool has many options that are 
> relevant only to a source database table.  A mainframe source is totally 
> different from a database table, so I think it is better to create a 
> separate tool for mainframe import rather than just a new connection 
> manager.  The mainframe import tool will not support many of the options 
> that the database import tool supports, and it will have its own options 
> that the database import tool does not.  At present, these are the host name 
> and the partitioned dataset name.  In the future, the mainframe import tool 
> may be enhanced with metadata-specific or connection-specific arguments 
> unique to the mainframe.  Creating a synthetic URI for a connection seems 
> somewhat artificial to me.
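> 
>     To give a concrete picture of those options, an import could look like 
> this (the tool and option names are tentative):
> 
>         sqoop import-mainframe --connect zoshost.example.com \
>             --dataset MYUSER.INPUT.PDS --target-dir /data/mainframe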
>     
>     Contrary to what I stated before, and considering possible future 
> enhancements, I think it is better to retain support for parsing the input 
> format, output format, Hive, HBase, HCatalog, and codegen options.  The 
> documentation will be enhanced in the future to reflect this support.
>
> 
> Venkat Ranganathan wrote:
>     Thanks for your thoughts on the suggestion.  As you correctly pointed 
> out, Sqoop 1.x has a JDBC model; that is why you had to implement a 
> ConnManager and provide pseudo values for column types, etc. (always 
> returning VARCHAR).  I understand there will be options that mainframe 
> import will not support (much as there are MySQL-specific, Netezza-specific, 
> or SQL Server-specific options).  I understand you want to have specific 
> metadata for mainframe import.  That may be tricky.  Connection-specific 
> arguments can be implemented the same way JDBC connection-specific arguments 
> are.
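> 
>     By pseudo values I mean something like the following (the column name 
> here is only for illustration):
> 
>         @Override
>         public Map<String, Integer> getColumnTypes(String tableName) {
>           // No real schema exists for a mainframe dataset, so report a
>           // single column that is always a VARCHAR.
>           Map<String, Integer> types = new HashMap<String, Integer>();
>           types.put("DEFAULT_COLUMN", Types.VARCHAR);
>           return types;
>         }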
>     
>     The reason for my suggestion was primarily to piggyback on the 
> implementation for imports into Hive/HBase in the future, once you have the 
> ability to provide specific metadata on the data.
>     You can definitely parse the various options, but you have to explicitly 
> check and exit if any unsupported options are used.
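> 
>     For example, something along these lines in the tool's option 
> validation (a sketch; the exact accessor names may differ):
> 
>         // Fail fast when an option the mainframe tool cannot honor yet
>         // is given, instead of silently ignoring it.
>         if (options.getHBaseTable() != null) {
>           throw new InvalidOptionsException(
>               "--hbase-table is not supported for mainframe import"
>               + HELP_STR);
>         }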
>     
>     My only worry with this tool is that it may be a one-off for mainframe 
> imports alone.  We will be starting off with HDFS import only until you get 
> to the rest of the parts, and when we finally see this, it will basically 
> duplicate some of the code and may be difficult to maintain.
>

I just checked the possibility of adding non-JDBC imports as part of the import 
tool, using a fake connection URL as you suggested.
This is not feasible: ConnManager (which you need to inherit) has to implement 
getConnection, which returns a java.sql.Connection.  You can't return such a 
connection object for an FTP transfer.  The same goes for readTable, which 
must return a ResultSet.
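
For reference, the part of the contract that blocks this looks roughly like 
the following (paraphrased; the class name is only for illustration):

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // What every connection manager must provide; neither method has a
    // sensible implementation when the underlying transport is FTP.
    public abstract class ConnManagerContract {
      public abstract Connection getConnection() throws SQLException;
      public abstract ResultSet readTable(String tableName, String[] columns)
          throws SQLException;
    }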

I think a separate tool is the only way to go.


- Gwen


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/22516/#review47555
-----------------------------------------------------------


On June 14, 2014, 10:46 p.m., Mariappan Asokan wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/22516/
> -----------------------------------------------------------
> 
> (Updated June 14, 2014, 10:46 p.m.)
> 
> 
> Review request for Sqoop.
> 
> 
> Repository: sqoop-trunk
> 
> 
> Description
> -------
> 
> This is to move mainframe datasets to Hadoop.
> 
> 
> Diffs
> -----
> 
>   src/java/org/apache/sqoop/manager/MainframeManager.java PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetFTPRecordReader.java 
> PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetImportMapper.java 
> PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputFormat.java 
> PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetInputSplit.java 
> PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeDatasetRecordReader.java 
> PRE-CREATION 
>   src/java/org/apache/sqoop/mapreduce/MainframeImportJob.java PRE-CREATION 
>   src/java/org/apache/sqoop/tool/MainframeImportTool.java PRE-CREATION 
>   src/java/org/apache/sqoop/tool/SqoopTool.java dbe429a 
>   src/java/org/apache/sqoop/util/MainframeFTPClientUtils.java PRE-CREATION 
>   src/test/org/apache/sqoop/manager/TestMainframeManager.java PRE-CREATION 
>   
> src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetFTPRecordReader.java 
> PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputFormat.java 
> PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeDatasetInputSplit.java 
> PRE-CREATION 
>   src/test/org/apache/sqoop/mapreduce/TestMainframeImportJob.java 
> PRE-CREATION 
>   src/test/org/apache/sqoop/tool/TestMainframeImportTool.java PRE-CREATION 
>   src/test/org/apache/sqoop/util/TestMainframeFTPClientUtils.java 
> PRE-CREATION 
> 
> Diff: https://reviews.apache.org/r/22516/diff/
> 
> 
> Testing
> -------
> 
> 
> Thanks,
> 
> Mariappan Asokan
> 
>
