Markus Kemper created SQOOP-2874:
------------------------------------

             Summary: Highlight Sqoop import with --as-parquetfile use cases 
(Dataset name <NAME> is not alphanumeric (plus '_'))
                 Key: SQOOP-2874
                 URL: https://issues.apache.org/jira/browse/SQOOP-2874
             Project: Sqoop
          Issue Type: Improvement
          Components: docs
            Reporter: Markus Kemper


Hello Sqoop Community,

Would it be possible to request some documentation enhancements?

The ask here is to raise awareness and improve the user experience for a few 
specific use cases [1] in which Sqoop restricts the characters allowed in some 
option values when using import with --as-parquetfile.  

My understanding is Sqoop1 currently relies on Kite Datasets to write Parquet 
files.  From the Kite documentation [3] we see that to ensure compatibility 
(with Hive, etc.), Kite imposes some restrictions on Names and Namespaces which 
bubble up in Sqoop.

The following Sqoop use cases, when using import with --as-parquetfile, result 
in the error [2] below.  Full test cases for each scenario are attached.  If 
enhancing the Sqoop documentation for these use cases is an option, I am happy 
to provide proposed changes; just let me know.

[1] Use Cases:
1. sqoop import --as-parquetfile + --target-dir /<path>/<rdbms_database>.<table>
1.1. The '.' is not allowed
2. sqoop import --as-parquetfile + --table <rdbms_database>.<table>  + (no 
--target-dir)
2.1. The '.' is not allowed, this is essentially the same as (1)
3. sqoop import --as-parquetfile + --hive-import --table 
<hive_database>.<table> 
3.1. The proper usage is --hive-database together with --hive-table; with 
--as-textfile, however, --hive-table also works with <hive_database>.<table>
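
For use case (3), a wrapper script could split a qualified name before calling 
Sqoop. The following is only a minimal sketch: the helper name hive_import_args 
is hypothetical and not part of Sqoop, and it assumes the name contains at most 
one '.' separating database and table. It only builds the argument list; it 
does not run Sqoop.

```python
from typing import List


def hive_import_args(qualified_name: str) -> List[str]:
    """Split '<hive_database>.<table>' into separate Sqoop Hive options.

    Hypothetical helper: returns arguments to append to a
    'sqoop import --as-parquetfile --hive-import ...' command line.
    """
    database, sep, table = qualified_name.partition(".")
    if not sep:
        # No '.' present: treat the whole string as the table name.
        return ["--hive-table", qualified_name]
    return ["--hive-database", database, "--hive-table", table]


if __name__ == "__main__":
    print(" ".join(hive_import_args("sales.orders")))
```

Passing the database and table separately keeps the '.' out of the name that 
Kite ends up validating.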

[2] Kite Error:
16/03/06 08:45:56 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
        at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
        at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:105)
        at org.kitesdk.data.spi.Compatibility.check(Compatibility.java:68)
        at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:209)
        at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
        at org.kitesdk.data.Datasets.create(Datasets.java:239)
        at org.kitesdk.data.Datasets.create(Datasets.java:307)
        at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:141)
        at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:119)
        at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:130)
        at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
        at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
        at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:444)
        at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
        at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
        at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
        at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
        at org.apache.sqoop.Sqoop.main(Sqoop.java:236)

[3] Kite Documentation:
http://kitesdk.org/docs/1.0.0/introduction-to-datasets.html
Names and Namespaces
URIs also define a name and namespace for your dataset. Kite uses these values 
when the underlying system has the same concept (for example, Hive). The name 
and namespace are typically the last two values in a URI. For example, if you 
create a dataset using the URI dataset:hive:fact_tables/ratings, Kite stores a 
Hive table ratings in the fact_tables Hive database. If you create a dataset 
using the URI dataset:hdfs:/user/cloudera/fact_tables/ratings, Kite stores an 
HDFS dataset named ratings in the fact_tables namespace.  To ensure 
compatibility with Hive and other underlying systems, names and namespaces in 
URIs must be made of alphanumeric or underscore (_) characters and cannot start 
with a number.
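
To make the quoted rule concrete, here is a minimal sketch of a pre-flight 
check that could run before invoking Sqoop. It approximates the rule as worded 
in the Kite documentation above and is not Kite's actual validation code. The 
function names are hypothetical, "alphanumeric" is assumed to mean ASCII 
letters and digits, and the dataset name is assumed to be the last component 
of --target-dir.

```python
import re
from posixpath import basename

# Assumption: ASCII letters, digits and '_' only, not starting with a digit.
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_kite_compatible_name(name: str) -> bool:
    """Return True if the name satisfies the documented naming rule."""
    return _NAME_RE.fullmatch(name) is not None


def dataset_name_from_target_dir(target_dir: str) -> str:
    """Return the last path component, assumed to be used as the dataset name."""
    return basename(target_dir.rstrip("/"))


if __name__ == "__main__":
    for path in ("/user/cloudera/DATABASE.TABLE", "/user/cloudera/DATABASE_TABLE"):
        name = dataset_name_from_target_dir(path)
        print(path, "ok" if is_kite_compatible_name(name) else "rejected")
```

A check along these lines would flag the '.' in use cases (1) and (2) before 
the job is submitted, rather than surfacing as the ValidationException in [2].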

Thanks, Markus



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
