Markus Kemper created SQOOP-2874:
------------------------------------
Summary: Highlight Sqoop import with --as-parquetfile use cases
(Dataset name <NAME> is not alphanumeric (plus '_'))
Key: SQOOP-2874
URL: https://issues.apache.org/jira/browse/SQOOP-2874
Project: Sqoop
Issue Type: Improvement
Components: docs
Reporter: Markus Kemper
Hello Sqoop Community,
Would it be possible to request some documentation enhancements?
The ask here is to proactively raise awareness and improve the user
experience for a few specific use cases [1] where some Sqoop options accept
only a restricted set of characters when using import with --as-parquetfile.
My understanding is Sqoop1 currently relies on Kite Datasets to write Parquet
files. From the Kite documentation [3] we see that to ensure compatibility
(with Hive, etc.), Kite imposes some restrictions on Names and Namespaces which
bubble up in Sqoop.
The following Sqoop use cases, when using import with --as-parquetfile, result
in the error [2] below. Full test cases for each scenario are attached. If
enhancing the Sqoop documentation for these use cases is an option, I am happy
to provide proposed changes. Please let me know.
[1] Use Cases:
1. sqoop import --as-parquetfile + --target-dir /<path>/<rdbms_database>.<table>
1.1. The '.' is not allowed
2. sqoop import --as-parquetfile + --table <rdbms_database>.<table> + (no
--target-dir)
2.1. The '.' is not allowed, this is essentially the same as (1)
3. sqoop import --as-parquetfile + --hive-import --table
<hive_database>.<table>
3.1. The proper usage is --hive-database together with --hive-table; however,
with --as-textfile, --hive-table also works with <hive_database>.<table>
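As an illustration of use case 3, a wrapper script could split a qualified Hive name into the two separate options before calling Sqoop. This is only a minimal sketch: the helper name hive_target_args is hypothetical and not part of Sqoop, and it returns only the Hive-related options, not a complete sqoop import command line.

```python
def hive_target_args(qualified_name):
    """Split '<hive_database>.<table>' into separate Sqoop Hive options.

    Hypothetical helper for illustration; it only builds the
    --hive-database / --hive-table part of an argument list.
    """
    database, sep, table = qualified_name.partition(".")
    if not sep:
        # No database prefix was given, so only the table option is needed.
        return ["--hive-table", qualified_name]
    # Pass database and table separately so no '.' reaches the dataset name.
    return ["--hive-database", database, "--hive-table", table]
```

For example, hive_target_args("sales.orders") yields the options --hive-database sales --hive-table orders, which could then be appended to the rest of the import arguments.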
[2] Kite Error:
16/03/06 08:45:56 ERROR sqoop.Sqoop: Got exception running Sqoop: org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
org.kitesdk.data.ValidationException: Dataset name DATABASE.TABLE is not alphanumeric (plus '_')
    at org.kitesdk.data.ValidationException.check(ValidationException.java:55)
    at org.kitesdk.data.spi.Compatibility.checkDatasetName(Compatibility.java:105)
    at org.kitesdk.data.spi.Compatibility.check(Compatibility.java:68)
    at org.kitesdk.data.spi.filesystem.FileSystemMetadataProvider.create(FileSystemMetadataProvider.java:209)
    at org.kitesdk.data.spi.filesystem.FileSystemDatasetRepository.create(FileSystemDatasetRepository.java:137)
    at org.kitesdk.data.Datasets.create(Datasets.java:239)
    at org.kitesdk.data.Datasets.create(Datasets.java:307)
    at org.apache.sqoop.mapreduce.ParquetJob.createDataset(ParquetJob.java:141)
    at org.apache.sqoop.mapreduce.ParquetJob.configureImportJob(ParquetJob.java:119)
    at org.apache.sqoop.mapreduce.DataDrivenImportJob.configureMapper(DataDrivenImportJob.java:130)
    at org.apache.sqoop.mapreduce.ImportJobBase.runImport(ImportJobBase.java:260)
    at org.apache.sqoop.manager.SqlManager.importTable(SqlManager.java:673)
    at org.apache.sqoop.manager.OracleManager.importTable(OracleManager.java:444)
    at org.apache.sqoop.tool.ImportTool.importTable(ImportTool.java:497)
    at org.apache.sqoop.tool.ImportTool.run(ImportTool.java:605)
    at org.apache.sqoop.Sqoop.run(Sqoop.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.sqoop.Sqoop.runSqoop(Sqoop.java:179)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:218)
    at org.apache.sqoop.Sqoop.runTool(Sqoop.java:227)
    at org.apache.sqoop.Sqoop.main(Sqoop.java:236)
[3] Kite Documentation:
http://kitesdk.org/docs/1.0.0/introduction-to-datasets.html
Names and Namespaces
URIs also define a name and namespace for your dataset. Kite uses these values
when the underlying system has the same concept (for example, Hive). The name
and namespace are typically the last two values in a URI. For example, if you
create a dataset using the URI dataset:hive:fact_tables/ratings, Kite stores a
Hive table ratings in the fact_tables Hive database. If you create a dataset
using the URI dataset:hdfs:/user/cloudera/fact_tables/ratings, Kite stores an
HDFS dataset named ratings in the fact_tables namespace. To ensure
compatibility with Hive and other underlying systems, names and namespaces in
URIs must be made of alphanumeric or underscore (_) characters and cannot start
with a number.
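The naming rule quoted above can be approximated with a small pre-flight check. This is a sketch of the rule as the Kite documentation words it, not Kite's actual validation code: the function names are hypothetical, the pattern assumes ASCII letters and digits only, and taking the last path component of --target-dir as the dataset name is an assumption based on the "last two values in a URI" description.

```python
import posixpath
import re

# Assumed reading of the documented rule: letters, digits, or '_',
# and the first character must not be a digit (ASCII only).
_NAME_RE = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_kite_compatible(name):
    """Return True if name satisfies the documented naming rule."""
    return _NAME_RE.fullmatch(name) is not None


def target_dir_is_compatible(target_dir):
    """Check the last path component of a target directory.

    Assumption: the dataset name is derived from the final component.
    """
    name = posixpath.basename(target_dir.rstrip("/"))
    return is_kite_compatible(name)
```

Under these assumptions, a name such as DATABASE.TABLE or a target directory ending in <rdbms_database>.<table> fails the check because of the '.', while a name such as fact_tables passes.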
Thanks, Markus