[
https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Aaron Kimball updated HADOOP-5815:
----------------------------------
Status: Patch Available (was: Open)
> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
> Key: HADOOP-5815
> URL: https://issues.apache.org/jira/browse/HADOOP-5815
> Project: Hadoop Core
> Issue Type: New Feature
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Attachments: HADOOP-5815.patch
>
>
> Overview:
> Sqoop is a tool designed to help users import data from existing relational
> databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database,
> examine the schema of its tables, and auto-generate the classes needed to
> import the data into HDFS. It then runs a MapReduce job that reads the table
> from the database via DBInputFormat (the JDBC-based InputFormat). The table is
> written out as a set of files in HDFS. Both SequenceFile and text-based
> targets are supported.
> Longer term, Sqoop will support automatic connectivity to Hive, with the
> ability to load data files directly into the Hive warehouse directory, and
> also to inject the appropriate table definition into the metastore.
> Some more specifics:
> Sqoop is a program implemented as a contrib module. Its frontend is invoked
> through "bin/hadoop jar sqoop.jar ..." and lets you connect to arbitrary
> JDBC databases and extract their tables into files in HDFS. The underlying
> implementation uses the JDBC interface of HADOOP-2536 (DBInputFormat).
> The DBWritable implementation needed to extract a table is generated by this
> tool, based on the types of the columns seen in the table. Sqoop uses JDBC to
> examine the table specification and translate it to the appropriate Java
> types.
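> As a rough illustration of the column-type translation described above, the
> sketch below maps JDBC type codes (java.sql.Types) to Java field types. The
> class and method names here are hypothetical, not Sqoop's actual code, and
> the mapping shown is a simplified subset.

```java
import java.sql.Types;

// Illustrative sketch only: map JDBC column types (java.sql.Types codes,
// as reported by the database's table metadata) to the Java types a
// generated DBWritable class might use for its fields.
public class TypeMapSketch {
    public static String javaTypeFor(int sqlType) {
        switch (sqlType) {
            case Types.INTEGER:   return "Integer";
            case Types.BIGINT:    return "Long";
            case Types.CHAR:
            case Types.VARCHAR:   return "String";
            case Types.FLOAT:
            case Types.DOUBLE:    return "Double";
            case Types.DATE:      return "java.sql.Date";
            case Types.TIMESTAMP: return "java.sql.Timestamp";
            default:              return "String"; // conservative fallback
        }
    }

    public static void main(String[] args) {
        System.out.println(javaTypeFor(Types.INTEGER)); // Integer
        System.out.println(javaTypeFor(Types.VARCHAR)); // String
    }
}
```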
> The generated classes are provided as .java files for the user to reuse. They
> are also compiled into a jar and used to run a MapReduce task to perform the
> data import. This results in either text files or SequenceFiles in HDFS. In
> the latter case, the generated Java classes are embedded into the
> SequenceFiles as well.
> The program will extract a specific table from a database, or optionally, all
> tables. For a table, it can read all columns, or just a subset. Since
> HADOOP-2536 requires that a sorting key be specified for the import task,
> Sqoop will auto-detect the presence of a primary key on a table and
> automatically use it as the sort order; the user can also manually specify a
> sorting column.
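> To illustrate why the sorting key matters: each map task reads one ordered,
> bounded slice of the table, so splits only line up correctly when every task
> orders by the same column. The sketch below builds such a per-split query;
> the query shape is a simplified, hypothetical stand-in, not the exact SQL
> that DBInputFormat emits, and real databases vary in LIMIT/OFFSET syntax.

```java
// Illustrative sketch: construct the bounded, ordered query one map task
// would run for its slice of the table. Names and SQL shape are
// hypothetical simplifications of the HADOOP-2536 approach.
public class SplitQuerySketch {
    public static String splitQuery(String table, String orderCol,
                                    long offset, long limit) {
        return "SELECT * FROM " + table
             + " ORDER BY " + orderCol
             + " LIMIT " + limit + " OFFSET " + offset;
    }

    public static void main(String[] args) {
        // Two map tasks, 1000 rows each, keyed on the primary key.
        System.out.println(splitQuery("employees", "employee_id", 0, 1000));
        System.out.println(splitQuery("employees", "employee_id", 1000, 1000));
    }
}
```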
> Example invocations:
> To import an entire database:
> hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect
> jdbc:mysql://db.example.com/company --all-tables
> (Requires that all tables have primary keys)
> To select a single table:
> hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect
> jdbc:mysql://db.example.com/company --table employees
> To select a subset of columns from a table:
> hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect
> jdbc:mysql://db.example.com/company --table employees --columns
> "employee_id,first_name,last_name,salary,start_date"
> To explicitly set the sort column, import format, and import destination (the
> table will go to /shared/imported_databases/employees):
> hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect
> jdbc:mysql://db.example.com/company --table employees --order-by employee_id
> --warehouse-dir /shared/imported_databases --as-sequencefile
> Sqoop will automatically select the correct JDBC driver class name for HSQLDB
> and MySQL; this can also be explicitly set, e.g.:
> hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect
> jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver
> --all-tables
> Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers
> a great deal of Sqoop's functionality, and this tool has been used in
> practice at Cloudera and with a few other early test users on "real"
> databases.
> A readme file is included in the patch which contains documentation on how to
> use the tool.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.