[ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Chansler updated HADOOP-5815:
------------------------------------
Description:
Overview:
Sqoop is a tool designed to help users import existing relational databases
into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine
the schema for tables, and auto-generate the necessary classes to import data
into HDFS. It then runs a MapReduce job that reads the table from the
database via DBInputFormat (a JDBC-based InputFormat) and writes it to HDFS as
a set of files. Both SequenceFile and text-based output targets are
supported.
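As a rough illustration of the underlying mechanism (not Sqoop's actual code),
a job wired to HADOOP-2536's DBInputFormat might be configured as follows; the
ImportJobSketch class, the Employees record class (sketched further below), and
the connection details are illustrative assumptions:

    // Sketch only: configure a MapReduce job to read a table via the
    // JDBC-based DBInputFormat from HADOOP-2536 (old "mapred" API).
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;

    public class ImportJobSketch {
      public static JobConf configure() {
        JobConf job = new JobConf(ImportJobSketch.class);
        job.setInputFormat(DBInputFormat.class);

        // Point the job at the database via its JDBC driver class and URL.
        DBConfiguration.configureDB(job,
            "com.mysql.jdbc.Driver",
            "jdbc:mysql://db.example.com/company");

        // Read selected columns of "employees", ordered by its primary key.
        DBInputFormat.setInput(job, Employees.class,
            "employees",                      // table name
            null,                             // WHERE conditions (none)
            "employee_id",                    // ORDER BY / sort column
            "employee_id", "first_name", "last_name", "salary", "start_date");
        return job;
      }
    }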
Longer term, Sqoop will support automatic connectivity to Hive, with the
ability to load data files directly into the Hive warehouse directory, and also
to inject the appropriate table definition into the metastore.
Some more specifics:
Sqoop is a program implemented as a contrib module. Its frontend is invoked
through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary
JDBC databases and extract their tables into files in HDFS. The underlying
implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The
DBWritable implementation needed to extract a table is generated by this tool,
based on the types of the columns seen in the table. Sqoop uses JDBC to examine
the table's schema and translate each column's SQL type to the appropriate Java
type.
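For example, a generated record class for a hypothetical "employees" table
might look roughly like the following (field names and types are illustrative,
and most columns are omitted for brevity):

    // Sketch of a generated record class: DBWritable for JDBC I/O and
    // Writable so instances can be stored in SequenceFiles.
    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class Employees implements DBWritable, Writable {
      private int employeeId;      // from SQL INTEGER
      private String firstName;    // from SQL VARCHAR

      // Populate fields from one JDBC result set row (used by DBInputFormat).
      public void readFields(ResultSet rs) throws SQLException {
        this.employeeId = rs.getInt("employee_id");
        this.firstName = rs.getString("first_name");
      }

      // Bind fields to a prepared statement for writes back to a database.
      public void write(PreparedStatement ps) throws SQLException {
        ps.setInt(1, this.employeeId);
        ps.setString(2, this.firstName);
      }

      // Writable methods allow the record to be serialized into SequenceFiles.
      public void readFields(DataInput in) throws IOException {
        this.employeeId = in.readInt();
        this.firstName = Text.readString(in);
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(this.employeeId);
        Text.writeString(out, this.firstName);
      }
    }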
The generated classes are provided as .java files for the user to reuse. They
are also compiled into a jar and used to run a MapReduce task to perform the
data import. This results in either text files or SequenceFiles in HDFS. In the
latter case, the generated Java classes are embedded in the SequenceFiles as
well.
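A sketch of reading the imported records back out of a SequenceFile, assuming
LongWritable keys, the Employees class from above, and an illustrative path:

    // Sketch only: iterate over rows imported into a SequenceFile.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class ReadImportSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path part = new Path("/shared/imported_databases/employees/part-00000");

        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        LongWritable key = new LongWritable();
        Employees value = new Employees();
        while (reader.next(key, value)) {
          // each iteration yields one imported row as an Employees record
        }
        reader.close();
      }
    }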
The program will extract a specific table from a database, or optionally, all
tables. For a table, it can read all columns, or just a subset. Since
HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop
will auto-detect the presence of a primary key on a table and automatically use
it as the sort order; the user can also manually specify a sorting column.
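The detection itself can be done through standard JDBC metadata; a minimal
sketch (the helper name is hypothetical, not Sqoop's actual method):

    // Sketch only: look up a table's primary key column via JDBC metadata.
    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class PrimaryKeyLookup {
      // Returns the first primary key column of the table, or null if none.
      public static String findPrimaryKey(Connection conn, String table)
          throws SQLException {
        DatabaseMetaData meta = conn.getMetaData();
        ResultSet rs = meta.getPrimaryKeys(null, null, table);
        try {
          return rs.next() ? rs.getString("COLUMN_NAME") : null;
        } finally {
          rs.close();
        }
      }
    }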
Example invocations:
To import an entire database:
    hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables
(Requires that all tables have primary keys)
To select a single table:
    hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees
To select a subset of columns from a table:
    hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"
To explicitly set the sort column, import format, and import destination (the
table will go to /shared/imported_databases/employees):
    hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile
Sqoop will automatically select the correct JDBC driver class name for HSQLDB
and MySQL; the driver class can also be set explicitly, e.g.:
    hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a
great deal of Sqoop's functionality, and the tool has been used in practice at
Cloudera and by a few other early test users on "real" databases.
A README file with documentation on how to use the tool is included in the
patch.
Release Note: New contribution Sqoop is a JDBC-based database import tool
for Hadoop. (was: Sqoop: A JDBC-based database import tool for Hadoop)
Editorial pass over all release notes prior to publication of 0.21.
> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
> Key: HADOOP-5815
> URL: https://issues.apache.org/jira/browse/HADOOP-5815
> Project: Hadoop Common
> Issue Type: New Feature
> Reporter: Aaron Kimball
> Assignee: Aaron Kimball
> Fix For: 0.21.0
>
> Attachments: HADOOP-5815.2.patch, HADOOP-5815.patch