[ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709071#action_12709071 ]
Aaron Kimball commented on HADOOP-5815:
---------------------------------------

Hi Noble,

I've read through your document there and the related JIRA item in Solr. I'm a bit confused as to how it is applicable here -- maybe you could explain further.

As I understand it, the DataImportHandler is designed to ingest data from various sources in a manner that is user-configured on a per-table basis, and to incorporate that data into indices that are then readable from the rest of the Solr system. (Disclaimer: I have very little understanding of Solr's goals and features. As I understand it, it's a search-engine front-end.)

Sqoop's goal (already met by this implementation) is to do ad-hoc loading of database tables into HDFS by performing a straightforward translation of rows to text while physically moving the bits from the database into flat files in HDFS. HDFS does not naturally include any indexing or other higher-level structures over a data set.

Can you please explain further where you see integration points between these two tools? Thanks!

> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
>                 Key: HADOOP-5815
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: HADOOP-5815.patch
>
>
> Overview:
>
> Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
>
> Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.
>
> Some more specifics:
>
> Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their tables into files in HDFS. The underlying implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The DBWritable implementation needed to extract a table is generated by this tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the table specification and translate this to the appropriate Java types.
>
> The generated classes are provided as .java files for the user to reuse. They are also compiled into a jar and used to run a MapReduce task to perform the data import. This results in either text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded into the SequenceFiles as well.
>
> The program will extract a specific table from a database, or optionally, all tables. For a table, it can read all columns, or just a subset. Since HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop will auto-detect the presence of a primary key on a table and automatically use it as the sort order; the user can also manually specify a sorting column.
> Example invocations:
>
> To import an entire database:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables
>
> (Requires that all tables have primary keys.)
>
> To select a single table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees
>
> To select a subset of columns from a table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"
>
> To explicitly set the sort column, import format, and import destination (the table will go to /shared/imported_databases/employees):
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile
>
> Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this can also be explicitly set, e.g.:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
>
> Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few other early test users on "real" databases.
>
> A readme file is included in the patch which contains documentation on how to use the tool.
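To make the code-generation step described above more concrete, here is a rough sketch of the kind of class Sqoop might emit for the "employees" example. The class shape follows the DBWritable and Writable interfaces used by HADOOP-2536, but the class name, field list, and method bodies are illustrative assumptions rather than code from the attached patch:

// Hypothetical sketch of a generated record class; the column set is an
// assumption chosen to match the --columns example above, and the real
// generated code ships with the patch.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class employees implements DBWritable, Writable {
  private int employee_id;
  private String first_name;
  private double salary;

  // DBWritable: populate the fields from one row of the JDBC result set.
  public void readFields(ResultSet rs) throws SQLException {
    this.employee_id = rs.getInt("employee_id");
    this.first_name = rs.getString("first_name");
    this.salary = rs.getDouble("salary");
  }

  // DBWritable: bind the fields to a statement (required by the interface,
  // even though an import job only reads).
  public void write(PreparedStatement ps) throws SQLException {
    ps.setInt(1, this.employee_id);
    ps.setString(2, this.first_name);
    ps.setDouble(3, this.salary);
  }

  // Writable: lets the record be stored directly in a SequenceFile
  // (the --as-sequencefile target).
  public void write(DataOutput out) throws IOException {
    out.writeInt(this.employee_id);
    Text.writeString(out, this.first_name);
    out.writeDouble(this.salary);
  }

  public void readFields(DataInput in) throws IOException {
    this.employee_id = in.readInt();
    this.first_name = Text.readString(in);
    this.salary = in.readDouble();
  }

  // toString() gives the line format used for the text-file target.
  public String toString() {
    return employee_id + "," + first_name + "," + salary;
  }
}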
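Similarly, a minimal sketch of how such a generated class could be handed to the HADOOP-2536 DBInputFormat to drive the import job. This is an assumed outline, not the actual Sqoop job setup; it uses only the public DBConfiguration/DBInputFormat API and the sketched employees class above:

// Assumed wiring of the import MapReduce job against the HADOOP-2536 API in
// org.apache.hadoop.mapred.lib.db; the real tool derives all of these values
// from its command-line flags.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

public class ImportJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ImportJobSketch.class);

    // --connect (and, when needed, --driver): JDBC connection settings.
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://db.example.com/company");

    // --table, --columns, and the sort key (--order-by, or the detected
    // primary key). setInput() also registers DBInputFormat as the job's
    // InputFormat.
    DBInputFormat.setInput(job, employees.class,
        "employees",            // table name
        null,                   // WHERE conditions (none)
        "employee_id",          // ORDER BY / sort column
        "employee_id", "first_name", "salary");

    // --warehouse-dir: each table is written to its own subdirectory.
    FileOutputFormat.setOutputPath(job,
        new Path("/shared/imported_databases/employees"));

    // Mapper, output key/value classes, and the SequenceFile vs. text choice
    // are omitted from this sketch.
    JobClient.runJob(job);
  }
}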