[ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708779#action_12708779 ]
Noble Paul edited comment on HADOOP-5815 at 5/12/09 11:18 PM:
--------------------------------------------------------------

There is a tool called DataImportHandler, used to import data from RDBMSs, HTTP URLs, etc., which is used successfully in Solr. If necessary, we can reuse large parts of it. http://wiki.apache.org/solr/DataImportHandler

There is a plan to make it available as a library that can be used to import into any kind of document store (Solr/CouchDB/Hadoop, etc.).

> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
>                 Key: HADOOP-5815
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: HADOOP-5815.patch
>
>
> Overview:
>
> Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
>
> Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.
>
> Some more specifics:
>
> Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their tables into files in HDFS. The underlying implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The DBWritable implementation needed to extract a table is generated by this tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the table specification and translate it to the appropriate Java types.
>
> The generated classes are provided as .java files for the user to reuse. They are also compiled into a jar and used to run a MapReduce task to perform the data import. This results in either text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded into the SequenceFiles as well.
>
> The program will extract a specific table from a database or, optionally, all tables. For a table, it can read all columns or just a subset. Since HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop will auto-detect the presence of a primary key on a table and automatically use it as the sort order; the user can also manually specify a sorting column.
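A minimal sketch of the kind of class such a generator produces (hypothetical: the class name "Employees", its fields, and the column types are illustrative, not taken from the patch):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // Hypothetical output of Sqoop's code generator for a table with
    // columns (employee_id INT, first_name VARCHAR).
    public class Employees implements Writable, DBWritable {
      private int employeeId;
      private String firstName;

      // DBWritable: populate fields from one row of a JDBC result set.
      public void readFields(ResultSet rs) throws SQLException {
        employeeId = rs.getInt(1);
        firstName = rs.getString(2);
      }

      // DBWritable: bind fields to a JDBC prepared statement.
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, employeeId);
        stmt.setString(2, firstName);
      }

      // Writable: Hadoop serialization, used when records are written to
      // (or read back from) SequenceFiles.
      public void readFields(DataInput in) throws IOException {
        employeeId = in.readInt();
        firstName = Text.readString(in);
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(employeeId);
        Text.writeString(out, firstName);
      }

      public String toString() {
        return employeeId + "\t" + firstName;
      }
    }

Implementing both interfaces lets the same object be filled in from a JDBC ResultSet on the read side and then serialized into HDFS files through Hadoop's ordinary Writable machinery.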
> Example invocations:
>
> To import an entire database:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables
>
> (Requires that all tables have primary keys.)
>
> To select a single table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees
>
> To select a subset of columns from a table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"
>
> To explicitly set the sort column, import format, and import destination (the table will go to /shared/imported_databases/employees):
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile
>
> Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this can also be explicitly set, e.g.:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
>
> Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few other early test users on "real" databases.
>
> A readme file containing documentation on how to use the tool is included in the patch.
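For the --as-sequencefile case, imported records can be read back with the standard SequenceFile reader. A minimal sketch, reusing the hypothetical Employees class above and assuming the import stores DBInputFormat's LongWritable record ids as keys and the generated records as values (an assumption, not confirmed by the description):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class ReadImportedEmployees {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Follows the --warehouse-dir example above; the part file name is
        // whatever the import's MapReduce job produced.
        Path part = new Path("/shared/imported_databases/employees/part-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        LongWritable key = new LongWritable();
        Employees value = new Employees(); // hypothetical generated class
        while (reader.next(key, value)) {
          // next() deserialized one record via Employees.readFields(DataInput).
          System.out.println(key.get() + ": " + value);
        }
        reader.close();
      }
    }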