[ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708779#action_12708779 ]
Noble Paul edited comment on HADOOP-5815 at 5/12/09 11:18 PM:
--------------------------------------------------------------

There is a tool called DataImportHandler, used to import data from RDBMSs, HTTP URLs, etc., which is used successfully in Solr. If necessary, we can reuse large parts of it. http://wiki.apache.org/solr/DataImportHandler

There is a plan to make it available as a library that can be used to import into any kind of document store (Solr/CouchDB/Hadoop, etc.).

> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
>                 Key: HADOOP-5815
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: HADOOP-5815.patch
>
>
> Overview:
>
> Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
>
> Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.
>
> Some more specifics:
>
> Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their tables into files in HDFS. The underlying implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The DBWritable implementation needed to extract a table is generated by this tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the table specification and translate it to the appropriate Java types.
>
> The generated classes are provided as .java files for the user to reuse. They are also compiled into a jar and used to run a MapReduce task to perform the data import. This results in either text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded into the SequenceFiles as well.
>
> The program will extract a specific table from a database or, optionally, all tables. For a table, it can read all columns or just a subset. Since HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop will auto-detect the presence of a primary key on a table and automatically use it as the sort order; the user can also manually specify a sorting column.
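A minimal sketch of the kind of class such a generator produces (hypothetical: the class name "Employees", its fields, and the column types are illustrative, not taken from the patch):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    // Hypothetical output of Sqoop's code generator for a table with
    // columns (employee_id INT, first_name VARCHAR).
    public class Employees implements Writable, DBWritable {
      private int employeeId;
      private String firstName;

      // DBWritable: populate fields from one row of a JDBC result set.
      public void readFields(ResultSet rs) throws SQLException {
        employeeId = rs.getInt(1);
        firstName = rs.getString(2);
      }

      // DBWritable: bind fields to a JDBC prepared statement.
      public void write(PreparedStatement stmt) throws SQLException {
        stmt.setInt(1, employeeId);
        stmt.setString(2, firstName);
      }

      // Writable: Hadoop serialization, used when records are written to
      // (or read back from) SequenceFiles.
      public void readFields(DataInput in) throws IOException {
        employeeId = in.readInt();
        firstName = Text.readString(in);
      }

      public void write(DataOutput out) throws IOException {
        out.writeInt(employeeId);
        Text.writeString(out, firstName);
      }

      public String toString() {
        return employeeId + "\t" + firstName;
      }
    }

Implementing both interfaces lets the same object be filled in from a JDBC ResultSet on the read side and then serialized into HDFS files through Hadoop's ordinary Writable machinery.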
> Example invocations:
>
> To import an entire database:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables
>
> (Requires that all tables have primary keys.)
>
> To select a single table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees
>
> To select a subset of columns from a table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"
>
> To explicitly set the sort column, import format, and import destination (the table will go to /shared/imported_databases/employees):
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile
>
> Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this can also be explicitly set, e.g.:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
>
> Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few other early test users on "real" databases.
>
> A readme file containing documentation on how to use the tool is included in the patch.
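For the --as-sequencefile case, imported records can be read back with the standard SequenceFile reader. A minimal sketch, reusing the hypothetical Employees class above and assuming the import stores DBInputFormat's LongWritable record ids as keys and the generated records as values (an assumption, not confirmed by the description):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;

    public class ReadImportedEmployees {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Follows the --warehouse-dir example above; the part file name is
        // whatever the import's MapReduce job produced.
        Path part = new Path("/shared/imported_databases/employees/part-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, part, conf);
        LongWritable key = new LongWritable();
        Employees value = new Employees(); // hypothetical generated class
        while (reader.next(key, value)) {
          // next() deserialized one record via Employees.readFields(DataInput).
          System.out.println(key.get() + ": " + value);
        }
        reader.close();
      }
    }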