[ https://issues.apache.org/jira/browse/HADOOP-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12709071#action_12709071 ]
Aaron Kimball commented on HADOOP-5815:
---------------------------------------

Hi Noble,

I've read through your document there and the related JIRA item in Solr. I'm a bit confused as to how it is applicable here -- maybe you could explain further.

As I understand it, the DataImportHandler is designed to ingest data from various sources in a manner that is user-configured on a per-table basis, and to incorporate that data into indices that are then readable from the rest of the Solr system. (Disclaimer: I have very little understanding of Solr's goals and features. As I understand it, it's a search-engine front-end.)

Sqoop's goal (already met by this implementation) is to do ad-hoc loading of database tables into HDFS by performing a straightforward translation of rows to text while physically moving the bits from the database into flat files in HDFS. HDFS does not naturally include any indexing or other higher-level structures over a data set.

Can you please explain further where you see integration points between these two tools? Thanks!

> Sqoop: A database import tool for Hadoop
> ----------------------------------------
>
>                 Key: HADOOP-5815
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5815
>             Project: Hadoop Core
>          Issue Type: New Feature
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: HADOOP-5815.patch
>
>
> Overview:
>
> Sqoop is a tool designed to help users import existing relational databases into their Hadoop clusters. Sqoop uses JDBC to connect to a database, examine the schema for tables, and auto-generate the necessary classes to import data into HDFS. It then instantiates a MapReduce job to read the table from the database via the DBInputFormat (JDBC-based InputFormat). The table is read into a set of files loaded into HDFS. Both SequenceFile and text-based targets are supported.
>
> Longer term, Sqoop will support automatic connectivity to Hive, with the ability to load data files directly into the Hive warehouse directory, and also to inject the appropriate table definition into the metastore.
>
> Some more specifics:
>
> Sqoop is a program implemented as a contrib module. Its frontend is invoked through "bin/hadoop jar sqoop.jar ..." and allows you to connect to arbitrary JDBC databases and extract their tables into files in HDFS. The underlying implementation utilizes the JDBC interface of HADOOP-2536 (DBInputFormat). The DBWritable implementation needed to extract a table is generated by this tool, based on the types of the columns seen in the table. Sqoop uses JDBC to examine the table specification and translate this to the appropriate Java types.
>
> The generated classes are provided as .java files for the user to reuse. They are also compiled into a jar and used to run a MapReduce task to perform the data import. This results in either text files or SequenceFiles in HDFS. In the latter case, these Java classes are embedded into the SequenceFiles as well.
>
> The program will extract a specific table from a database, or optionally, all tables. For a table, it can read all columns, or just a subset. Since HADOOP-2536 requires that a sorting key be specified for the import task, Sqoop will auto-detect the presence of a primary key on a table and automatically use it as the sort order; the user can also manually specify a sorting column.
> Example invocations:
>
> To import an entire database:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --all-tables
>
> (Requires that all tables have primary keys.)
>
> To select a single table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees
>
> To select a subset of columns from a table:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --columns "employee_id,first_name,last_name,salary,start_date"
>
> To explicitly set the sort column, import format, and import destination (the table will go to /shared/imported_databases/employees):
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:mysql://db.example.com/company --table employees --order-by employee_id --warehouse-dir /shared/imported_databases --as-sequencefile
>
> Sqoop will automatically select the correct JDBC driver class name for HSQLDB and MySQL; this can also be explicitly set, e.g.:
>
>   hadoop jar sqoop.jar org.apache.hadoop.sqoop.Sqoop --connect jdbc:postgresql://db.example.com/company --driver org.postgresql.Driver --all-tables
>
> Testing has been conducted with HSQLDB and MySQL. A set of unit tests covers a great deal of Sqoop's functionality, and this tool has been used in practice at Cloudera and with a few other early test users on "real" databases.
>
> A readme file is included in the patch which contains documentation on how to use the tool.
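To make the code-generation step described above more concrete, here is a rough sketch of the kind of class Sqoop might emit for the "employees" example. The class shape follows the DBWritable and Writable interfaces used by HADOOP-2536, but the class name, field list, and method bodies are illustrative assumptions rather than code from the attached patch:

// Hypothetical sketch of a generated record class; the column set is an
// assumption chosen to match the --columns example above, and the real
// generated code ships with the patch.
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.lib.db.DBWritable;

public class employees implements DBWritable, Writable {
  private int employee_id;
  private String first_name;
  private double salary;

  // DBWritable: populate the fields from one row of the JDBC result set.
  public void readFields(ResultSet rs) throws SQLException {
    this.employee_id = rs.getInt("employee_id");
    this.first_name = rs.getString("first_name");
    this.salary = rs.getDouble("salary");
  }

  // DBWritable: bind the fields to a statement (required by the interface,
  // even though an import job only reads).
  public void write(PreparedStatement ps) throws SQLException {
    ps.setInt(1, this.employee_id);
    ps.setString(2, this.first_name);
    ps.setDouble(3, this.salary);
  }

  // Writable: lets the record be stored directly in a SequenceFile
  // (the --as-sequencefile target).
  public void write(DataOutput out) throws IOException {
    out.writeInt(this.employee_id);
    Text.writeString(out, this.first_name);
    out.writeDouble(this.salary);
  }

  public void readFields(DataInput in) throws IOException {
    this.employee_id = in.readInt();
    this.first_name = Text.readString(in);
    this.salary = in.readDouble();
  }

  // toString() gives the line format used for the text-file target.
  public String toString() {
    return employee_id + "," + first_name + "," + salary;
  }
}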
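Similarly, a minimal sketch of how such a generated class could be handed to the HADOOP-2536 DBInputFormat to drive the import job. This is an assumed outline, not the actual Sqoop job setup; it uses only the public DBConfiguration/DBInputFormat API and the sketched employees class above:

// Assumed wiring of the import MapReduce job against the HADOOP-2536 API in
// org.apache.hadoop.mapred.lib.db; the real tool derives all of these values
// from its command-line flags.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.db.DBConfiguration;
import org.apache.hadoop.mapred.lib.db.DBInputFormat;

public class ImportJobSketch {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ImportJobSketch.class);

    // --connect (and, when needed, --driver): JDBC connection settings.
    DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
        "jdbc:mysql://db.example.com/company");

    // --table, --columns, and the sort key (--order-by, or the detected
    // primary key). setInput() also registers DBInputFormat as the job's
    // InputFormat.
    DBInputFormat.setInput(job, employees.class,
        "employees",            // table name
        null,                   // WHERE conditions (none)
        "employee_id",          // ORDER BY / sort column
        "employee_id", "first_name", "salary");

    // --warehouse-dir: each table is written to its own subdirectory.
    FileOutputFormat.setOutputPath(job,
        new Path("/shared/imported_databases/employees"));

    // Mapper, output key/value classes, and the SequenceFile vs. text choice
    // are omitted from this sketch.
    JobClient.runJob(job);
  }
}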