More efficient SQL queries for DBInputFormat
--------------------------------------------

                 Key: MAPREDUCE-885
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
             Project: Hadoop Map/Reduce
          Issue Type: Improvement
            Reporter: Aaron Kimball
            Assignee: Aaron Kimball
         Attachments: MAPREDUCE-885.patch

DBInputFormat generates InputSplits by counting the rows in a table and
selecting a subsection of the table for each split via the "LIMIT" and "OFFSET"
SQL keywords. These are only meaningful in an ordered context, so each query
also includes an "ORDER BY" clause on an index column. The resulting queries are
often inefficient and can require full table scans: to serve a query with a
given OFFSET, the database typically has to step past all of the preceding rows.
Running multiple mappers with these queries can therefore lead to O(n^2)
behavior in the database, where n is the number of splits, so attempting to use
parallelism with these queries is counter-productive.
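
To make the cost concrete, here is a minimal Java sketch of the LIMIT/OFFSET
style of split query described above. It is an illustration only, not the
actual DBInputFormat code: the class name, method names, and the "employees"
table with its "id" column are hypothetical. The rowsTouched method assumes
that the database walks past every row before the OFFSET in order to return
the rows of one split.

```java
import java.util.ArrayList;
import java.util.List;

class OffsetSplitSketch {
    // Builds one LIMIT/OFFSET query per split (hypothetical table/column names).
    static List<String> offsetQueries(String table, String orderBy,
                                      long rowCount, int numSplits) {
        List<String> queries = new ArrayList<>();
        long chunk = rowCount / numSplits;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            // The last split absorbs any remainder rows.
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            queries.add("SELECT * FROM " + table + " ORDER BY " + orderBy
                    + " LIMIT " + limit + " OFFSET " + offset);
        }
        return queries;
    }

    // Total rows stepped over or returned across all splits, under the
    // assumption that serving OFFSET k means walking past k ordered rows.
    static long rowsTouched(long rowCount, int numSplits) {
        long chunk = rowCount / numSplits;
        long total = 0;
        for (int i = 0; i < numSplits; i++) {
            long offset = i * chunk;
            long limit = (i == numSplits - 1) ? rowCount - offset : chunk;
            total += offset + limit;
        }
        return total;
    }

    public static void main(String[] args) {
        for (String q : offsetQueries("employees", "id", 1000, 4)) {
            System.out.println(q);
        }
        // 250 + 500 + 750 + 1000 = 2500 rows touched for a 1000-row table.
        System.out.println(rowsTouched(1000, 4));
    }
}
```

Under that assumption, the work grows with the sum 1 + 2 + ... + n of
split-sized chunks, which is where the quadratic behavior in the number of
splits comes from.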

A better mechanism is to organize splits based on the data values themselves.
Each split can then be expressed as a range condition in the WHERE clause, which
allows index range scans of the table and better exploits parallelism in the
database.
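
A value-based split could be sketched as follows. This is a hedged
illustration under stated assumptions, not the contents of the attached patch:
it assumes a numeric, indexed split column whose bounds were obtained
beforehand (for example with a SELECT MIN(id), MAX(id) query), and the class,
method, table, and column names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class RangeSplitSketch {
    // Builds one range query per split over the closed interval [min, max]
    // of a numeric column. Each query covers the half-open range [lo, hi).
    static List<String> rangeQueries(String table, String col,
                                     long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        // Ceiling division so that the ranges cover the whole interval.
        long step = (span + numSplits - 1) / numSplits;
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);
            queries.add("SELECT * FROM " + table
                    + " WHERE " + col + " >= " + lo
                    + " AND " + col + " < " + hi);
        }
        return queries;
    }

    public static void main(String[] args) {
        for (String q : rangeQueries("employees", "id", 1, 1000, 4)) {
            System.out.println(q);
        }
    }
}
```

Each query can be answered from the index on the split column without
stepping over the rows that belong to other splits. Note that equal-width value
ranges give evenly sized splits only when the values are spread fairly
uniformly, and that this sketch may return fewer than numSplits queries when
the value range is smaller than the requested number of splits.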
