[jira] Commented: (MAPREDUCE-885) More efficient SQL queries for DBInputFormat

Hadoop QA (JIRA) Thu, 27 Aug 2009 19:57:25 -0700

    [ https://issues.apache.org/jira/browse/MAPREDUCE-885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12748655#action_12748655 ]


Hadoop QA commented on MAPREDUCE-885:
-------------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12417849/MAPREDUCE-885.3.patch
  against trunk revision 808730.

    +1 @author.  The patch does not contain any @author tags.

    -1 tests included.  The patch doesn't appear to include any new or modified tests.
                        Please justify why no new tests are needed for this patch.
                        Also please list what manual steps were performed to verify this patch.

    -1 patch.  The patch command could not apply the patch.

Console output: 
http://hudson.zones.apache.org/hudson/job/Mapreduce-Patch-vesta.apache.org/532/console

This message is automatically generated.

> More efficient SQL queries for DBInputFormat
> --------------------------------------------
>
>                 Key: MAPREDUCE-885
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-885
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>            Reporter: Aaron Kimball
>            Assignee: Aaron Kimball
>         Attachments: MAPREDUCE-885.2.patch, MAPREDUCE-885.3.patch, MAPREDUCE-885.patch
>
>
> DBInputFormat generates InputSplits by counting the available rows in a 
> table, and selecting subsections of the table via the "LIMIT" and "OFFSET" 
> SQL keywords. These are only meaningful in an ordered context, so the query 
> also includes an "ORDER BY" clause on an index column. The resulting queries 
> are often inefficient and require full table scans. Actually using multiple 
> mappers with these queries can lead to O(n^2) behavior in the database, where 
> n is the number of splits. Attempting to use parallelism with these queries 
> is counter-productive.
> A better mechanism is to organize splits based on the data values themselves. 
> This can be done in the WHERE clause, which allows index range scans of 
> tables and better exploits parallelism in the database.
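
To make the contrast in the description concrete, here is a minimal sketch of the two query shapes. It is not taken from the attached patch: the class name SplitQuerySketch, its methods, the table "employees", and the column "id" are all hypothetical, and the sketch assumes a numeric, indexed split column whose minimum and maximum values are already known.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the DBInputFormat code or the patch.
public class SplitQuerySketch {

    // OFFSET-based split: the database has to order the rows and skip
    // `offset` of them before it can return this split's rows.
    static String offsetQuery(String table, String orderCol, long limit, long offset) {
        return "SELECT * FROM " + table + " ORDER BY " + orderCol
                + " LIMIT " + limit + " OFFSET " + offset;
    }

    // Value-based split: a half-open range [lo, hi) on the split column,
    // which an index on that column can serve as a range scan.
    static String rangeQuery(String table, String col, long lo, long hi) {
        return "SELECT * FROM " + table + " WHERE " + col + " >= " + lo
                + " AND " + col + " < " + hi;
    }

    // Cuts the value range [min, max] into at most numSplits contiguous,
    // non-overlapping ranges and returns one query per range.
    // Assumes numSplits >= 1 and min <= max.
    static List<String> rangeSplits(String table, String col,
                                    long min, long max, int numSplits) {
        List<String> queries = new ArrayList<>();
        long span = max - min + 1;
        long step = (span + numSplits - 1) / numSplits; // ceiling division
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);
            queries.add(rangeQuery(table, col, lo, hi));
        }
        return queries;
    }

    public static void main(String[] args) {
        // Last of four OFFSET-based splits over 1000 rows.
        System.out.println(offsetQuery("employees", "id", 250, 750));
        // Four value-based splits over ids 1..1000.
        for (String q : rangeSplits("employees", "id", 1, 1000, 4)) {
            System.out.println(q);
        }
    }
}
```

Value ranges of equal width produce balanced splits only when the column values are spread fairly evenly, so a real implementation would need to account for skew and for non-numeric columns.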

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
