[jira] [Created] (CASSANDRA-10835) CqlInputFormat creates too small splits for map Hadoop tasks

Artem Aliev (JIRA) Wed, 09 Dec 2015 09:19:26 -0800

Artem Aliev created CASSANDRA-10835:
---------------------------------------


             Summary: CqlInputFormat creates too small splits for map Hadoop tasks
                 Key: CASSANDRA-10835
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-10835
             Project: Cassandra
          Issue Type: Bug
            Reporter: Artem Aliev


In C* versions < 2.2, CqlInputFormat uses the number of rows to define the split size.
The default split size was 64K rows.
{code}
    private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
{code}

The doc:
{code}
 * You can also configure the number of rows per InputSplit with
 *   ConfigHelper.setInputSplitSize. The default split size is 64k rows.
{code}

The new split algorithm assumes that the split size is in bytes, so it creates 
very small Hadoop map tasks by default (or with old configs).
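
To give a rough sense of scale, the sketch below counts how many splits a table would produce if the default value of 64 * 1024 is read as bytes rather than rows. The 10 GiB table size, the class name, and the splitCount helper are assumptions made for illustration only; this is not the actual Cassandra split logic.

```java
public class SplitCountExample {
    // Same value as the DEFAULT_SPLIT_SIZE constant quoted above.
    static final long DEFAULT_SPLIT_SIZE = 64 * 1024;

    // Hypothetical helper: number of splits needed to cover totalBytes
    // when each split holds at most splitSizeBytes (ceiling division).
    static long splitCount(long totalBytes, long splitSizeBytes) {
        return (totalBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long tableBytes = 10L * 1024 * 1024 * 1024; // assumed 10 GiB table
        long sixteenMb = 16L * 1024 * 1024;         // the 16MB figure, read as 16 MiB

        // Default read as bytes: prints 163840
        System.out.println(splitCount(tableBytes, DEFAULT_SPLIT_SIZE));
        // 16 MiB splits: prints 640
        System.out.println(splitCount(tableBytes, sixteenMb));
    }
}
```

Under these assumptions, the old default yields 256 times as many map tasks as a 16 MiB split size would.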

There are two ways to fix it:
1. Update the doc and increase the default value to something like 16MB.
2. Make C* compatible with older versions.

I prefer the second option, as it will not surprise people who upgrade from old 
versions. I do not expect many new users to use Hadoop.
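
One possible shape for the second option, purely as a sketch: keep accepting the old row-based setting and convert it to bytes with an estimated mean row size. The class, the method name, and the 256-byte mean row size are hypothetical and do not come from the Cassandra code base; how such an estimate would be obtained is left open here.

```java
public class SplitSizeCompat {
    // Hypothetical conversion: translate a legacy split size given in rows
    // into a byte budget, using an estimated mean row size in bytes.
    static long legacyRowsToBytes(long splitSizeRows, long meanRowSizeBytes) {
        if (splitSizeRows <= 0 || meanRowSizeBytes <= 0) {
            throw new IllegalArgumentException("both arguments must be positive");
        }
        // multiplyExact throws ArithmeticException on overflow instead of wrapping.
        return Math.multiplyExact(splitSizeRows, meanRowSizeBytes);
    }

    public static void main(String[] args) {
        long legacyDefaultRows = 64 * 1024; // old default: 64K rows
        long assumedMeanRowSize = 256;      // assumed mean row size in bytes
        // Prints 16777216, i.e. 16 MiB per split under these assumptions.
        System.out.println(legacyRowsToBytes(legacyDefaultRows, assumedMeanRowSize));
    }
}
```

With these assumed numbers, an unchanged old config would map to roughly the 16MB split size mentioned in the first option.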







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
