[
https://issues.apache.org/jira/browse/CASSANDRA-10835?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15049381#comment-15049381
]
Joshua McKenzie commented on CASSANDRA-10835:
---------------------------------------------
+1 to both: revert and, if so inclined, add a second option with its parameter in MB.
If we go with both, I think the new MB-based parameter should supersede the
row-based one but be disabled by default.
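
One way to read this proposal, as a minimal sketch only: the row-based setting keeps its pre-2.2 meaning, and an optional MB-based setting takes precedence when it is set. The class, enum, and method names below are hypothetical; they are not part of ConfigHelper or of the attached patch.

```java
import java.util.OptionalInt;

// Hypothetical illustration of the precedence rule; not the actual Cassandra API.
public class SplitSizeChoice {
    // Default taken from the issue description: 64K rows per split.
    public static final int DEFAULT_SPLIT_SIZE_ROWS = 64 * 1024;

    public enum Unit { ROWS, MEGABYTES }

    public static final class SplitSize {
        public final Unit unit;
        public final int value;

        SplitSize(Unit unit, int value) {
            this.unit = unit;
            this.value = value;
        }
    }

    // The MB-based option is "disabled by default" (empty) and supersedes
    // the row-based option whenever it is present.
    public static SplitSize resolve(OptionalInt rows, OptionalInt megabytes) {
        if (megabytes.isPresent()) {
            return new SplitSize(Unit.MEGABYTES, megabytes.getAsInt());
        }
        return new SplitSize(Unit.ROWS, rows.orElse(DEFAULT_SPLIT_SIZE_ROWS));
    }

    public static void main(String[] args) {
        // Nothing configured: falls back to the row-based default.
        SplitSize s = resolve(OptionalInt.empty(), OptionalInt.empty());
        System.out.println(s.unit + " " + s.value);
    }
}
```

With neither option set, the sketch falls back to the old row-based default, so existing configurations would keep their meaning after an upgrade.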
> CqlInputFormat creates too small splits for map Hadoop tasks
> -------------------------------------------------------------
>
> Key: CASSANDRA-10835
> URL: https://issues.apache.org/jira/browse/CASSANDRA-10835
> Project: Cassandra
> Issue Type: Bug
> Reporter: Artem Aliev
> Attachments: cassandra-3.0.1-10835.txt
>
>
> CqlInputFormat uses the number of rows to define the split size in C* versions < 2.2.
> The default split size was 64K rows.
> {code}
> private static final int DEFAULT_SPLIT_SIZE = 64 * 1024;
> {code}
> The doc:
> {code}
> * You can also configure the number of rows per InputSplit with
> * ConfigHelper.setInputSplitSize. The default split size is 64k rows.
> {code}
> The new split algorithm assumes that the split size is in bytes, so it creates very
> small Hadoop map tasks by default (or with old configs).
> There are two ways to fix it:
> 1. Update the doc and increase the default value to something like 16MB.
> 2. Make C* compatible with older versions.
> I like the second option, as it will not surprise people who upgrade from
> old versions. I do not expect many new users to use Hadoop.
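
To put rough numbers on the mismatch described above, here is a back-of-the-envelope sketch. The 1 GiB data size is an assumption chosen only for illustration, and the helper is hypothetical; the real split computation relies on Cassandra's size estimates, which are not modelled here.

```java
// Hypothetical arithmetic only; not how CqlInputFormat computes splits.
public class SplitCountEstimate {
    // Ceiling division: number of splits needed to cover totalBytes.
    public static long splitCount(long totalBytes, long splitSizeBytes) {
        return (totalBytes + splitSizeBytes - 1) / splitSizeBytes;
    }

    public static void main(String[] args) {
        long totalBytes = 1L << 30;            // assumed 1 GiB of data
        long oldDefaultAsBytes = 64L * 1024;   // DEFAULT_SPLIT_SIZE read as bytes
        long sixteenMb = 16L * 1024 * 1024;    // default suggested in option 1

        System.out.println(splitCount(totalBytes, oldDefaultAsBytes)); // 16384
        System.out.println(splitCount(totalBytes, sixteenMb));         // 64
    }
}
```

Under these assumptions, reading the old default as bytes yields 256 times as many map tasks as a 16MB split size would.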