[
https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Alex Liu updated CASSANDRA-9074:
--------------------------------
Comment: was deleted
(was: Can you provide details on how to reproduce the issue, such as the table
schema, data, and Hadoop query, so we can reproduce and debug it? Does it
error out in a one-node cluster?)
> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-9074
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java
> cassandra-driver-core 2.1.4
> Reporter: fuggy_yama
> Assignee: Alex Liu
> Priority: Minor
> Fix For: 2.0.15
>
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run
> a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that
> reads data from that table, and I see that only 7k rows are read in the map
> phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is
> built to select node-local data, and that a LIMIT clause is also added (1k by
> default). So the 7k rows read can be explained:
> 7 nodes * 1k limit = 7k rows read in total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the rest of the rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row count?
> What if the number of rows I need to read per Cassandra node is very large?
> In other words, how do I avoid an OOM when setting InputCQLPageRowSize very
> large to handle all the data?
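
The "7 nodes * 1k limit" arithmetic in the description can be checked with a
small stand-alone sketch. It does not use any Cassandra or Hadoop API. The
class and method names are hypothetical, and the even split of the 10k rows
across the 7 nodes is an assumption made only for illustration. It compares
reading a single LIMIT-sized page per node with looping over pages until each
node's rows are exhausted.

```java
// Hypothetical simulation; not the CqlInputFormat implementation.
class PagingSketch {

    // Reads at most one page (pageSize rows) from each node and then stops,
    // which is the behavior the reporter observed.
    static int readFirstPageOnly(int[] rowsPerNode, int pageSize) {
        int total = 0;
        for (int rows : rowsPerNode) {
            total += Math.min(rows, pageSize);
        }
        return total;
    }

    // Keeps requesting pages from each node until no rows remain. Only one
    // page's worth of rows would need to be held in memory at a time.
    static int readAllPages(int[] rowsPerNode, int pageSize) {
        int total = 0;
        for (int rows : rowsPerNode) {
            int offset = 0;
            while (offset < rows) {
                int page = Math.min(pageSize, rows - offset);
                total += page;
                offset += page;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed distribution: 10,000 rows spread almost evenly over 7 nodes.
        int[] rowsPerNode = {1429, 1429, 1429, 1429, 1428, 1428, 1428};
        int pageSize = 1000; // the 1k default mentioned in the description

        System.out.println("first page only: " + readFirstPageOnly(rowsPerNode, pageSize));
        System.out.println("all pages:       " + readAllPages(rowsPerNode, pageSize));
    }
}
```

Under these assumptions the single-page variant reads 7,000 rows and the
paging loop reads all 10,000. The page size then bounds memory use per
request rather than the total number of rows read.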