[
https://issues.apache.org/jira/browse/CASSANDRA-9074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
fuggy_yama updated CASSANDRA-9074:
----------------------------------
Description:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a
Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads
data from that table, and I see that only 7k rows are read into the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built
to select node-local data and that a LIMIT clause is also added (1k by
default). So the 7k rows read can be explained:
7 nodes * 1k limit = 7k rows read in total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
Please help me with the questions below:
Is this the desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug, or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in the table and do not know the row count?
What if the number of rows I need to read per Cassandra node is very large? In
other words, how do I avoid OOM when setting InputCQLPageRowSize very large to
handle all the data?
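
The arithmetic in the description can be checked with a small stand-alone model. This is only a sketch of the behavior the reporter suspects, not the CqlInputFormat implementation: the class and method names are hypothetical, the 10k rows are assumed to be spread almost evenly over 7 splits, and no Cassandra or Hadoop API is called. It compares one LIMIT-capped query per split with a loop that keeps requesting pages of the same size until a short page comes back.

```java
public class SplitPagingSketch {

    /**
     * Hypothetical model: each split issues ONE query capped by LIMIT and
     * never asks for more, so at most `limit` rows come back per split.
     */
    static long readWithSingleLimitedQuery(long[] rowsPerSplit, int limit) {
        long total = 0;
        for (long rows : rowsPerSplit) {
            total += Math.min(rows, limit);
        }
        return total;
    }

    /**
     * Hypothetical model: each split keeps requesting pages of `pageSize`
     * rows and stops when a page comes back shorter than `pageSize`.
     * Only one page is "in flight" at a time, so the page size bounds
     * memory use regardless of how many rows the split holds.
     */
    static long readWithPaging(long[] rowsPerSplit, int pageSize) {
        long total = 0;
        for (long rows : rowsPerSplit) {
            long offset = 0;
            while (true) {
                long page = Math.min(pageSize, rows - offset);
                total += page;
                offset += page;
                if (page < pageSize) {
                    break; // short (or empty) page: this split is exhausted
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed distribution: 4 * 1429 + 3 * 1428 = 10000 rows over 7 splits.
        long[] rowsPerSplit = {1429, 1429, 1429, 1429, 1428, 1428, 1428};

        System.out.println(readWithSingleLimitedQuery(rowsPerSplit, 1000)); // prints 7000
        System.out.println(readWithPaging(rowsPerSplit, 1000));             // prints 10000
    }
}
```

Under these assumptions the single-query model reproduces the observed 7k rows, while the paging loop returns all 10k with the same page size. That is why the question of whether the input format pages matters more than the value passed to setInputCQLPageRowSize.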
was:
I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a
Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads
data from that table, and I see that only 7k rows are read into the map phase.
I checked the CqlInputFormat source code and noticed that a CQL query is built
to select node-local data and that a LIMIT clause is also added (1k by
default). So the 7k rows read can be explained:
7 nodes * 1k limit = 7k rows read in total
The limit can be changed using CqlConfigHelper:
CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
Please help me with the questions below:
Is this the desired behavior?
Why does CqlInputFormat not page through the rest of the rows?
Is it a bug, or should I just increase the InputCQLPageRowSize value?
What if I want to read all the data in the table and do not know the row count?
> Hadoop Cassandra CqlInputFormat pagination - not reading all input rows
> -----------------------------------------------------------------------
>
> Key: CASSANDRA-9074
> URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
> Project: Cassandra
> Issue Type: Bug
> Components: Hadoop
> Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java
> cassandra-driver-core 2.1.4
> Reporter: fuggy_yama
> Priority: Minor
>
> I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run
> a Hadoop job (the datanodes reside on the Cassandra nodes, of course) that
> reads data from that table, and I see that only 7k rows are read into the map
> phase.
> I checked the CqlInputFormat source code and noticed that a CQL query is built
> to select node-local data and that a LIMIT clause is also added (1k by
> default). So the 7k rows read can be explained:
> 7 nodes * 1k limit = 7k rows read in total
> The limit can be changed using CqlConfigHelper:
> CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");
> Please help me with the questions below:
> Is this the desired behavior?
> Why does CqlInputFormat not page through the rest of the rows?
> Is it a bug, or should I just increase the InputCQLPageRowSize value?
> What if I want to read all the data in the table and do not know the row
> count?
> What if the number of rows I need to read per Cassandra node is very large?
> In other words, how do I avoid OOM when setting InputCQLPageRowSize very
> large to handle all the data?