fuggy_yama created CASSANDRA-9074:
-------------------------------------

             Summary: Hadoop Cassandra CqlInputFormat pagination - not reading 
all input rows
                 Key: CASSANDRA-9074
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9074
             Project: Cassandra
          Issue Type: Bug
          Components: Hadoop
         Environment: Cassandra 2.0.11, Hadoop 1.0.4, Datastax java 
cassandra-driver-core 2.1.4
            Reporter: fuggy_yama
            Priority: Minor


I have a 7-node Cassandra (v2.0.11) cluster and a table with 10k rows. I run a 
Hadoop job (the datanodes reside on the Cassandra nodes, of course) that reads 
data from that table, and I see that only 7k rows are read in the map phase.

I checked the CqlInputFormat source code and noticed that a CQL query is built 
to select node-local data, and that a LIMIT clause is also added (1k by 
default). So the 7k rows read can be explained:
7 nodes * 1k limit = 7k rows read total
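
The arithmetic above can be sketched as a small, self-contained Java model. It 
is only an illustration of the hypothesis, not the CqlInputFormat code: the 
class PageLimitModel and its methods are hypothetical names, and the model 
assumes that rows are spread evenly over the nodes and that each node is read 
by a single query.

```java
public class PageLimitModel {

    // Assumption: rows are distributed as evenly as possible across the nodes.
    static int[] rowsPerNode(int totalRows, int nodes) {
        int[] counts = new int[nodes];
        for (int i = 0; i < nodes; i++) {
            counts[i] = totalRows / nodes + (i < totalRows % nodes ? 1 : 0);
        }
        return counts;
    }

    // Models the suspected behavior: one LIMIT-ed query per node, no paging.
    static int readOnePagePerNode(int[] counts, int pageSize) {
        int total = 0;
        for (int count : counts) {
            total += Math.min(count, pageSize);
        }
        return total;
    }

    // Models the expected behavior: keep fetching pages from a node until a
    // page comes back shorter than the page size, so no row count is needed.
    static int readAllPages(int[] counts, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        int total = 0;
        for (int count : counts) {
            int remaining = count;
            while (true) {
                int fetched = Math.min(remaining, pageSize);
                total += fetched;
                remaining -= fetched;
                if (fetched < pageSize) {
                    break; // short page: this node is exhausted
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        int[] counts = rowsPerNode(10_000, 7);
        System.out.println(readOnePagePerNode(counts, 1000)); // prints 7000
        System.out.println(readAllPages(counts, 1000));       // prints 10000
    }
}
```

With 10k rows on 7 nodes every node holds more than 1k rows, so the 
single-query model returns exactly 7k rows, which matches what the job reads.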

The limit can be changed using CqlConfigHelper:

CqlConfigHelper.setInputCQLPageRowSize(job.getConfiguration(), "1000");

Please help me with the questions below: 
Is this the desired behavior? 
Why does CqlInputFormat not page through the rest of the rows? 
Is it a bug, or should I just increase the InputCQLPageRowSize value? 
What if I want to read all the data in a table and do not know the row count?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)