Manish created CASSANDRA-8404:
---------------------------------

             Summary: CQLSSTableLoader can not create SSTable for csv file of 
10M rows.
                 Key: CASSANDRA-8404
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8404
             Project: Cassandra
          Issue Type: Bug
         Environment: I am using Cassandra 2.1.1 on 32-bit Ubuntu 12.04. I am 
running the program with -Xmx1000M.
manish@manish[~]:> uname -a
Linux manish 3.2.0-72-generic-pae #107-Ubuntu SMP Thu Nov 6 14:44:10 UTC 2014 
i686 i686 i386 GNU/Linux

            Reporter: Manish
         Attachments: Test1.java, cassandra.yaml

I am able to create an SSTable from one 10M-row file but not from another. 
The data file that works is subscribers1.gz and the one that does not is 
subscribers2.gz. Both files have the same values in the first column but 
different values in the second column. I wonder why CQLSSTableLoader fails on 
a different set of data. 
The program expects unzipped txt files, so please unzip the files before 
running it. What I have observed is heavy GC activity once the program has 
processed around 5.2M lines of subscribers2.gz. It gets as far as 5.8M lines 
with very frequent Full GC runs, and cannot process beyond 5.8M rows because 
it runs out of memory.
I have attached the Test1.java and cassandra.yaml I used for creating the 
SSTable. On the classpath I specify all jars from the lib folder of the 
extracted apache-cassandra-2.1.1-bin.tar.gz.
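
To pin down where the memory pressure starts, one option is to sample heap 
usage every N input lines while reading the file. The following is only a 
minimal sketch under stated assumptions: HeapProgress, Sample and scan are 
hypothetical names that do not appear in Test1.java or in Cassandra, and the 
step that would hand each row to the SSTable writer is left as a comment.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical diagnostic helper; not part of Test1.java or of Cassandra. */
final class HeapProgress {

    /** One sample: lines read so far and heap in use at that moment, in bytes. */
    static final class Sample {
        final long lines;
        final long usedHeapBytes;

        Sample(long lines, long usedHeapBytes) {
            this.lines = lines;
            this.usedHeapBytes = usedHeapBytes;
        }
    }

    /** Approximate heap in use; this includes garbage that has not been collected yet. */
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Reads every line and records one heap sample after each `interval` lines. */
    static List<Sample> scan(BufferedReader in, long interval) {
        if (interval <= 0) {
            throw new IllegalArgumentException("interval must be positive");
        }
        List<Sample> samples = new ArrayList<>();
        long lines = 0;
        try {
            String line;
            while ((line = in.readLine()) != null) {
                // Assumption: the real program would parse `line` here and
                // pass the resulting row to its SSTable writer.
                lines++;
                if (lines % interval == 0) {
                    samples.add(new Sample(lines, usedHeapBytes()));
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return samples;
    }
}
```

Calling scan with a BufferedReader over the unzipped data file and an 
interval of, say, 100000 would return one sample per 100000 lines. Printing 
the samples for both files side by side should show whether heap usage 
diverges well before the 5.2M-line mark or only near it.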

Jira does not allow attachments larger than 10 MB, so I am sharing the data 
files on Google Drive.
link to download subscribers1.gz
https://drive.google.com/file/d/0B6_-ugKWlrfoOTRTa2FCNTFWU2c/view?usp=sharing

link to download subscribers2.gz
https://drive.google.com/file/d/0B6_-ugKWlrfocndycm9yM21rN0E/view?usp=sharing




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
