[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats

2012-02-28 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218192#comment-13218192
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

OK, so I tested the patch for BulkOutputFormat with a timeout of 30 seconds.
1) First I tried it without applying the patch, and it threw a timeout 
exception.
2) Then I applied the patch, and it worked properly.
This means that with the patch progress reporting is working, but it is not 
reporting progress every second while loading (can you explain this?), because 
the same patch throws a timeout exception with a timeout of 10 seconds.
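
The observation above (success with a 30-second timeout, failure with a 10-second one) would be consistent with at least one gap between progress reports being longer than 10 seconds. The following is a minimal sketch of that rule, not Hadoop code: the class `ProgressGapCheck`, its methods, and the sample timeline are hypothetical, and treating "gap reaches the timeout" as the kill condition is a simplifying assumption.

```java
// Hypothetical illustration; not part of Hadoop or Cassandra.
final class ProgressGapCheck {
    /**
     * Returns the longest interval (ms) without a progress report between
     * startMs and endMs. reportTimesMs must be sorted ascending and lie
     * within [startMs, endMs].
     */
    static long longestGapMs(long startMs, long endMs, long[] reportTimesMs) {
        long longest = 0;
        long previous = startMs;
        for (long t : reportTimesMs) {
            longest = Math.max(longest, t - previous);
            previous = t;
        }
        return Math.max(longest, endMs - previous);
    }

    /** Simplified rule (an assumption): the task is killed once a gap reaches the timeout. */
    static boolean wouldTimeOut(long timeoutMs, long startMs, long endMs, long[] reportTimesMs) {
        return longestGapMs(startMs, endMs, reportTimesMs) >= timeoutMs;
    }

    public static void main(String[] args) {
        // Made-up timeline: reports at 1 s and 2 s, then silence until 17 s; task ends at 18 s.
        long[] reports = {1_000, 2_000, 17_000};
        System.out.println("30 s timeout kills task: " + wouldTimeOut(30_000, 0, 18_000, reports));
        System.out.println("10 s timeout kills task: " + wouldTimeOut(10_000, 0, 18_000, reports));
    }
}
```

With this made-up timeline the longest silent stretch is 15 seconds, so only the 10-second timeout would be hit.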

 Add Progress Reporting to Cassandra OutputFormats
 -

 Key: CASSANDRA-3859
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop, Tools
Affects Versions: 1.1.0
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
  Labels: bulkloader, hadoop, mapreduce, sstableloader
 Fix For: 1.1.0

 Attachments: 0001-add-progress-reporting-to-BOF.txt, 
 0002-Add-progress-to-CFOF.txt

   Original Estimate: 48h
  Remaining Estimate: 48h

 When BulkOutputFormat is used to load data into Cassandra, the SSTable loader 
 should report progress to the Hadoop job. Otherwise, if streaming the data 
 for a particular task takes a long time and no progress is reported to the 
 job, Hadoop may kill the task with a timeout exception.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats

2012-02-27 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217154#comment-13217154
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

I have tested the patch on CDH2, the latest CDH3, and hadoop-0.20.203, with 
the timeout set to 10 seconds. Since you confirmed that progress is reported 
every second, the job should not time out with a 10-second timeout. But it 
still times out on all of the versions mentioned above.
Please test and let me know if I am missing something.





[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats

2012-02-24 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215544#comment-13215544
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

@Brandon Williams: I just want to check my understanding: in the provided 
patch, you report progress every 1000 ms while streaming the sstables, right?





[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats

2012-02-14 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207598#comment-13207598
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

I have not had time to test it yet. I will let you know soon.





[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-02-13 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206986#comment-13206986
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

The first 4 patches are working fine.
Erik can properly explain the patch related to CASSANDRA-3839.

 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1.0
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1.0

 Attachments: 0001-Make-DD-the-canonical-partitioner-source.txt, 
 0002-Prevent-loading-from-yaml.txt, 0003-use-output-partitioner.txt, 
 0004-update-BOF-for-new-dir-layout.txt, 0005-BWR-uses-any-if.txt


 I am trying to use BulkOutputFormat to stream data from the map phase of a 
 Hadoop job. I have set the Cassandra-related configuration using 
 ConfigHelper. I have also looked into the Cassandra code, and it seems 
 Cassandra takes care not to look for the cassandra.yaml file.
 But when I run the job I still get the following error:
 {
 12/01/13 11:30:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 12/01/13 11:30:04 INFO input.FileInputFormat: Total input paths to process : 1
 12/01/13 11:30:04 INFO mapred.JobClient: Running job: job_201201130910_0015
 12/01/13 11:30:05 INFO mapred.JobClient:  map 0% reduce 0%
 12/01/13 11:30:23 INFO mapred.JobClient: Task Id : 
 attempt_201201130910_0015_m_00_0, Status : FAILED
 java.lang.Throwable: Child Error
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
 attempt_201201130910_0015_m_00_0: Cannot locate cassandra.yaml
 attempt_201201130910_0015_m_00_0: Fatal configuration error; unable to 
 start server.
 }
 Also, please let me know how I can make this cassandra.yaml file available 
 to the Hadoop MapReduce job.





[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats

2012-02-08 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203433#comment-13203433
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

I have just checked the patches, and it seems that you have added progress 
reporting during the generation of the sstables (when the write() method of 
BulkRecordWriter is executed).
But in our case the timeout is caused by the time taken to stream the 
sstables to Cassandra (when the close() method of BulkRecordWriter is 
executed).
When SSTableLoader comes into the picture and starts loading the sstables, if 
the generated sstables are large and take more than 10 minutes to load 
(stream), I don't see any progress reporting there, and the task will fail 
with a timeout.
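
One way to keep a task alive during a long blocking step such as the streaming described above is to wait on the operation in short slices and report progress after each slice. The sketch below is a self-contained illustration under stated assumptions, not the contents of the attached patches: `ProgressSink` is a hypothetical stand-in for Hadoop's progress callback, and the "streaming" is simulated with a sleep.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch; ProgressSink stands in for the framework's progress callback.
final class StreamingProgressSketch {
    interface ProgressSink {
        void progress();
    }

    /**
     * Waits for a long-running operation, invoking sink.progress() each time
     * pollMs elapses without completion. Returns the number of reports made.
     */
    static int awaitWithProgress(Future<?> operation, ProgressSink sink, long pollMs)
            throws InterruptedException, ExecutionException {
        int reports = 0;
        while (true) {
            try {
                operation.get(pollMs, TimeUnit.MILLISECONDS);
                return reports; // operation finished
            } catch (TimeoutException stillRunning) {
                sink.progress(); // tell the framework the task is still alive
                reports++;
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Simulated "streaming" that takes roughly 300 ms.
            Future<?> streaming = pool.submit(() -> {
                try {
                    Thread.sleep(300);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            int reports = awaitWithProgress(streaming, () -> System.out.println("progress"), 50);
            System.out.println("reports made: " + reports);
        } finally {
            pool.shutdown();
        }
    }
}
```

The poll interval only needs to be comfortably shorter than the task timeout; the exact number of reports depends on scheduling and is not deterministic.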


 Add Progress Reporting to Cassandra OutputFormats
 -

 Key: CASSANDRA-3859
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop, Tools
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
  Labels: bulkloader, hadoop, mapreduce, sstableloader
 Fix For: 1.1

 Attachments: 0001-add-progress-reporting-to-BOF.txt, 
 0002-Add-progress-to-CFOF.txt

   Original Estimate: 48h
  Remaining Estimate: 48h





[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat

2012-02-06 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202090#comment-13202090
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

Yes, I have been experiencing this issue, and my job fails with the following 
exception:


{code}
12/02/06 11:40:25 INFO mapred.JobClient: Task Id : 
attempt_201202061119_0001_m_01_1, Status : FAILED Task 
attempt_201202061119_0001_m_01_1 failed to report status for 601 seconds. 
Killing!
{code}

 Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat
 --

 Key: CASSANDRA-3859
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop, Tools
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
  Labels: bulkloader, hadoop, mapreduce, sstableloader
 Fix For: 1.1

   Original Estimate: 48h
  Remaining Estimate: 48h





[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat

2012-02-06 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202107#comment-13202107
 ] 

Samarth Gahire commented on CASSANDRA-3859:
---

I think the getProgress() method you are talking about is not related to 
Hadoop progress reporting.
We should implement Hadoop progress reporting to resolve this problem.
I guess {code}org.apache.hadoop.mapreduce.TaskAttemptContext.progress(){code} 
is the method Hadoop provides to report progress.
I was trying to find the part of the code that streams the sstables to 
Cassandra and takes the time. I will try to report progress from that part of 
the code, and I need your help to locate it.

 Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat
 --

 Key: CASSANDRA-3859
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
 Project: Cassandra
  Issue Type: Improvement
  Components: Hadoop, Tools
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
  Labels: bulkloader, hadoop, mapreduce, sstableloader
 Fix For: 1.1

   Original Estimate: 48h
  Remaining Estimate: 48h





[jira] [Commented] (CASSANDRA-3851) Wrong Keyspace name is generated while streaming the sstables using BulkOutputFormat.

2012-02-04 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200442#comment-13200442
 ] 

Samarth Gahire commented on CASSANDRA-3851:
---

Got it.
Actually, in cassandra-trunk we handle it as:
{code}
// dir must be named by ks/cf for the loader
File outputdir = new File(getOutputLocation() + File.separator + keyspace
                          + File.separator + ConfigHelper.getOutputColumnFamily(conf));
{code}
That is why it creates the keyspace name properly.
So it's a bug in cassandra-1.1.

 Wrong Keyspace name is generated while streaming the sstables using 
 BulkOutputFormat.
 -

 Key: CASSANDRA-3851
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3851
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop, Tools
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
  Labels: bulkloader, hadoop, sstableloader
 Fix For: 1.1

   Original Estimate: 48h
  Remaining Estimate: 48h

 I have merged the committed changes of 
 [CASSANDRA-3828|https://issues.apache.org/jira/browse/CASSANDRA-3828] into my 
 cassandra-trunk, along with the changes for the OutputLocation.
 But when I tried to load the sstables with a Hadoop job, it resulted in the 
 following exception:
 {code}
 12/02/04 11:19:12 INFO mapred.JobClient:  map 6% reduce 0%
 12/02/04 11:19:14 INFO mapred.JobClient: Task Id : 
 attempt_201202041114_0001_m_01_1, Status : FAILED
 java.lang.RuntimeException: Could not retrieve endpoint ranges:
 at 
 org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:252)
 at 
 org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:117)
 at 
 org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:112)
 at 
 org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:182)
 at 
 org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:167)
 at 
 org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
 at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 at org.apache.hadoop.mapred.Child.main(Child.java:253)
 Caused by: InvalidRequestException (*why:There is no ring for the keyspace: 
 tmp*)
 at 
 org.apache.cassandra.thrift.Cassandra$describe_ring_result.read(Cassandra.java:24053)
 at 
 org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
 at 
 org.apache.cassandra.thrift.Cassandra$Client.recv_describe_ring(Cassandra.java:1065)
 at 
 org.apache.cassandra.thrift.Cassandra$Client.describe_ring(Cassandra.java:1052)
 at 
 org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:225)
 ... 12 more
 {code}
 After looking into the code I figured out that, because we set 
 OUTPUTLOCATION from the system property java.io.tmpdir, the output directory 
 is created as /tmp/Keyspace_Name.
 So when SSTableLoader derives the keyspace name with
 {code}
 this.keyspace = directory.getParentFile().getName();
 {code}
 it sets the keyspace name to tmp, which results in the above exception.
 I have changed the code to:
 {code}this.keyspace = directory.getName();{code}
 and it works perfectly.
 But I am wondering how it worked previously. Am I doing anything 
 wrong, or is it a bug?
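
The difference between the two directory layouts can be reproduced with `java.io.File` alone. This is a small illustration rather than Cassandra code: the class `KeyspaceFromDirectory`, its method names, and the `ColumnFamily_Name` path component are hypothetical, while `/tmp/Keyspace_Name` is the path from the description above.

```java
import java.io.File;

// Illustrative sketch using only java.io.File; not taken from SSTableLoader.
final class KeyspaceFromDirectory {
    /** Layout <output>/<keyspace>/<columnfamily>: the keyspace is the parent directory's name. */
    static String keyspaceFromParent(File directory) {
        return directory.getParentFile().getName();
    }

    /** Layout <output>/<keyspace>: the keyspace is the directory's own name. */
    static String keyspaceFromSelf(File directory) {
        return directory.getName();
    }

    public static void main(String[] args) {
        File ksOnly = new File(File.separator + "tmp" + File.separator + "Keyspace_Name");
        File ksAndCf = new File(ksOnly, "ColumnFamily_Name");
        // Parent-based lookup on the keyspace-only layout picks up the temp directory name.
        System.out.println(keyspaceFromParent(ksOnly));
        System.out.println(keyspaceFromSelf(ksOnly));
        // Parent-based lookup is correct once the directory is named ks/cf.
        System.out.println(keyspaceFromParent(ksAndCf));
    }
}
```

The parent-based lookup is only right when the output directory ends in keyspace/columnfamily, which is why the writer and the loader have to agree on the layout.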





[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-02-01 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197745#comment-13197745
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

I tested the patches; all of them except the yaml one work fine.
I have checked the yaml patch, and my job still looks for the cassandra.yaml 
file and fails.

 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1

 Attachments: 0001-Make-DD-the-canonical-partitioner-source.txt, 
 0002-Prevent-loading-from-yaml.txt, 0003-use-output-partitioner.txt, 
 0004-update-BOF-for-new-dir-layout.txt






[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-02-01 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13198577#comment-13198577
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

Cool! It's working perfectly with the updated patches.
Can you please explain:
1) What is the significance of INPUT_INITIAL_THRIFT_ADDRESS for 
BulkOutputFormat?
2) What am I supposed to provide there (if it is needed)?
3) Is there any need to provide the listen address of the Hadoop nodes for 
BulkOutputFormat? If yes, how do I provide it?

Actually, we are experiencing a problem while loading the data: it fails to 
connect if the host the M/R job is running on is dual-stack, i.e. has both 
IPv4 and IPv6. 
Also, it works when cassandra.yaml is provided; maybe it is reading the listen 
address or something else from cassandra.yaml.

 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1

 Attachments: 0001-Make-DD-the-canonical-partitioner-source.txt, 
 0002-Prevent-loading-from-yaml.txt, 0003-use-output-partitioner.txt, 
 0004-update-BOF-for-new-dir-layout.txt






[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-01-17 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188320#comment-13188320
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

Apart from these issues, I do not think we are handling the TTL case in 
BulkOutputFormat.
I have looked into the code and can only see the addColumn() method being 
used; addExpiringColumn() is not used anywhere.
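
As a rough illustration of what handling the TTL case could look like, the sketch below routes a column to an expiring variant when a positive TTL is supplied. Everything here is hypothetical: `ColumnWriter` is a simplified stand-in that only borrows the two method names mentioned above, and its String-based signatures are assumptions, not the real Cassandra writer API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the interface below is NOT the real Cassandra writer API.
final class TtlAwareWriteSketch {
    /** Simplified stand-in; only the method names come from the comment above. */
    interface ColumnWriter {
        void addColumn(String name, String value, long timestamp);
        void addExpiringColumn(String name, String value, long timestamp, int ttlSeconds);
    }

    /** Uses the expiring variant when a positive TTL is supplied, the plain one otherwise. */
    static void write(ColumnWriter writer, String name, String value, long timestamp, int ttlSeconds) {
        if (ttlSeconds > 0) {
            writer.addExpiringColumn(name, value, timestamp, ttlSeconds);
        } else {
            writer.addColumn(name, value, timestamp);
        }
    }

    /** Records which method was called; for demonstration only. */
    static final class RecordingWriter implements ColumnWriter {
        final List<String> calls = new ArrayList<>();

        @Override
        public void addColumn(String name, String value, long timestamp) {
            calls.add("addColumn:" + name);
        }

        @Override
        public void addExpiringColumn(String name, String value, long timestamp, int ttlSeconds) {
            calls.add("addExpiringColumn:" + name + ":" + ttlSeconds);
        }
    }

    public static void main(String[] args) {
        RecordingWriter writer = new RecordingWriter();
        write(writer, "plain", "v1", 1L, 0);
        write(writer, "expiring", "v2", 2L, 3600);
        System.out.println(writer.calls);
    }
}
```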

 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1






[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-01-14 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186140#comment-13186140
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

I have looked into the code and can see the following issues:
 1) org.apache.cassandra.config.Config does not initialize all of its 
properties, which results in a null pointer exception in the static block of 
the DatabaseDescriptor class, for example for conf.commitlog_sync.

 2) I can't see any method in ConfigHelper to specify that I am using super 
columns, or to set the cluster name (I don't know if the cluster name is 
really needed).

 3) There is also no method in ConfigHelper to specify the comparator and 
subcomparator, and it seems the comparators have been hard-coded to BytesType.


 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1






[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-01-14 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186185#comment-13186185
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

One more issue:

I can set the partitioner in the jobconf using ConfigHelper, but it is never 
passed on to the DatabaseDescriptor and is not even initialized in 
DatabaseDescriptor.initDefaultsOnly().
But AbstractSSTableSimpleWriter uses the partitioner (at line number 94), 
which will result in a null pointer exception.

 While using BulkOutputFormat  unneccessarily look for the cassandra.yaml file.
 --

 Key: CASSANDRA-3740
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
 Project: Cassandra
  Issue Type: Bug
  Components: Hadoop
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
  Labels: cassandra, hadoop, mapreduce
 Fix For: 1.1


 I am trying to use BulkOutputFormat to stream the data from map of Hadoop 
 job. I have set the cassandra related configuration using ConfigHelper ,Also 
 have looked into Cassandra code seems Cassandra has taken care that it should 
 not look for the cassandra.yaml file.
 But still when I run the job i get the following error:
 {
 12/01/13 11:30:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing 
 the arguments. Applications should implement Tool for the same.
 12/01/13 11:30:04 INFO input.FileInputFormat: Total input paths to process : 1
 12/01/13 11:30:04 INFO mapred.JobClient: Running job: job_201201130910_0015
 12/01/13 11:30:05 INFO mapred.JobClient:  map 0% reduce 0%
 12/01/13 11:30:23 INFO mapred.JobClient: Task Id : 
 attempt_201201130910_0015_m_00_0, Status : FAILED
 java.lang.Throwable: Child Error
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
 Caused by: java.io.IOException: Task process exit with nonzero status of 1.
 at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
 attempt_201201130910_0015_m_00_0: Cannot locate cassandra.yaml
 attempt_201201130910_0015_m_00_0: Fatal configuration error; unable to 
 start server.
 }
 Also, please let me know how I can make the cassandra.yaml file available to a 
 Hadoop MapReduce job.





[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.

2012-01-13 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186086#comment-13186086
 ] 

Samarth Gahire commented on CASSANDRA-3740:
---

I am using the apache-cassandra-2011-12-27_22-01-50-bin.tar.gz build from 
Jenkins. Let me know if I am using the correct binaries.






[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x

2011-12-14 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169343#comment-13169343
 ] 

Samarth Gahire commented on CASSANDRA-3589:
---

I used the apache-cassandra-2011-12-14_03-12-43-bin.tar.gz binary downloaded 
from https://builds.apache.org/job/Cassandra/1255/artifact/cassandra/build/ 
to test sstable generation.
I hope I used the proper binary for testing. Please correct me if I am 
wrong.

There is a massive improvement. Thank you so much for the fix.
I am eagerly waiting for the sstable-loader fix.


 Degraded performance of sstable-generator api and sstable-loader utility in 
 cassandra 1.0.x
 ---

 Key: CASSANDRA-3589
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

 We are using the sstable generation API and the sstable-loader utility. As soon 
 as a newer version of Cassandra is released, I test it for sstable generation 
 and loading and measure the time taken by both processes. Up to Cassandra 0.8.7 
 there was no significant change in the time taken. But in all cassandra-1.0.x 
 releases I have seen performance degrade by a factor of 3-4 in generation and 
 by a factor of 2 in loading. Because of this we are not upgrading Cassandra to 
 the latest version: we process several terabytes of data every day, so the 
 time taken is very important.





[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x

2011-12-13 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13168321#comment-13168321
 ] 

Samarth Gahire commented on CASSANDRA-3589:
---

No, I do not have secondary indexes on any of the column families. I made a 
fair comparison and still saw a performance hit in the sstable-loader 
utility.






[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x

2011-12-09 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13166011#comment-13166011
 ] 

Samarth Gahire commented on CASSANDRA-3589:
---

I have checked the CPU and I/O utilization:

1) While generating sstables, CPU utilization is around 80% with 
cassandra-0.8.7 and around 90-95% with cassandra-1.0.5.
2) While generating sstables, we see disk writes every 20-25 seconds; 
cassandra-0.8.7 writes to disk at around 35 mbps, while cassandra-1.0.5 
writes at 19 mbps.

Apart from this I can't see any difference. Let me know if you need additional 
information.
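
As a small worked calculation on the two disk write rates quoted above (35 versus 19, in whatever unit "mbps" denotes here), the ratio comes to roughly 1.84. That is smaller than the 3-4x generation slowdown reported in the issue description. The class and method names below are made up for this sketch.

```java
// Illustrative helper; the names are hypothetical and not part of any Cassandra tool.
final class ThroughputRatio {
    // Returns how many times faster the baseline rate is than the candidate rate.
    // The result is unit-free, so it does not matter what "mbps" stands for.
    static double slowdown(double baselineRate, double candidateRate) {
        if (baselineRate <= 0 || candidateRate <= 0) {
            throw new IllegalArgumentException("rates must be positive");
        }
        return baselineRate / candidateRate;
    }

    public static void main(String[] args) {
        // Figures taken from the comment: 0.8.7 at about 35, 1.0.5 at 19.
        double ratio = slowdown(35.0, 19.0);
        System.out.printf("0.8.7 disk write rate is %.2fx that of 1.0.5%n", ratio);
    }
}
```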






[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x

2011-12-08 Thread Samarth Gahire (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165139#comment-13165139
 ] 

Samarth Gahire commented on CASSANDRA-3589:
---

Actually, I am just measuring the total time taken by our program to generate 
the sstables. I see this difference when I change the Cassandra jar included 
in the program's classpath.
I will check the CPU and I/O utilization and let you know as soon as possible.
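
A minimal sketch of the kind of wall-clock measurement described above, assuming the generation step can be wrapped in a Runnable. WallClockTimer and timeMillis are hypothetical names, and the loop in main is only a placeholder workload, not the actual sstable generation program.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical timing helper; not taken from the program discussed in this thread.
final class WallClockTimer {
    // Runs the task once and returns the elapsed wall-clock time in milliseconds.
    // System.nanoTime() is used because it is monotonic, unlike currentTimeMillis().
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }

    public static void main(String[] args) {
        long elapsed = timeMillis(() -> {
            // Placeholder workload standing in for the real generation step.
            long sum = 0;
            for (int i = 0; i < 50_000_000; i++) {
                sum += i;
            }
            if (sum < 0) {
                throw new IllegalStateException("unexpected overflow");
            }
        });
        System.out.println("elapsed ms: " + elapsed);
    }
}
```

Running the same harness once per Cassandra jar on the classpath gives directly comparable totals, although a single run does not separate CPU time from time spent waiting on disk.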

 Degraded performance of sstable-generator api and sstable-loader utility in 
 cassandra 1.0.x
 ---

 Key: CASSANDRA-3589
 URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
 Project: Cassandra
  Issue Type: Bug
  Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Priority: Minor

