[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13218192#comment-13218192 ]
Samarth Gahire commented on CASSANDRA-3859:
---
OK, so I tested the patch for BulkOutputFormat with a timeout of 30 seconds.
1) First I tried it without applying the patch, and it threw a timeout exception.
2) Then I applied the patch, and it worked properly.
This means that with the patch progress reporting is working, but it is not reporting progress every second while loading (can you explain this?), because the same patch throws a timeout exception with a timeout of 10 seconds.

Add Progress Reporting to Cassandra OutputFormats
-
Key: CASSANDRA-3859
URL: https://issues.apache.org/jira/browse/CASSANDRA-3859
Project: Cassandra
Issue Type: Improvement
Components: Hadoop, Tools
Affects Versions: 1.1.0
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
Labels: bulkloader, hadoop, mapreduce, sstableloader
Fix For: 1.1.0
Attachments: 0001-add-progress-reporting-to-BOF.txt, 0002-Add-progress-to-CFOF.txt
Original Estimate: 48h
Remaining Estimate: 48h

When we use BulkOutputFormat to load data into Cassandra, we should report progress to the Hadoop job from within the sstable loader. While loading the data for a particular task, if streaming takes a long time and progress is not reported to the job, Hadoop may kill the task with a timeout exception.
--
This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa For more information on JIRA, see: http://www.atlassian.com/software/jira
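The observation above (the job survives a 30-second timeout but not a 10-second one) suggests that the longest silence between two progress reports lies somewhere between 10 and 30 seconds. A minimal sketch of that reasoning follows. The class name, the helper names, and the report timestamps are all invented for illustration; nothing here is measured from the real job or taken from the patch.

```java
import java.util.Arrays;
import java.util.List;

public class ProgressGapCheck {
    /**
     * Longest silence in milliseconds, counting the gap from task start to the
     * first report and from the last report to task end.
     */
    static long maxGapMillis(long startMs, List<Long> reportTimesMs, long endMs) {
        long previous = startMs;
        long max = 0;
        for (long t : reportTimesMs) {
            max = Math.max(max, t - previous);
            previous = t;
        }
        return Math.max(max, endMs - previous);
    }

    /** A task survives only if it is never silent for as long as the timeout. */
    static boolean survives(long timeoutMs, long startMs, List<Long> reportTimesMs, long endMs) {
        return maxGapMillis(startMs, reportTimesMs, endMs) < timeoutMs;
    }

    public static void main(String[] args) {
        // Hypothetical report times: regular at first, then one 15-second silence.
        List<Long> reports = Arrays.asList(1_000L, 2_000L, 17_000L, 18_000L);
        System.out.println("max gap = " + maxGapMillis(0L, reports, 19_000L) + " ms");
        System.out.println("30 s timeout survives: " + survives(30_000L, 0L, reports, 19_000L));
        System.out.println("10 s timeout survives: " + survives(10_000L, 0L, reports, 19_000L));
    }
}
```

With these made-up numbers the longest gap is 15 seconds, so the 30-second run passes and the 10-second run does not, which matches the behaviour described in the comment.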
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13217154#comment-13217154 ]
Samarth Gahire commented on CASSANDRA-3859:
---
I have tested the patch on CDH2, the latest CDH3, and hadoop-0.20.203, with the timeout set to 10 seconds. Since you confirmed that progress is reported every second, this job should not time out with a 10-second timeout. But it still times out on all the versions mentioned above. Please test and let me know if I am missing something.
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13215544#comment-13215544 ]
Samarth Gahire commented on CASSANDRA-3859:
---
@Brandon Williams: I just want to check my understanding: in the provided patch you report progress every 1000 ms while streaming sstables, right?
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13207598#comment-13207598 ]
Samarth Gahire commented on CASSANDRA-3859:
---
I have not had time to test it yet. Will let you know soon.
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13206986#comment-13206986 ]
Samarth Gahire commented on CASSANDRA-3740:
---
The first 4 patches are working fine. Erik can explain the patch related to CASSANDRA-3839 properly.

While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
--
Key: CASSANDRA-3740
URL: https://issues.apache.org/jira/browse/CASSANDRA-3740
Project: Cassandra
Issue Type: Bug
Components: Hadoop
Affects Versions: 1.1.0
Reporter: Samarth Gahire
Assignee: Brandon Williams
Labels: cassandra, hadoop, mapreduce
Fix For: 1.1.0
Attachments: 0001-Make-DD-the-canonical-partitioner-source.txt, 0002-Prevent-loading-from-yaml.txt, 0003-use-output-partitioner.txt, 0004-update-BOF-for-new-dir-layout.txt, 0005-BWR-uses-any-if.txt

I am trying to use BulkOutputFormat to stream data from the map tasks of a Hadoop job. I have set the Cassandra-related configuration using ConfigHelper. I have also looked into the Cassandra code, and it seems Cassandra takes care not to look for the cassandra.yaml file. But when I run the job I still get the following error:
{
12/01/13 11:30:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
12/01/13 11:30:04 INFO input.FileInputFormat: Total input paths to process : 1
12/01/13 11:30:04 INFO mapred.JobClient: Running job: job_201201130910_0015
12/01/13 11:30:05 INFO mapred.JobClient: map 0% reduce 0%
12/01/13 11:30:23 INFO mapred.JobClient: Task Id : attempt_201201130910_0015_m_00_0, Status : FAILED
java.lang.Throwable: Child Error
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
    at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)
attempt_201201130910_0015_m_00_0: Cannot locate cassandra.yaml
attempt_201201130910_0015_m_00_0: Fatal configuration error; unable to start server.
}
Also, let me know how I can make this cassandra.yaml file available to the Hadoop MapReduce job.
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra OutputFormats
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13203433#comment-13203433 ]
Samarth Gahire commented on CASSANDRA-3859:
---
I have just checked the patches, and it seems that you have added progress reporting during the generation of sstables (when the write() method of BulkRecordWriter is executed). But in our case the timeout is caused by the time taken to stream the sstables to Cassandra (when the close() method of BulkRecordWriter is executed). When SSTableLoader comes into the picture and starts loading the sstables, if the generated sstables are big and take more than 10 minutes to load (stream), I don't see any progress reporting there, and the task will fail with a timeout.
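One way to keep a task alive during a long blocking step such as the streaming in close() is to wait on the work in short slices and report progress each time a slice expires. The sketch below uses only java.util.concurrent and is not the code from the attached patches. The `Progress` interface is a stand-in for whatever Hadoop hook reports progress, and `awaitWithProgress` is a hypothetical helper name.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicInteger;

public class StreamingHeartbeat {
    /** Stand-in for the Hadoop progress callback (an assumption, not the real type). */
    interface Progress {
        void progress();
    }

    /**
     * Blocks until the future completes, calling reporter.progress() every
     * intervalMs while it is still running.
     */
    static <T> T awaitWithProgress(Future<T> future, Progress reporter, long intervalMs)
            throws InterruptedException, ExecutionException {
        while (true) {
            try {
                return future.get(intervalMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException stillRunning) {
                // The work is not finished yet: tell the framework we are alive.
                reporter.progress();
            }
        }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicInteger reports = new AtomicInteger();
        // Simulated slow streaming step; the sleep stands in for the real transfer.
        Future<String> stream = pool.submit(() -> {
            Thread.sleep(350);
            return "streamed";
        });
        String result = awaitWithProgress(stream, reports::incrementAndGet, 100);
        pool.shutdown();
        System.out.println(result + " after " + reports.get() + " progress reports");
    }
}
```

The reporting interval only needs to be comfortably shorter than the task timeout. The number of reports printed depends on thread scheduling, so it is not fixed.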
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202090#comment-13202090 ]
Samarth Gahire commented on CASSANDRA-3859:
---
Yes, I have been experiencing this issue, and my job fails with the following exception:
{code}
12/02/06 11:40:25 INFO mapred.JobClient: Task Id : attempt_201202061119_0001_m_01_1, Status : FAILED
Task attempt_201202061119_0001_m_01_1 failed to report status for 601 seconds. Killing!
{code}
[jira] [Commented] (CASSANDRA-3859) Add Progress Reporting to Cassandra SstableLoader for BulkOutputFormat
[ https://issues.apache.org/jira/browse/CASSANDRA-3859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13202107#comment-13202107 ]
Samarth Gahire commented on CASSANDRA-3859:
---
I think the getProgress() method you are talking about is not related to Hadoop progress reporting. We should implement Hadoop progress reporting to resolve this problem. I guess {code}org.apache.hadoop.mapreduce.TaskAttemptContext.progress(){code} is the method provided by Hadoop to report progress. I was trying to find the part of the code that streams the sstables to Cassandra and takes the time. I will try to report progress from that part of the code. I need your help to figure that out.
[jira] [Commented] (CASSANDRA-3851) Wrong Keyspace name is generated while streaming the sstables using BulkOutputFormat.
[ https://issues.apache.org/jira/browse/CASSANDRA-3851?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13200442#comment-13200442 ]
Samarth Gahire commented on CASSANDRA-3851:
---
Got it. Actually, in cassandra-trunk we are handling it as
{code}
File outputdir = new File(getOutputLocation() + File.separator + keyspace + File.separator + ConfigHelper.getOutputColumnFamily(conf)); //dir must be named by ks/cf for the loader
{code}
That is the reason it creates the keyspace name properly. So it's a bug in cassandra-1.1.

Wrong Keyspace name is generated while streaming the sstables using BulkOutputFormat.
-
Key: CASSANDRA-3851
URL: https://issues.apache.org/jira/browse/CASSANDRA-3851
Project: Cassandra
Issue Type: Bug
Components: Hadoop, Tools
Affects Versions: 1.1
Reporter: Samarth Gahire
Assignee: Brandon Williams
Priority: Minor
Labels: bulkloader, hadoop, sstableloader
Fix For: 1.1
Original Estimate: 48h
Remaining Estimate: 48h

I have merged the committed changes of [CASSANDRA-3828|https://issues.apache.org/jira/browse/CASSANDRA-3828] into my cassandra-trunk, along with the changes for the OutputLocation. But when I tried to load the sstables with a Hadoop job, it resulted in the following exception:
{code}
12/02/04 11:19:12 INFO mapred.JobClient: map 6% reduce 0%
12/02/04 11:19:14 INFO mapred.JobClient: Task Id : attempt_201202041114_0001_m_01_1, Status : FAILED
java.lang.RuntimeException: Could not retrieve endpoint ranges:
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:252)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:117)
    at org.apache.cassandra.io.sstable.SSTableLoader.stream(SSTableLoader.java:112)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:182)
    at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:167)
    at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:650)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:765)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:369)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:259)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.mapred.Child.main(Child.java:253)
Caused by: InvalidRequestException (*why:There is no ring for the keyspace: tmp*)
    at org.apache.cassandra.thrift.Cassandra$describe_ring_result.read(Cassandra.java:24053)
    at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_describe_ring(Cassandra.java:1065)
    at org.apache.cassandra.thrift.Cassandra$Client.describe_ring(Cassandra.java:1052)
    at org.apache.cassandra.hadoop.BulkRecordWriter$ExternalClient.init(BulkRecordWriter.java:225)
    ... 12 more
{code}
After looking into the code I figured out that, because we are setting the OUTPUTLOCATION with the system property java.io.tmpdir, the output directory is created as /tmp/Keyspace_Name. So in SSTableLoader, when the keyspace name is derived as
{code}
this.keyspace = directory.getParentFile().getName();
{code}
it sets the keyspace name to tmp, which results in the above exception. I have changed the code to
{code}this.keyspace = directory.getName();{code}
and it works perfectly. But I am wondering how it was working fine previously. Am I doing anything wrong, or is it a bug?
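The path arithmetic described in this report can be reproduced with java.io.File alone. The sketch below is only an illustration: the class and method names and the directory names Keyspace1 and Standard1 are made up, and no Cassandra class is involved. It shows why taking the parent directory name yields tmp for a flat /tmp/Keyspace_Name layout but yields the keyspace for a keyspace/columnfamily layout.

```java
import java.io.File;

public class KeyspaceFromPath {
    /** Keyspace taken from the parent directory, as in the quoted getParentFile().getName() line. */
    static String fromParent(File directory) {
        return directory.getParentFile().getName();
    }

    /** Keyspace taken from the directory itself, as in the reporter's local change. */
    static String fromSelf(File directory) {
        return directory.getName();
    }

    public static void main(String[] args) {
        File flat = new File("/tmp/Keyspace1");             // output dir without a column family level
        File nested = new File("/tmp/Keyspace1/Standard1"); // output dir laid out as keyspace/columnfamily

        System.out.println(fromParent(flat));   // prints "tmp", the wrong keyspace
        System.out.println(fromSelf(flat));     // prints "Keyspace1"
        System.out.println(fromParent(nested)); // prints "Keyspace1"
    }
}
```

Both approaches can give the right answer. What matters is that the code creating the output directory and the code reading the keyspace back agree on the layout.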
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13197745#comment-13197745 ]
Samarth Gahire commented on CASSANDRA-3740:
---
I tested the patches; all of them except the yaml one work fine. I have checked the yaml patch, and my job still looks for the cassandra.yaml file and fails.
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13198577#comment-13198577 ]
Samarth Gahire commented on CASSANDRA-3740:
---
Cool! It's working perfectly with the updated patches. Can you please explain:
1) What is the significance of INPUT_INITIAL_THRIFT_ADDRESS for BulkOutputFormat?
2) What am I supposed to provide there (if it is needed)?
3) Is there any need to provide the listen address of the Hadoop nodes for BulkOutputFormat? If yes, how do I provide it?
Actually, we are experiencing a problem while loading the data: it fails to connect if the host the M/R job is running on is dual-stack, i.e. has both IPv4 and IPv6. Also, it works when cassandra.yaml is provided; maybe it is reading the listen address or something else from cassandra.yaml.
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13188320#comment-13188320 ]
Samarth Gahire commented on CASSANDRA-3740:
---
Apart from these issues, I do not think we are considering the TTL case in BulkOutputFormat. I have looked into the code and can only see the addColumn() method; addExpiringColumn() is not used anywhere.
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186140#comment-13186140 ]
Samarth Gahire commented on CASSANDRA-3740:
---
I have checked the code and I can see the following issues:
1) org.apache.cassandra.config.Config does not initialize all of its properties, which results in a null pointer exception in a static block of class DatabaseDescriptor, for example on conf.commitlog_sync.
2) I can't see any method in ConfigHelper to specify that I am using a supercolumn, or to set the cluster name (I don't know if the cluster name is really needed).
3) There is also no method to specify the comparator and subcomparator in ConfigHelper, and it seems the comparators have been hard-coded to BytesType.
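The first point above, an unset config field dereferenced inside a static block, can be illustrated without any Cassandra code. In the sketch below `Config` and `Descriptor` are hypothetical stand-ins, not the real org.apache.cassandra.config classes, and the field name is invented. It only shows the general Java behaviour: the NullPointerException surfaces as an ExceptionInInitializerError the first time the class is used.

```java
public class StaticInitSketch {
    /** Hypothetical config holder whose fields default to null. */
    static class Config {
        String commitlogSync; // never assigned, so it stays null
    }

    /** Hypothetical class that reads the config in a static block. */
    static class Descriptor {
        static final Config conf = new Config();
        static final String syncMode;

        static {
            // Dereferencing the unset field throws NullPointerException here.
            syncMode = conf.commitlogSync.toUpperCase();
        }
    }

    /** Returns the simple name of the error that prevented Descriptor from initializing. */
    static String probe() {
        try {
            return Descriptor.syncMode;
        } catch (ExceptionInInitializerError first) {
            // First use of the class: the original exception is available as the cause.
            return first.getCause().getClass().getSimpleName();
        } catch (NoClassDefFoundError later) {
            // Any later use: the JVM only reports that the class could not be initialized.
            return later.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(probe()); // first call prints "NullPointerException"
    }
}
```

Giving every field a default, or checking for null before use, keeps the static block from failing when no configuration file has been loaded.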
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186185#comment-13186185 ]
Samarth Gahire commented on CASSANDRA-3740:
---
One more issue: I can set the partitioner in the jobconf using ConfigHelper, but it is nowhere used to set the partitioner in DatabaseDescriptor, and it is not even initialized in DatabaseDescriptor.initDefaultsOnly(). But AbstractSSTableSimpleWriter uses the partitioner (at line number 94), which will result in a null pointer exception.
[jira] [Commented] (CASSANDRA-3740) While using BulkOutputFormat unneccessarily look for the cassandra.yaml file.
[ https://issues.apache.org/jira/browse/CASSANDRA-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13186086#comment-13186086 ]
Samarth Gahire commented on CASSANDRA-3740:
---
I am using the apache-cassandra-2011-12-27_22-01-50-bin.tar.gz build from Jenkins. Let me know if I am using the correct binaries.
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13169343#comment-13169343 ]

Samarth Gahire commented on CASSANDRA-3589:
-------------------------------------------

I used the apache-cassandra-2011-12-14_03-12-43-bin.tar.gz binary, downloaded from https://builds.apache.org/job/Cassandra/1255/artifact/cassandra/build/, to test sstable generation. I hope this is the proper binary for testing; please correct me if I am wrong. There is a massive improvement. Thank you so much for the fix. I am eagerly waiting for the sstable-loader fix.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
--------------------------------------------------------------------------------------------

Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

We are using the sstable generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, measuring the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all Cassandra 1.0.x versions I have seen generation take 3-4 times longer and loading take 2 times longer. Because of this we are not upgrading Cassandra to the latest version: we process several terabytes of data every day, so the time taken is very important.
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13168321#comment-13168321 ]

Samarth Gahire commented on CASSANDRA-3589:
-------------------------------------------

No, I do not have secondary indexes on any of the column families. I have made a fair comparison and still see a performance hit in the sstable-loader utility.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
--------------------------------------------------------------------------------------------

Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

We are using the sstable generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, measuring the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all Cassandra 1.0.x versions I have seen generation take 3-4 times longer and loading take 2 times longer. Because of this we are not upgrading Cassandra to the latest version: we process several terabytes of data every day, so the time taken is very important.
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13166011#comment-13166011 ]

Samarth Gahire commented on CASSANDRA-3589:
-------------------------------------------

I have checked the CPU and I/O utilization:

1) While generating sstables, CPU utilization is around 80% with cassandra-0.8.7 and around 90-95% with cassandra-1.0.5.
2) While generating sstables, we see I/O (disk writes) every 20-25 seconds. cassandra-0.8.7 writes to disk at around 35mbps, while cassandra-1.0.5 writes at 19mbps.

Apart from this I can't see any difference. Let me know if you need additional information.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
--------------------------------------------------------------------------------------------

Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Assignee: Sylvain Lebresne
Priority: Minor

We are using the sstable generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, measuring the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all Cassandra 1.0.x versions I have seen generation take 3-4 times longer and loading take 2 times longer. Because of this we are not upgrading Cassandra to the latest version: we process several terabytes of data every day, so the time taken is very important.
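For reference, the disk write rates quoted in the comment above can be put on a common footing with a small calculation. This is only an illustrative sketch: the function name `slowdown` is hypothetical, and the two rates are simply the figures reported in the comment, taken in whatever unit the reporter measured them.

```python
def slowdown(old_rate: float, new_rate: float) -> float:
    """Return how many times lower new_rate is compared with old_rate."""
    if new_rate <= 0:
        raise ValueError("new_rate must be positive")
    return old_rate / new_rate


# Disk write rates as reported in the comment (same unit on both sides).
write_rate_0_8_7 = 35.0
write_rate_1_0_5 = 19.0

ratio = slowdown(write_rate_0_8_7, write_rate_1_0_5)
print(f"cassandra-1.0.5 write rate is about {ratio:.2f}x lower than cassandra-0.8.7")
```

The resulting factor is below 2, which is smaller than the 3-4 times longer generation time given in the issue description, so the write rate alone may not account for the whole difference.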
[jira] [Commented] (CASSANDRA-3589) Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
[ https://issues.apache.org/jira/browse/CASSANDRA-3589?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13165139#comment-13165139 ]

Samarth Gahire commented on CASSANDRA-3589:
-------------------------------------------

Actually, I am just measuring the total time taken by our program to generate the sstables. I see this difference when I change the Cassandra jar included in the program's classpath. I will check the CPU and I/O utilisation and let you know as soon as possible.

Degraded performance of sstable-generator api and sstable-loader utility in cassandra 1.0.x
--------------------------------------------------------------------------------------------

Key: CASSANDRA-3589
URL: https://issues.apache.org/jira/browse/CASSANDRA-3589
Project: Cassandra
Issue Type: Bug
Components: Tools
Affects Versions: 1.0.0
Reporter: Samarth Gahire
Priority: Minor

We are using the sstable generation API and the sstable-loader utility. As soon as a newer version of Cassandra is released, I test it for sstable generation and loading, measuring the time taken by both processes. Up to Cassandra 0.8.7 there was no significant change in the time taken, but in all Cassandra 1.0.x versions I have seen generation take 3-4 times longer and loading take 2 times longer. Because of this we are not upgrading Cassandra to the latest version: we process several terabytes of data every day, so the time taken is very important.
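The measurement described in the comment above (wall-clock time for the same workload under two builds) might be sketched as follows. This is not the reporter's program: `time_run`, `compare`, and the two stand-in workloads are hypothetical, and a real comparison would instead launch the sstable generation program once per Cassandra jar.

```python
import time
from typing import Callable, Dict


def time_run(task: Callable[[], object]) -> float:
    """Run task once and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    task()
    return time.perf_counter() - start


def compare(tasks: Dict[str, Callable[[], object]]) -> Dict[str, float]:
    """Time each labelled task once and return a label -> seconds mapping."""
    return {label: time_run(task) for label, task in tasks.items()}


# Stand-in workloads; in practice each entry would run the same generation
# job with a different Cassandra jar on the classpath.
timings = compare({
    "build-a": lambda: sum(range(100_000)),
    "build-b": lambda: sum(range(200_000)),
})

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f} s")
```

A single run per build is noisy, so repeating each run a few times and comparing the spread would give a fairer picture.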