Re: map/reduce on Cassandra
+1

Regards,

On Mon, Jan 25, 2010 at 10:47 AM, Jeff Hodges wrote:
> 1) Works with RandomPartitioner. This is huge and the only way almost
> everyone would be able to use it.
> 2) Ability to divide up the keys of a single node to more than one
> mapper. The prototype just slurped up everything on the node. This
> would probably be easiest to not allow as a configurable thing and
> just let it be part of the InputSplit calculation.
> 3) Progress information should be calculated and displayed.
>
> --
> Jeff
>
> On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak wrote:
> > Multiple people have expressed an interest in 'hadoop integration' and
> > 'map/reduce functionality' within Cassandra. I'd like to get a feel for
> > what that means to different people.
> >
> > As a starting point for discussion, Jeff Hodges undertook a prototype
> > effort last summer which was the subject of this thread:
> > http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3cf5f3a6290907240123y22f065edp1649f7c5c1add...@mail.gmail.com%3e
> >
> > Jeff explicitly mentions data locality as one of the things that was out
> > of scope for the prototype. What other features or characteristics would
> > you expect to see in an implementation?
> >
> > Thanks,
> > Phil
Re: map/reduce on Cassandra
sstablekeys is really the wrong place to support m/r anyway; it just
shows that the index can handle what m/r will need.

On Mon, Jan 25, 2010 at 1:28 PM, Ryan Daum wrote:
> On Mon, Jan 25, 2010 at 2:18 PM, Brandon Williams wrote:
>> bin/sstablekeys will dump just the keys from an sstable without row
>> deserialization overhead, but it can't introspect a commitlog.
>> -Brandon
>
> Yes, and will it not also return the keys that are replicas from
> ranges 'belonging' to other nodes? I.e. running it on all boxes across
> a cluster with an RF > 1 would return duplicates where the data
> was replicated. Needs a flag to indicate uniqueness.
>
> Ryan
Re: map/reduce on Cassandra
On Mon, Jan 25, 2010 at 2:18 PM, Brandon Williams wrote:
> bin/sstablekeys will dump just the keys from an sstable without row
> deserialization overhead, but it can't introspect a commitlog.
> -Brandon

Yes, and will it not also return the keys that are replicas from
ranges 'belonging' to other nodes? I.e. running it on all boxes across
a cluster with an RF > 1 would return duplicates where the data
was replicated. Needs a flag to indicate uniqueness.

Ryan
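Ryan's duplicate-keys concern can be made concrete with a rough sketch.
This is not an existing sstablekeys flag or any Cassandra API. It only
shows one way a "uniqueness" filter could work: keep a key only on the
node whose primary range contains the key's token. The names token_for,
primary_owner and local_keys are hypothetical, and the MD5-based token is
only an approximation of what RandomPartitioner computes.

```python
import hashlib
from bisect import bisect_left


def token_for(key: str) -> int:
    """Approximate a RandomPartitioner-style token: absolute value of the
    key's MD5 digest read as a signed big-endian integer (an assumption)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return abs(int.from_bytes(digest, "big", signed=True))


def primary_owner(token: int, ring_tokens: list) -> int:
    """Return the token of the node whose primary range holds `token`.

    `ring_tokens` must be sorted. A node with token T is taken to own the
    range (previous node's token, T], wrapping around at the end of the ring.
    """
    idx = bisect_left(ring_tokens, token)
    return ring_tokens[idx % len(ring_tokens)]


def local_keys(keys, my_token: int, ring_tokens) -> list:
    """Keep only the keys this node is the primary owner of, so that a dump
    run on every node of a cluster with RF > 1 emits each key once."""
    ring = sorted(ring_tokens)
    return [k for k in keys if primary_owner(token_for(k), ring) == my_token]
```

Run on every node with the same ring description, each key would then be
emitted by exactly one node regardless of the replication factor.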
Re: map/reduce on Cassandra
On Mon, Jan 25, 2010 at 1:13 PM, Ryan Daum wrote:
> I agree with what Jeff says here about RandomPartitioner support being key.

+1

> For my purposes with map/reduce I'd personally be fine with some
> general all-keys dump utility that wrote contents of one node to a
> file, and then just write my own integration from that file into
> Hadoop, etc.
>
> I guess I'm thinking something similar to sstable2json except that
> unfortunately sstable2json will dump replica data not just the local
> node's data. Getting the contents of the commitlog into the file would
> be nice, too.

bin/sstablekeys will dump just the keys from an sstable without row
deserialization overhead, but it can't introspect a commitlog.

-Brandon
Re: map/reduce on Cassandra
I agree with what Jeff says here about RandomPartitioner support being key.

For my purposes with map/reduce I'd personally be fine with some
general all-keys dump utility that wrote contents of one node to a
file, and then just write my own integration from that file into
Hadoop, etc.

I guess I'm thinking something similar to sstable2json except that
unfortunately sstable2json will dump replica data not just the local
node's data. Getting the contents of the commitlog into the file would
be nice, too.

R

On Mon, Jan 25, 2010 at 1:47 PM, Jeff Hodges wrote:
> 1) Works with RandomPartitioner. This is huge and the only way almost
> everyone would be able to use it.
> 2) Ability to divide up the keys of a single node to more than one
> mapper. The prototype just slurped up everything on the node. This
> would probably be easiest to not allow as a configurable thing and
> just let it be part of the InputSplit calculation.
> 3) Progress information should be calculated and displayed.
> --
> Jeff
>
> On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak wrote:
>> Multiple people have expressed an interest in 'hadoop integration' and
>> 'map/reduce functionality' within Cassandra. I'd like to get a feel for what
>> that means to different people.
>>
>> As a starting point for discussion, Jeff Hodges undertook a prototype effort
>> last summer which was the subject of this thread:
>> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3cf5f3a6290907240123y22f065edp1649f7c5c1add...@mail.gmail.com%3e
>>
>> Jeff explicitly mentions data locality as one of the things that was out of
>> scope for the prototype. What other features or characteristics would you
>> expect to see in an implementation?
>>
>> Thanks,
>> Phil
Re: map/reduce on Cassandra
1) Works with RandomPartitioner. This is huge and the only way almost
everyone would be able to use it.
2) Ability to divide up the keys of a single node to more than one
mapper. The prototype just slurped up everything on the node. This
would probably be easiest to not allow as a configurable thing and
just let it be part of the InputSplit calculation.
3) Progress information should be calculated and displayed.
--
Jeff

On Mon, Jan 25, 2010 at 5:43 AM, Phillip Michalak wrote:
> Multiple people have expressed an interest in 'hadoop integration' and
> 'map/reduce functionality' within Cassandra. I'd like to get a feel for what
> that means to different people.
>
> As a starting point for discussion, Jeff Hodges undertook a prototype effort
> last summer which was the subject of this thread:
> http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3cf5f3a6290907240123y22f065edp1649f7c5c1add...@mail.gmail.com%3e
>
> Jeff explicitly mentions data locality as one of the things that was out of
> scope for the prototype. What other features or characteristics would you
> expect to see in an implementation?
>
> Thanks,
> Phil
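Items 2 and 3 in Jeff's list can be illustrated with a small sketch. It is
not code from the prototype and does not use the Hadoop InputSplit API. It
only shows the arithmetic such a calculation might rest on: cutting one
node's token range into several contiguous sub-ranges (one per mapper) and
estimating progress inside a sub-range. The function names are
hypothetical, the 2**127 token space is an assumption about
RandomPartitioner, and the progress estimate assumes rows are read in token
order.

```python
RING_SIZE = 2 ** 127  # assumed size of the RandomPartitioner token space


def split_range(start: int, end: int, parts: int) -> list:
    """Cut the ring range (start, end] into contiguous sub-ranges.

    The range may wrap past the end of the ring. start == end is treated
    as the whole ring. Returns a list of (lo, hi] pairs, one per mapper.
    """
    width = (end - start) % RING_SIZE or RING_SIZE
    parts = max(1, min(parts, width))  # never create empty sub-ranges
    splits = []
    lo = start
    for i in range(1, parts + 1):
        hi = (start + width * i // parts) % RING_SIZE
        splits.append((lo, hi))
        lo = hi
    return splits


def progress(split: tuple, current_token: int) -> float:
    """Fraction of a (lo, hi] sub-range consumed once `current_token` has
    been read, assuming rows arrive in token order."""
    lo, hi = split
    width = (hi - lo) % RING_SIZE or RING_SIZE
    done = (current_token - lo) % RING_SIZE
    return min(done / width, 1.0)


# One node's range handed to four mappers instead of one.
for sub_range in split_range(0, 100, 4):
    print(sub_range)
```

Because RandomPartitioner spreads keys roughly evenly over the token
space, equal-width sub-ranges should hold roughly equal numbers of rows,
which is why a fixed calculation rather than a user-facing setting may be
enough here.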
map/reduce on Cassandra
Multiple people have expressed an interest in 'hadoop integration' and
'map/reduce functionality' within Cassandra. I'd like to get a feel for what
that means to different people.

As a starting point for discussion, Jeff Hodges undertook a prototype effort
last summer which was the subject of this thread:
http://mail-archives.apache.org/mod_mbox/incubator-cassandra-dev/200907.mbox/%3cf5f3a6290907240123y22f065edp1649f7c5c1add...@mail.gmail.com%3e

Jeff explicitly mentions data locality as one of the things that was out of
scope for the prototype. What other features or characteristics would you
expect to see in an implementation?

Thanks,
Phil
RE: Map Reduce on Cassandra Store
AH, there we go. Thanks a lot Ryan!

-Mark

-----Original Message-----
From: Ryan King [mailto:r...@twitter.com]
Sent: Friday, December 04, 2009 12:17 PM
To: cassandra-user@incubator.apache.org
Subject: Re: Map Reduce on Cassandra Store

On Fri, Dec 4, 2009 at 8:44 AM, Mark Vigeant wrote:
> Hello!
>
> Has anyone tried to run MapReduce analytics on data stored in Cassandra? I
> feel like I saw a patch once to get hadoop working on top of Cassandra, but
> I can't find it now. I know that Hadoop integration is big on people's
> wishlists for future versions of Cassandra, but I'm just curious as to
> what's available now.

http://issues.apache.org/jira/browse/CASSANDRA-342

There's no easy way to do it now, but I know we will certainly need it
at some point (as will others), so I'm sure it will eventually happen.

-ryan

> Can anybody out there lend me a hand, or should I stick to HBase? Thanks a
> lot!
>
> Mark Vigeant
> RiskMetrics Group, Inc.

This email message and any attachments are for the sole use of the intended
recipients and may contain proprietary and/or confidential information which
may be privileged or otherwise protected from disclosure. Any unauthorized
review, use, disclosure or distribution is prohibited. If you are not an
intended recipient, please contact the sender by reply email and destroy the
original message and any copies of the message as well as any attachments to
the original message.
Re: Map Reduce on Cassandra Store
On Fri, Dec 4, 2009 at 8:44 AM, Mark Vigeant wrote:
> Hello!
>
> Has anyone tried to run MapReduce analytics on data stored in Cassandra? I
> feel like I saw a patch once to get hadoop working on top of Cassandra, but
> I can’t find it now. I know that Hadoop integration is big on people’s
> wishlists for future versions of Cassandra, but I’m just curious as to
> what’s available now.

http://issues.apache.org/jira/browse/CASSANDRA-342

There's no easy way to do it now, but I know we will certainly need it
at some point (as will others), so I'm sure it will eventually happen.

-ryan

> Can anybody out there lend me a hand, or should I stick to HBase? Thanks a
> lot!
>
> Mark Vigeant
> RiskMetrics Group, Inc.
Map Reduce on Cassandra Store
Hello!

Has anyone tried to run MapReduce analytics on data stored in Cassandra? I
feel like I saw a patch once to get hadoop working on top of Cassandra, but
I can't find it now. I know that Hadoop integration is big on people's
wishlists for future versions of Cassandra, but I'm just curious as to
what's available now.

Can anybody out there lend me a hand, or should I stick to HBase? Thanks a
lot!

Mark Vigeant
RiskMetrics Group, Inc.