Re: Is SuperColumn necessary?

2010-05-07 Thread Mike Malone
On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote: I would rather be interested in a tree-type structure where supercolumns have supercolumns in them. You don't need to compare all the columns to find a set of columns, and it will also reduce the bytes transferred for separator, at

timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
Hi everyone, I am trying to develop a mapreduce job that does a simple selection+filter on the rows in our store. Of course it is mostly based on the WordCount example :) Sadly, while it seems the app runs fine on a test keyspace with little data, when run on a larger test index (but still on a

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread S Ahmed
Toronto :) If not Toronto, Virginia. On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis jbel...@gmail.com wrote: We're planning that now. Where would you like to see one? On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote: Do you have rough ideas when you would be doing the

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 10:51, vineet daniel wrote: What is the benefit of creating a bloom filter when Cassandra writes data? How does it help? http://wiki.apache.org/cassandra/ArchitectureOverview -- David Strauss | da...@fourkitchens.com Four Kitchens | http://fourkitchens.com | +1 512 454

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread Matt Revelle
Reston, VA is a good spot in the DC metro area for tech events. The recent Pragmatic Programmer Clojure class sold out and already has two more return visits planned. On May 7, 2010, at 6:42 AM, S Ahmed sahmed1...@gmail.com wrote: Toronto :) If not Toronto, Virginia. On Thu, May 6,

Re: bloom filter

2010-05-07 Thread Peter Schüller
What is the benefit of creating a bloom filter when Cassandra writes data? How does it help? It allows Cassandra to answer requests for non-existent keys without going to disk, except in cases where the bloom filter gives a false positive. See:
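The behaviour Peter describes can be sketched in a few lines. This is a toy illustration only, not Cassandra's implementation: the class name, the bit-array size, the number of hashes, and the use of salted MD5 are all assumptions made for the example, and `disk` is just a dict standing in for an on-disk lookup.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: each key sets num_hashes positions in a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive the positions by salting the key with the hash index
        # (an assumption for this sketch, not Cassandra's hashing scheme).
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


def read_key(key, bloom, disk):
    """Skip the (simulated) disk lookup when the filter rules the key out."""
    if not bloom.might_contain(key):
        return None       # answered from memory, no disk access
    return disk.get(key)  # can still come back empty on a false positive
```

A key that was added is always reported as possibly present, so the filter never hides existing data; the only cost of a false positive is one wasted lookup.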

Re: bloom filter

2010-05-07 Thread vineet daniel
Thanks David and Peter. Is there any way to view the content of this file? ___ Vineet Daniel ___ Let your email find you On Fri, May 7, 2010 at 4:24 PM, David Strauss da...@fourkitchens.com wrote: On 2010-05-07 10:51,

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 10:55, Peter Schüller wrote: What is the benefit of creating a bloom filter when Cassandra writes data? How does it help? It allows Cassandra to answer requests for non-existent keys without going to disk, except in cases where the bloom filter gives a false positive. See:

Re: bloom filter

2010-05-07 Thread vineet daniel
1. Peter said 'without going to disk', so does that mean bloom filters reside in memory always, or just when a request to that particular CF is made? 2. It is also important for identifying which SSTable files to look inside even when a key is present. - David can you please throw some more light on your

Re: bloom filter

2010-05-07 Thread David Strauss
On 2010-05-07 11:03, vineet daniel wrote: 2. It is also important for identifying which SSTable files to look inside even when a key is present. - David can you please throw some more light on your point, like what are the implications, why do we need to identify etc. A bloom filter is almost

Re: pagination through slices with deleted keys

2010-05-07 Thread Joost Ouwerkerk
+1. There is some disagreement on whether or not the API should return empty columns or skip rows when no data is found. In all of our use cases, we would prefer skipped rows. And based on how frequently new cassandra users appear to be confused about the current behaviour, this might be a more

Re: timeout while running simple hadoop job

2010-05-07 Thread Joost Ouwerkerk
Huh? Isn't that the whole point of using Map/Reduce? On Fri, May 7, 2010 at 8:44 AM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like you need to configure Hadoop to not create a whole bunch of Map tasks at once On Fri, May 7, 2010 at 3:47 AM, gabriele renzi rff@gmail.com wrote: Hi

Re: timeout while running simple hadoop job

2010-05-07 Thread Matt Revelle
There's also the mapred.task.timeout property that can be tweaked. But reporting is the correct way to fix timeouts during execution. On May 7, 2010, at 8:49 AM, Joseph Stein wrote: The problem could be that you are crunching more data than will be completed within the interval expire

Re: timeout while running simple hadoop job

2010-05-07 Thread Joost Ouwerkerk
Joseph, the stacktrace suggests that it's Thrift that's timing out, not the Task. Gabriele, I believe that your problem is caused by too much load on Cassandra. Get_range_slices is presently an expensive operation. I had some success in reducing (although, it turns out, not eliminating) this

Re: timeout while running simple hadoop job

2010-05-07 Thread Jonathan Ellis
The whole point is to parallelize to use the available capacity across multiple machines. If you go past that point (fairly easy when you have a single machine) then you're just contending for resources, not making things faster. On Fri, May 7, 2010 at 7:48 AM, Joost Ouwerkerk

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 3:02 PM, Joost Ouwerkerk jo...@openplaces.org wrote: Joseph, the stacktrace suggests that it's Thrift that's timing out, not the Task. Gabriele, I believe that your problem is caused by too much load on Cassandra. Get_range_slices is presently an expensive operation. I

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 2:53 PM, Matt Revelle mreve...@gmail.com wrote: There's also the mapred.task.timeout property that can be tweaked. But reporting is the correct way to fix timeouts during execution. re: not reporting, I thought this was not needed with the new mapred api (Mapper class

Re: timeout while running simple hadoop job

2010-05-07 Thread gabriele renzi
On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like you need to configure Hadoop to not create a whole bunch of Map tasks at once Interesting; from a quick check it seems there are a dozen threads running. Yet setNumMapTasks seems to be deprecated (together

Re: Updating (as opposed to just setting) Cassandra data via Hadoop

2010-05-07 Thread Ian Kallen
On 5/6/10 3:26 PM, Stu Hood wrote: Ian: I think that as get_range_slice gets faster, the approach that Mark was heading toward may be considerably more efficient than reading the old value in the OutputFormat. Interesting, I'm trying to understand the performance impact of the different

Re: timeout while running simple hadoop job

2010-05-07 Thread Joseph Stein
You can limit the number of map tasks per node: mapred.tasktracker.map.tasks.maximum=1 On Fri, May 7, 2010 at 9:53 AM, gabriele renzi rff@gmail.com wrote: On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds like you need to configure Hadoop to not create a whole

Re: Cassandra training on May 21 in Palo Alto

2010-05-07 Thread S Ahmed
It would be great if you could make a video of this event. Yes, it won't be like being there 1-1, but it sure would help get up to speed. On Fri, May 7, 2010 at 6:56 AM, Matt Revelle mreve...@gmail.com wrote: Reston, VA is a good spot in the DC metro area for tech events. The recent Pragmatic

Re: Is SuperColumn necessary?

2010-05-07 Thread Eric Evans
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote: Follow-up from last week's discussion, I've been playing around with a simple column comparator for composite column names that I put up on github. I'd be interested to hear what people think of this approach.

Re: pagination through slices with deleted keys

2010-05-07 Thread Mike Malone
On Fri, May 7, 2010 at 5:29 AM, Joost Ouwerkerk jo...@openplaces.org wrote: +1. There is some disagreement on whether or not the API should return empty columns or skip rows when no data is found. In all of our use cases, we would prefer skipped rows. And based on how frequently new

Overfull node

2010-05-07 Thread David Koblas
I've got two (out of five) nodes on my cassandra ring that somehow got too full (e.g. over 60% disk space utilization). I've now gotten a few new machines added to the ring, but every time one of the overfull nodes attempts to stream its data it runs out of disk space... I've tried half a dozen

Re: key is sorted?

2010-05-07 Thread Roger Schildmeijer
Columns are sorted (see CompareWith/CompareSubcolumnsWith); keys are not. On 7 May 2010, at 22:10, AJ Chen wrote: I have a super column family for topic, key being the name of the topic. <ColumnFamily Name="Topic" CompareWith="UTF8Type" ColumnType="Super" CompareSubcolumnsWith="BytesType"/> When I

Re: Overfull node

2010-05-07 Thread Jonathan Ellis
If you're using RackUnawareStrategy (the default replication strategy) then you can bootstrap manually fairly easily -- copy all the data (not system) sstables from an overfull machine to a new machine, assign the new one a token that gives it about half of the old node's range, then start it with
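Picking a token that takes "about half of the old node's range" comes down to a midpoint calculation on the ring. The sketch below is only an illustration under stated assumptions: it assumes integer tokens in a ring of size 2**127 (as with a hash-based partitioner), and that a node owns the range from the previous node's token (exclusive) up to its own token (inclusive). The function name and parameters are hypothetical, not part of any Cassandra tool.

```python
RING_SIZE = 2 ** 127  # assumed token space; adjust for your partitioner


def midpoint_token(prev_token, node_token, ring_size=RING_SIZE):
    """Return a token halfway through the range (prev_token, node_token].

    prev_token is the token of the node just before the overfull node on the
    ring; node_token is the overfull node's own token. The range may wrap
    around the end of the ring.
    """
    span = (node_token - prev_token) % ring_size
    if span == 0:
        span = ring_size  # a single node owns the whole ring
    return (prev_token + span // 2) % ring_size
```

For example, with a previous token of 0 and an overfull node at token 100, the new node would be given token 50, leaving the old node with roughly the upper half of its former range.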

Re: key is sorted?

2010-05-07 Thread AJ Chen
thanks, that works. -aj On Fri, May 7, 2010 at 1:17 PM, Stu Hood stu.h...@rackspace.com wrote: Your IPartitioner implementation decides how the row keys are sorted: see http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You need to be using one of the
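As a rough illustration of Stu's point that the partitioner decides how row keys are ordered, the sketch below contrasts two orderings. It is a simplification under assumptions: the "order-preserving" case is modelled as plain string comparison and the "random" case as ordering by the MD5 digest of the key; both function names are hypothetical and neither is Cassandra's actual code.

```python
import hashlib


def order_preserving_order(keys):
    # Keys are compared directly, so a range scan returns them in key order.
    return sorted(keys)


def hashed_order(keys):
    # Keys are placed by the MD5 digest of the key, so iteration order is
    # unrelated to the keys' own sort order.
    return sorted(keys, key=lambda k: hashlib.md5(k.encode("utf-8")).digest())
```

With a hash-based ordering, topics such as "apple" and "apricot" are not expected to come back next to each other, which would explain seeing rows in an apparently arbitrary order.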

Data Modeling Conundrum

2010-05-07 Thread William Ashley
List, I have a case where visitors to a site are tracked via a persistent cookie containing a guid. This cookie is created and set when missing. Some of these visitors are logged in, meaning a userId may also be available. What I’m looking to do is have a way to associate each userId with all

RE: Cassandra vs. Voldemort benchmark

2010-05-07 Thread Todd Burruss
i did a lot of comparisons between voldemort and cassandra and in the end i decided to go with cassandra. the main reason was recovery and balancing operations. on the surface voldemort is s*** hot fast, until you need to restore a node or add nodes. BDB (the default persistence solution)

Re: BinaryMemtable and collisions

2010-05-07 Thread Tobias Jungen
Yes. When you flush from BMT, it's like any other SSTable. Cassandra will merge them through compaction. That's good news, thanks for clarifying! A few more related questions: Are there any problems with issuing the flush command directly from code at the end of a bulk insert? The BMT example

Benefits of using framed transport over non-framed transport?

2010-05-07 Thread 王一锋
Hi everyone, Can anyone shed some light on the benefits of using framed transport over non-framed transport? We are trying to sum up some performance tuning approaches for Cassandra in our project. Can framed transport be counted among them? Thanks 2010-05-08

Re: BinaryMemtable and collisions

2010-05-07 Thread Jake Luciani
Any reason why you aren't using Lucandra directly? On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen tobias.jun...@gmail.com wrote: Greetings, Started getting my feet wet with Cassandra in earnest this week. I'm building a custom inverted index of sorts on top of Cassandra, in part inspired by

Re: BinaryMemtable and collisions

2010-05-07 Thread Tobias Jungen
Without going into too much depth: Our retrieval model is a bit more structured than standard lucene retrieval, and I'm trying to leverage that structure. Some of the terms we're going to retrieve against have high occurrence, and because of that I'm worried about getting killed by processing

Re: BinaryMemtable and collisions

2010-05-07 Thread Jake Luciani
Got it. I'm working on making term vectors optional and just store frequency in this case. Just FYI. On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen tobias.jun...@gmail.com wrote: Without going into too much depth: Our retrieval model is a bit more structured than standard lucene retrieval, and