On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote:
I would be more interested in a tree-type structure where supercolumns can
contain supercolumns. You don't need to compare all the columns to find a
set of columns, and it will also reduce the bytes transferred for separator, at
Hi everyone,
I am trying to develop a mapreduce job that does a simple
selection+filter on the rows in our store.
Of course it is mostly based on the WordCount example :)
Sadly, while it seems the app runs fine on a test keyspace with little
data, when run on a larger test index (but still on a
Toronto :)
If not Toronto, Virginia.
On Thu, May 6, 2010 at 5:28 PM, Jonathan Ellis jbel...@gmail.com wrote:
We're planning that now. Where would you like to see one?
On Thu, May 6, 2010 at 2:40 PM, S Ahmed sahmed1...@gmail.com wrote:
Do you have rough ideas when you would be doing the
On 2010-05-07 10:51, vineet daniel wrote:
What is the benefit of creating a bloom filter when Cassandra writes data?
How does it help?
http://wiki.apache.org/cassandra/ArchitectureOverview
--
David Strauss | da...@fourkitchens.com
Four Kitchens | http://fourkitchens.com | +1 512 454
Reston, VA is a good spot in the DC metro area for tech events. The recent
Pragmatic Programmer Clojure class sold out and already has two more return
visits planned.
On May 7, 2010, at 6:42 AM, S Ahmed sahmed1...@gmail.com wrote:
Toronto :)
If not Toronto, Virginia.
On Thu, May 6,
What is the benefit of creating a bloom filter when Cassandra writes data? How
does it help?
It allows Cassandra to answer requests for non-existent keys without
going to disk, except in cases where the bloom filter gives a false
positive.
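The behaviour described above can be sketched in a few lines. This is a toy illustration only, not Cassandra's implementation: the class `BloomFilter`, the helper `read`, the bit-array size, and the use of salted MD5 as the hash family are all assumptions made for the example, and a plain dict stands in for the on-disk SSTable.

```python
import hashlib


class BloomFilter:
    """Toy bloom filter: a bit array plus k salted hash functions (hypothetical)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for readability

    def _positions(self, key):
        # Derive k bit positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.md5(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True means "present or false positive".
        return all(self.bits[pos] for pos in self._positions(key))


def read(key, bloom, disk):
    """Return (value, touched_disk). `disk` is a dict standing in for an SSTable."""
    if not bloom.might_contain(key):
        return None, False  # answered from memory, no disk access
    return disk.get(key), True  # real hit, or a false positive that costs a seek
```

A key that was never added is normally rejected by `might_contain` without touching `disk`; only a false positive pays for a lookup that then finds nothing.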
See:
Thanks David and Peter.
Is there any way to view the content of this file?
Vineet Daniel
On Fri, May 7, 2010 at 4:24 PM, David Strauss da...@fourkitchens.com wrote:
On 2010-05-07 10:51,
On 2010-05-07 10:55, Peter Schüller wrote:
What is the benefit of creating a bloom filter when Cassandra writes data? How
does it help?
It allows Cassandra to answer requests for non-existent keys without
going to disk, except in cases where the bloom filter gives a false
positive.
See:
1. Peter said 'without going to disk', so does that mean bloom filters reside in
memory always, or only when a request to that particular CF is made?
2. It is also important for identifying which SSTable files to look inside
even when a key is present. - David can you please throw some more light on
your
On 2010-05-07 11:03, vineet daniel wrote:
2. It is also important for identifying which SSTable files to look inside
even when a key is present. - David can you please throw some more
light on your point, like what are the implications, why do we need to
identify etc.
A bloom filter is almost
+1. There is some disagreement on whether or not the API should
return empty columns or skip rows when no data is found. In all of
our use cases, we would prefer skipped rows. And based on how
frequently new cassandra users appear to be confused about the current
behaviour, this might be a more
Huh? Isn't that the whole point of using Map/Reduce?
On Fri, May 7, 2010 at 8:44 AM, Jonathan Ellis jbel...@gmail.com wrote:
Sounds like you need to configure Hadoop to not create a whole bunch
of Map tasks at once
On Fri, May 7, 2010 at 3:47 AM, gabriele renzi rff@gmail.com wrote:
Hi
There's also the mapred.task.timeout property that can be tweaked. But
reporting is the correct way to fix timeouts during execution.
On May 7, 2010, at 8:49 AM, Joseph Stein wrote:
The problem could be that you are crunching more data than will be
completed within the interval expire
Joseph, the stacktrace suggests that it's Thrift that's timing out,
not the Task.
Gabriele, I believe that your problem is caused by too much load on
Cassandra. Get_range_slices is presently an expensive operation. I
had some success in reducing (although, it turns out, not eliminating)
this
The whole point is to parallelize to use the available capacity across
multiple machines. If you go past that point (fairly easy when you
have a single machine) then you're just contending for resources, not
making things faster.
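As a rough illustration of that point, the sketch below caps how many tasks run at once and records the peak concurrency. It is not Hadoop configuration: `run_capped` and `fake_slice` are hypothetical names, and the sleep merely stands in for an expensive range query against a single node.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_capped(tasks, max_concurrent):
    """Run callables with at most `max_concurrent` in flight; return (results, peak)."""
    lock = threading.Lock()
    active = 0
    peak = 0

    def tracked(task):
        nonlocal active, peak
        with lock:
            active += 1
            peak = max(peak, active)
        try:
            return task()
        finally:
            with lock:
                active -= 1

    # The pool size is the cap: extra tasks queue instead of contending.
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        results = list(pool.map(tracked, tasks))
    return results, peak


def fake_slice(i):
    """Stand-in for an expensive range query (assumption for the example)."""
    time.sleep(0.01)
    return i * i


results, peak = run_capped([lambda i=i: fake_slice(i) for i in range(8)], 2)
```

With the cap set to what the backing machine can actually serve, extra tasks wait in the queue rather than piling more load onto the same node.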
On Fri, May 7, 2010 at 7:48 AM, Joost Ouwerkerk
On Fri, May 7, 2010 at 3:02 PM, Joost Ouwerkerk jo...@openplaces.org wrote:
Joseph, the stacktrace suggests that it's Thrift that's timing out,
not the Task.
Gabriele, I believe that your problem is caused by too much load on
Cassandra. Get_range_slices is presently an expensive operation. I
On Fri, May 7, 2010 at 2:53 PM, Matt Revelle mreve...@gmail.com wrote:
There's also the mapred.task.timeout property that can be tweaked. But
reporting is the correct way to fix timeouts during execution.
re: not reporting, I thought this was not needed with the new mapred
api (Mapper class
On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote:
Sounds like you need to configure Hadoop to not create a whole bunch
of Map tasks at once
Interesting, from a quick check it seems there are a dozen threads running.
Yet setNumMapTasks seems to be deprecated (together
On 5/6/10 3:26 PM, Stu Hood wrote:
Ian: I think that as get_range_slice gets faster, the approach that Mark was
heading toward may be considerably more efficient than reading the old value in
the OutputFormat.
Interesting, I'm trying to understand the performance impact of the
different
You can manage the number of map tasks per node:
mapred.tasktracker.map.tasks.maximum=1
On Fri, May 7, 2010 at 9:53 AM, gabriele renzi rff@gmail.com wrote:
On Fri, May 7, 2010 at 2:44 PM, Jonathan Ellis jbel...@gmail.com wrote:
Sounds like you need to configure Hadoop to not create a whole
It would be great if you could make a video of this event. Yes, it won't be
like being there 1-1, but it sure would help people get up to speed.
On Fri, May 7, 2010 at 6:56 AM, Matt Revelle mreve...@gmail.com wrote:
Reston, VA is a good spot in the DC metro area for tech events. The recent
Pragmatic
On Wed, 2010-05-05 at 11:31 -0700, Ed Anuff wrote:
Follow-up from last week's discussion: I've been playing around with a simple
column comparator for composite column names that I put up on GitHub. I'd
be interested to hear what people think of this approach.
On Fri, May 7, 2010 at 5:29 AM, Joost Ouwerkerk jo...@openplaces.org wrote:
+1. There is some disagreement on whether or not the API should
return empty columns or skip rows when no data is found. In all of
our use cases, we would prefer skipped rows. And based on how
frequently new
I've got two (out of five) nodes on my Cassandra ring that somehow got
too full (e.g. over 60% disk space utilization). I've now added a few
new machines to the ring, but every time one of the overfull nodes
attempts to stream its data it runs out of disk space... I've tried half
a dozen
Columns are sorted (see CompareWith/CompareSubcolumnsWith); keys are not.
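A small sketch of the distinction, under stated assumptions: column names within a row come back in comparator order (bytewise here, roughly what a BytesType-style comparator does), while under a hash-based partitioner row keys are ordered by a hash of the key rather than by the key itself. The functions `sort_columns`, `token`, and `ring_order` are hypothetical helpers, and MD5 is used only to illustrate hash ordering.

```python
import hashlib


def sort_columns(names):
    """Order column names bytewise, as a bytes-style comparator would."""
    return sorted(names, key=lambda n: n.encode("utf-8"))


def token(key):
    """Hypothetical token: the MD5 digest of the row key read as an integer."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest(), "big")


def ring_order(keys):
    """Order row keys by token, i.e. by hash, not by the key text."""
    return sorted(keys, key=token)


columns = sort_columns(["zeta", "alpha", "mu"])
rows = ring_order(["topic-a", "topic-b", "topic-c"])
```

Iterating `rows` gives a stable order, but it generally bears no relation to the alphabetical order of the keys, which is why range scans over keys only look sorted with an order-preserving partitioner.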
On 7 May 2010, at 10:10 PM, AJ Chen wrote:
I have a super column family for topic, key being the name of the topic.
<ColumnFamily Name="Topic" CompareWith="UTF8Type" ColumnType="Super"
CompareSubcolumnsWith="BytesType" />
When I
If you're using RackUnawareStrategy (the default replication strategy)
then you can bootstrap manually fairly easily -- copy all the data
(not system) sstables from an overfull machine to a new machine,
assign the new one a token that gives it about half of the old node's
range, then start it with
thanks, that works. -aj
On Fri, May 7, 2010 at 1:17 PM, Stu Hood stu.h...@rackspace.com wrote:
Your IPartitioner implementation decides how the row keys are sorted: see
http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner . You
need to be using one of the
List,
I have a case where visitors to a site are tracked via a persistent cookie
containing a guid. This cookie is created and set when missing. Some of these
visitors are logged in, meaning a userId may also be available. What I’m
looking to do is have a way to associate each userId with all
I did a lot of comparisons between Voldemort and Cassandra, and in the end I
decided to go with Cassandra. The main reason was recovery and balancing
operations. On the surface Voldemort is s*** hot fast, until you need to
restore a node or add nodes. BDB (the default persistence solution)
Yes. When you flush from BMT, it's like any other SSTable. Cassandra will
merge them through compaction.
That's good news, thanks for clarifying!
A few more related questions:
Are there any problems with issuing the flush command directly from code at
the end of a bulk insert? The BMT example
Hi everyone,
Can anyone shed some light on the benefits of using framed transport over
non-framed transport?
We are trying to sum up some performance tuning approaches for Cassandra in our
project.
Can framed transport be counted among them?
Thanks
2010-05-08
Any reason why you aren't using Lucandra directly?
On Fri, May 7, 2010 at 8:21 PM, Tobias Jungen tobias.jun...@gmail.com wrote:
Greetings,
Started getting my feet wet with Cassandra in earnest this week. I'm
building a custom inverted index of sorts on top of Cassandra, in part
inspired by
Without going into too much depth: Our retrieval model is a bit more
structured than standard lucene retrieval, and I'm trying to leverage that
structure. Some of the terms we're going to retrieve against have high
occurrence, and because of that I'm worried about getting killed by
processing
Got it. I'm working on making term vectors optional and storing just the
frequency in this case. Just FYI.
On Sat, May 8, 2010 at 1:17 AM, Tobias Jungen tobias.jun...@gmail.com wrote:
Without going into too much depth: Our retrieval model is a bit more
structured than standard lucene retrieval, and