Re: Cassandra Demo/Tutorial Applications

2010-03-12 Thread Jonathan Ellis
Also http://aws.amazon.com/publicdatasets/.

On Fri, Mar 12, 2010 at 11:59 PM, Ian Holsman  wrote:
> There are several large data sets on the net you could use to build. Demo
> with.
> Search logs, wikipedia, uk govt stuff
> Dbpedia may be interesting as they have some of the stuff extracted out
>
>
> ---
> Sent from my phone
> Ian Holsman - 703 879-3128
>
> On 13/03/2010, at 4:46 PM, Jonathan Ellis  wrote:
>
>> On Fri, Mar 12, 2010 at 1:55 PM, Krishna Sankar 
>> wrote:
>>>
>>> I was looking at this from CASSANDRA-873 as well as hands-on homework (!)
>>> for my OSCON tutorial. Have a couple of questions. Would appreciate
>>> insights:
>>>
>>> A)  Cassandra-873 suggests Lucandra as one demo application
>>> B)  Are there other ideas that will bring out the various aspects of
>>> Cassandra ?
>>
>> multi-user blog (single-user is too easy :)
>> - extra credit: with full-text search using lucandra
>>
>> discussion forum
>> - also w/ FTS
>>
>>> C)  What would be the goal of demo apps ? Tutorial to help folks learn
>>> the
>>> ins and outs of Cassandra ? Show case capabilities ? I think
>>> Cassandra-873
>>> belongs to the latter; Twissandra most probably belongs to the former.
>>
>> I think you nailed it.
>>
>>> D)  Hadoop on Cassandra might be a good demo/tutorial
>>
>> Sure, I'll buy that.
>>
>> I can't think of any standalone projects for that, but "compute a
>> twissandra tag cloud" would be pretty cool.  (Might need to write a
>> twissandra bot to load stuff in to make an interesting cloud. :)
>>
>>> E)  How would one structure the infrastructure for the demo/tutorials ?
>>> What
>>> assumptions can we make in creating them ? As AMIs to be run in EC2 ?
>>
>> I'd probably go with "virtualbox images" as being simpler for people
>> who don't have an AWS key already.  (VB can read vmware player images,
>> i think.  But there is no free vmware for OS X, so you'd want to check
>> that before going w/ vmware format.)
>>
>> Or just have people d/l cassandra and a configuration xml.  Probably
>> easier than teaching people to use virtualbox who haven't before.
>>
>>> Also
>>> to be run on 2-3 local machines for folks who can spare some ? Or as
>>> multiple processes - all in one machine ?
>>
>> You're not going to have time to teach cluster management.  Keep it to 1.
>


Re: Cassandra Demo/Tutorial Applications

2010-03-12 Thread Jonathan Ellis
On Fri, Mar 12, 2010 at 1:55 PM, Krishna Sankar  wrote:
> I was looking at this from CASSANDRA-873 as well as hands-on homework (!)
> for my OSCON tutorial. Have a couple of questions. Would appreciate insights:
>
> A)  Cassandra-873 suggests Lucandra as one demo application
> B)  Are there other ideas that will bring out the various aspects of
> Cassandra ?

multi-user blog (single-user is too easy :)
 - extra credit: with full-text search using lucandra

discussion forum
 - also w/ FTS

> C)  What would be the goal of demo apps ? Tutorial to help folks learn the
> ins and outs of Cassandra ? Show case capabilities ? I think Cassandra-873
> belongs to the latter; Twissandra most probably belongs to the former.

I think you nailed it.

> D)  Hadoop on Cassandra might be a good demo/tutorial

Sure, I'll buy that.

I can't think of any standalone projects for that, but "compute a
twissandra tag cloud" would be pretty cool.  (Might need to write a
twissandra bot to load stuff in to make an interesting cloud. :)

> E)  How would one structure the infrastructure for the demo/tutorials ? What
> assumptions can we make in creating them ? As AMIs to be run in EC2 ?

I'd probably go with "virtualbox images" as being simpler for people
who don't have an AWS key already.  (VB can read vmware player images,
i think.  But there is no free vmware for OS X, so you'd want to check
that before going w/ vmware format.)

Or just have people d/l cassandra and a configuration xml.  Probably
easier than teaching people to use virtualbox who haven't before.

> Also
> to be run on 2-3 local machines for folks who can spare some ? Or as
> multiple processes - all in one machine ?

You're not going to have time to teach cluster management.  Keep it to 1.
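The "download Cassandra plus a configuration xml" option needs little more than a trimmed stock config. Below is a hedged sketch of a minimal single-node storage-conf.xml using 0.6-era element names; verify against the sample file shipped in conf/ for the release you target, since these elements changed between versions.

```xml
<!-- Minimal single-node tutorial config (0.6-era element names; check
     the shipped sample storage-conf.xml before relying on these). -->
<Storage>
  <ClusterName>Tutorial Cluster</ClusterName>
  <Partitioner>org.apache.cassandra.dht.RandomPartitioner</Partitioner>
  <ListenAddress>localhost</ListenAddress>
  <ThriftAddress>localhost</ThriftAddress>
  <Seeds>
    <Seed>127.0.0.1</Seed>
  </Seeds>
  <Keyspaces>
    <Keyspace Name="Keyspace1">
      <ColumnFamily Name="Standard1" CompareWith="BytesType" />
      <ReplicaPlacementStrategy>org.apache.cassandra.locator.RackUnawareStrategy</ReplicaPlacementStrategy>
      <ReplicationFactor>1</ReplicationFactor>
    </Keyspace>
  </Keyspaces>
</Storage>
```

With ReplicationFactor 1 and a single 127.0.0.1 seed, attendees never have to think about ring membership at all.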


Re: Re: we did some tests on Cassandra, but the results puzzled us

2010-03-12 Thread Jonathan Ellis
yes, this is a single-threaded benchmark so if getRandomRow is slow at
all, it is going to skew the hell out of your results :)

2010/3/12 Bingbing Liu :
> The only difference between the sequential and random test code is how the
> key of each record is generated.
>
> the test code is :
>
>
> long totalSWriteTime = 0;
> for (int i = 0; i < totalRows; i++) {
>     byte[] key = dg.getRandomRow(); // for the sequential-write test, i is used as the key
>     byte[] data = dg.generateValue();
>     long start = System.currentTimeMillis();
>     client.insert("Keyspace1", new String(key),
>                   new ColumnPath("Standard1", null, "data".getBytes("UTF-8")),
>                   data, timestamp, ConsistencyLevel.ONE);
>     totalSWriteTime += (System.currentTimeMillis() - start);
>     if (i % 1 == 0) {
>         System.out.println("Has write " + i);
>     }
> }
>
> is there something wrong?
> 2010-03-12
>
>
>
> Bingbing Liu
>
>
>
> From: Jonathan Ellis
> Sent: 2010-03-12 13:40:40
> To: cassandra-dev
> Cc:
> Subject: Re: we did some tests on Cassandra, but the results puzzled us
>
> why reads are slower than writes:
> http://wiki.apache.org/cassandra/FAQ#reads_slower_writes
> no idea on seq vs random.  i would not be surprised if there is a bug
> in your test code.
> On Fri, Mar 12, 2010 at 12:36 AM, Bingbing Liu  wrote:
>> We did some tests on Cassandra; the benchmark is from Section 7 of the
>> BigTable paper "Bigtable: A Distributed Storage System for Structured Data".
>> The benchmark tasks include random write, random read, sequential write,
>> and sequential read. The test results puzzled us. We used a cluster of 5
>> nodes (each node has a 4-core CPU and 4 GB of memory). The test data is a
>> table with 4,000,000 records, each of which is 1000 bytes. The test results
>> are as follows:
>> Sequential write:  875124 ms
>> Sequential read:  1972588 ms
>> Random read:  43331738 ms
>> Random write:  20193484 ms
>> We wondered why sequential writes are so much faster than sequential
>> reads, and why sequential writes are so much faster than random writes.
>> We thought reads should be faster than writes, but the results are just
>> the opposite. Would you please give us some explanation? Thanks a lot!
>>
>> 2010-03-12
>>
>>
>>
>> Bingbing Liu
>>
>
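Jonathan's warning about single-threaded benchmarks can be made concrete with a small self-contained sketch (Python used for illustration; `slow_random_key` and `fast_insert` are hypothetical stand-ins for `getRandomRow()` and `client.insert(...)`). Work done outside the timed call never shows up in the summed per-insert time, yet it still gates how long the whole run takes, because a single thread does everything serially.

```python
import time

def run_benchmark(total_rows, gen_key, insert):
    """Time only the insert call, as the test code above does."""
    total_insert = 0.0
    wall_start = time.perf_counter()
    for _ in range(total_rows):
        key = gen_key()                       # excluded from per-op timing
        start = time.perf_counter()
        insert(key)
        total_insert += time.perf_counter() - start
    wall_time = time.perf_counter() - wall_start
    return total_insert, wall_time

def slow_random_key():
    time.sleep(0.001)                         # stands in for a slow getRandomRow()
    return b"key"

def fast_insert(key):                         # stands in for client.insert(...)
    pass

insert_time, wall_time = run_benchmark(50, slow_random_key, fast_insert)
# insert_time is near zero while wall_time is at least ~50 ms: the untimed
# key generation, not the insert, is what actually limits throughput.
```

If the measured numbers above came from a loop like this, the sequential/random gap could be partly an artifact of key generation cost rather than server behavior; a multi-threaded driver makes the comparison far less sensitive to that.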


Re: we did some tests on Cassandra, but the results puzzled us

2010-03-11 Thread Jonathan Ellis
why reads are slower than writes:
http://wiki.apache.org/cassandra/FAQ#reads_slower_writes

no idea on seq vs random.  i would not be surprised if there is a bug
in your test code.

On Fri, Mar 12, 2010 at 12:36 AM, Bingbing Liu  wrote:
> We did some tests on Cassandra; the benchmark is from Section 7 of the
> BigTable paper "Bigtable: A Distributed Storage System for Structured Data".
> The benchmark tasks include random write, random read, sequential write,
> and sequential read. The test results puzzled us. We used a cluster of 5
> nodes (each node has a 4-core CPU and 4 GB of memory). The test data is a
> table with 4,000,000 records, each of which is 1000 bytes. The test results
> are as follows:
> Sequential write:  875124 ms
> Sequential read:  1972588 ms
> Random read:  43331738 ms
> Random write:  20193484 ms
> We wondered why sequential writes are so much faster than sequential
> reads, and why sequential writes are so much faster than random writes.
> We thought reads should be faster than writes, but the results are just
> the opposite. Would you please give us some explanation? Thanks a lot!
>
> 2010-03-12
>
>
>
> Bingbing Liu
>


Re: thinking about dropping hinted handoff

2010-03-10 Thread Jonathan Ellis
Read-only for a specific client is completely different from trying to
read-only the entire node / cluster.  So no, nothing wrong with that.

2010/3/10 Ted Zlatanov :
> On Fri, 26 Feb 2010 08:18:49 -0600 Ted Zlatanov  wrote:
>
> TZ> On Tue, 23 Feb 2010 12:30:52 -0600 Ted Zlatanov  wrote:
>> Can a Cassandra node be made read-only (as far as clients know)?
>
> TZ> I realized I have another use case for read-only access besides backups:
>
> TZ> On our network we have Cassandra readers, writers, and analyzers
> TZ> (read+write).  The writers and analyzers can run anywhere.  The readers
> TZ> can run anywhere too.  I don't want the readers to have write access but
> TZ> they should be able to read all keyspaces.
>
> TZ> I think the best way to solve this is with an IAuthenticator change to
> TZ> distinguish between full permissions and read-only permissions.  Then
> TZ> the Thrift API has to be modified to check for write access in only some
> TZ> functions:
> ...
> TZ> Does this seem reasonable?
>
> Any comments, while we're discussing authentication?  I think read-only
> access makes a lot of sense in this context.
>
> Ted
>
>


Re: Re: how to do the filter in Cassandra?

2010-03-10 Thread Jonathan Ellis
https://issues.apache.org/jira/browse/CASSANDRA-764 added about the
simplest possible predicate, so if you look at the diff for that you
should have a pretty good idea where the magic happens.

On Tue, Mar 9, 2010 at 11:58 PM, Bingbing Liu  wrote:
> If I want to improve Cassandra by adding a filter, which part of the code
> should be modified?
>
> (I have tried to read the code, but since Cassandra is such a big system, I
> failed.)
>
> maybe someone can guide me ?
>
>
> 2010-03-10
>
>
>
> Bingbing Liu
>
>
>
> From: Jonathan Ellis
> Sent: 2010-03-09 21:24:46
> To: cassandra-dev
> Cc:
> Subject: Re: how to do the filter in Cassandra?
>
> If you mean WHERE clause-like filtering, that's always done client
> side right now.
> On Tue, Mar 9, 2010 at 1:00 AM, Bingbing Liu  wrote:
>> hi,
>> I mean, when Cassandra gets data from the file system of each node (in
>> other words, reads data from a file),
>>
>> does the filtering condition also get transferred to the node?
>>
>> and if so, which part of the code does this job?
>>
>> thanks
>> 2010-03-09
>>
>>
>>
>> Bingbing Liu
>>
>
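For reference, "client side" filtering just means fetching a slice and applying the predicate locally. A sketch of the idea (Python; the row shape here is a hypothetical simplification, not the actual Thrift structs):

```python
# Client-side "WHERE" filtering: fetch a slice of rows, then filter locally.
def filter_rows(rows, predicate):
    """rows: iterable of (key, columns) pairs; predicate inspects the columns."""
    return [(key, cols) for key, cols in rows if predicate(cols)]

# Hypothetical rows, as a client might assemble them from a range slice.
rows = [
    ("user1", {"age": 30}),
    ("user2", {"age": 17}),
    ("user3", {"age": 45}),
]
adults = filter_rows(rows, lambda cols: cols["age"] >= 18)  # "WHERE age >= 18"
```

Predicate push-down, which CASSANDRA-764 started in its simplest form, moves that test onto the node reading the SSTables, so rejected rows never cross the network.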


Re: Latest svn code

2010-03-10 Thread Jonathan Ellis
Both.

The latest 0.6 code is in the 0.6 branch.

The latest trunk code (will become 0.7) is in trunk.

Trunk is in "breaking stuff" mode right now.

On Wed, Mar 10, 2010 at 9:50 AM, David Dabbs  wrote:
> Hi. Is the latest code in trunk or the 0.6 branch?
>
> Thanks,
>
> david
>
>
>
>


Re: Further enhancements in j.a.c.auth

2010-03-10 Thread Jonathan Ellis
The latter.

On Wed, Mar 10, 2010 at 2:33 AM, Morten Wegelbye Nissen  wrote:
> Jonathan Ellis wrote:
>>
>> We should probably use http://www.mindrot.org/projects/jBCrypt/.
>> (Lots of background:
>>
>> http://chargen.matasano.com/chargen/2007/9/7/enough-with-the-rainbow-tables-what-you-need-to-know-about-s.html)
>>
>> We kind of have a nagging feeling though that rolling our own auth
>> framework in 2010 is the wrong approach.
>>
>
> Makes me wonder whether there is actually a roadmap somewhere, or is it just
> the mailing archive?
>
> ./Morten
>


Re: Further enhancements in j.a.c.auth

2010-03-09 Thread Jonathan Ellis
We should probably use http://www.mindrot.org/projects/jBCrypt/.
(Lots of background:
http://chargen.matasano.com/chargen/2007/9/7/enough-with-the-rainbow-tables-what-you-need-to-know-about-s.html)

We kind of have a nagging feeling though that rolling our own auth
framework in 2010 is the wrong approach.
http://en.wikipedia.org/wiki/Simple_Authentication_and_Security_Layer
has been mentioned as an alternative.

The ML is the appropriate place for this, yes. :)

On Tue, Mar 9, 2010 at 3:42 PM, Morten Wegelbye Nissen  wrote:
> Hi All,
>
> In SimpleAuthenticator it's possible to configure passwords to be stored as
> MD5 sums. For anyone serious about security there are two problems here:
> MD5 is broken[1].
> No salt is added to the clear value, which means that if two users choose
> the same password, the encoded values will be the same.
> I suggest that someone add support for an alternative hashing algorithm, and
> that the hash be calculated with some prefix (the username, maybe).
>
> I know the present scheme is better than having the passwords in cleartext.
> But when a user chooses to enable password hashing, it's for a reason, and
> there is no reason to jump into the common security pitfalls :)
>
> By the way, is it against protocol to raise these kinds of questions on this
> mailing list? Or should they go somewhere else?
>
> ./Morten
>
> [1] http://en.wikipedia.org/wiki/MD5   (Back in 1995 it was recommended not
> to base further security on md5)
>
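Morten's two fixes, a stronger hash and a per-user salt, can be sketched in a few lines. SHA-256 from the standard library is used here purely to keep the example self-contained; a real implementation should use bcrypt/jBCrypt as suggested above, since any fast hash remains cheap to brute-force even when salted.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Salted hash: two users with the same password get different digests."""
    salt = salt if salt is not None else os.urandom(16)   # per-user random salt
    digest = hashlib.sha256(salt + password.encode("utf-8")).digest()
    return salt, digest

def verify(password, salt, expected):
    # Constant-time comparison avoids leaking digest prefixes via timing.
    return hmac.compare_digest(hash_password(password, salt)[1], expected)

salt_a, hash_a = hash_password("hunter2")
salt_b, hash_b = hash_password("hunter2")
# Same password, different salts: hash_a != hash_b, so identical passwords
# are no longer visible in the stored credentials.
```

The salt must be stored alongside the digest; it is not a secret, it only defeats precomputed rainbow tables and cross-user comparison.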


Re: how to do the filter in Cassandra?

2010-03-09 Thread Jonathan Ellis
If you mean WHERE clause-like filtering, that's always done client
side right now.

On Tue, Mar 9, 2010 at 1:00 AM, Bingbing Liu  wrote:
> hi,
> I mean, when Cassandra gets data from the file system of each node (in
> other words, reads data from a file),
>
> does the filtering condition also get transferred to the node?
>
> and if so, which part of the code does this job?
>
> thanks
> 2010-03-09
>
>
>
> Bingbing Liu
>


Re: seqid_ in Cassandra.Client

2010-03-02 Thread Jonathan Ellis
We are excited about Avro, but it will be a couple more releases before
it's usable.

-Jonathan

On Tue, Mar 2, 2010 at 10:17 AM, David Dabbs  wrote:
> Thanks. I'll check in with the Thrift team. I see there's an avro client as
> well.
> Is Avro the direction in which Cassandra is headed?
>
> Thanks,
>
> david
>
>> -Original Message-
>> From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> Sent: Tuesday, March 02, 2010 10:02 AM
>> To: cassandra-dev@incubator.apache.org
>> Subject: Re: seqid_ in Cassandra.Client
>>
>> org.apache.cassandra.thrift.* (in 0.6) or .service (in 0.5) is
>> autogenerated by Thrift.  We try not to mess with the Thrift compiler
>> except to fix bugs, but you're welcome to take a stab at it. :)
>>
>> https://issues.apache.org/jira/browse/THRIFT
>>
>> On Tue, Mar 2, 2010 at 9:58 AM, David Dabbs  wrote:
>> > Hello. The seqid_ in Cassandra.Client appears to be unused. Is this
>> > vestigial or part of some work-in-progress?
>> >
>> > Thanks,
>> >
>> > david
>> >
>> >
>
>


Re: seqid_ in Cassandra.Client

2010-03-02 Thread Jonathan Ellis
org.apache.cassandra.thrift.* (in 0.6) or .service (in 0.5) is
autogenerated by Thrift.  We try not to mess with the Thrift compiler
except to fix bugs, but you're welcome to take a stab at it. :)

https://issues.apache.org/jira/browse/THRIFT

On Tue, Mar 2, 2010 at 9:58 AM, David Dabbs  wrote:
> Hello. The seqid_ in Cassandra.Client appears to be unused. Is this
> vestigial or part of some work-in-progress?
>
> Thanks,
>
> david
>
>


Re: Commit log changes in 0.7

2010-02-26 Thread Jonathan Ellis
Remember that the header is per-segment.  So I would just say that CF
modification forces a new segment creation.

We can support additions and deletions just by doing that, so that
should probably be the first milestone.  Renames are harder because
currently we map id -> name -> files on disk.  We'd need to change
that to id -> files on disk, id -> sequence of names and when they
were in effect for HH.  (Maybe this is the straw that breaks the HH
camel's back?)

-Jonathan

On Fri, Feb 26, 2010 at 11:22 AM, Gary Dusbabek  wrote:
> The commit log currently has a header section whose size is constant,
> but is a function of the total number of defined column families.
> This isn't going to work with CASSANDRA-44 where CFs can be added or
> removed on the fly.  I see two options:
>
> 1.  hard code the size of the header to accommodate a maximum number
> of CFs.  I don't like this approach because it imposes a hard limit.
> 2.  move the header into its own file.  I don't like this because it
> adds complexity.
>
> Any other, better ideas?
>
> Some background:  the data stored in the header indicates a) which CFs
> are 'dirty' and b) log replay offsets for each CF.
>
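Jonathan's suggestion is that the header stays per-segment and fixed for that segment's lifetime, so any CF addition or removal simply forces a roll to a new segment. The bookkeeping can be sketched as a toy model (Python; all names are hypothetical, and real segment headers are fixed-size on-disk records, not dicts):

```python
# Toy model of per-segment commit log headers: each segment records which
# CFs existed when it was created, which are dirty, and the replay offset.
class Segment:
    def __init__(self, cf_ids):
        self.cf_ids = frozenset(cf_ids)   # fixed for this segment's lifetime
        self.dirty = {}                   # cf_id -> first replay offset

    def mark_dirty(self, cf_id, offset):
        assert cf_id in self.cf_ids       # unknown CFs can never appear here
        self.dirty.setdefault(cf_id, offset)

class CommitLog:
    def __init__(self, cf_ids):
        self.segments = [Segment(cf_ids)]

    def modify_schema(self, new_cf_ids):
        # Adding or dropping a CF forces a new segment with the new CF set;
        # older segments keep their original, now-frozen headers.
        self.segments.append(Segment(new_cf_ids))

    def write(self, cf_id, offset):
        self.segments[-1].mark_dirty(cf_id, offset)

log = CommitLog({1, 2})
log.write(1, 100)
log.modify_schema({1, 2, 3})   # CF 3 added: roll segment, old one untouched
log.write(3, 0)
```

Because a segment's CF set never changes after creation, the fixed-size-header problem disappears without a separate header file or a hard CF limit.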


Re: consistent backups

2010-02-25 Thread Jonathan Ellis
Go ahead.

2010/2/25 Ted Zlatanov :
> On Thu, 25 Feb 2010 08:22:38 -0600 Jonathan Ellis  wrote:
>
> JE> 2010/2/25 Ted Zlatanov :
>>> I want a consistent backup.
>
> JE> You can get an "eventually consistent backup" by flushing all nodes
> JE> and snapshotting; no individual node's backup is guaranteed to be
> JE> consistent but if you restore from that snapshot then clients will get
> JE> eventually consistent behavior as usual.
>
> JE> Other than that there is no such thing as a "consistent view of the
> JE> data" in the strict sense, except in the trivial case of writes with
> JE> CL.ALL.
>
> That makes perfect sense, thanks for explaining.  Can the explanation be
> part of the http://wiki.apache.org/cassandra/Operations section on
> backups?  I'll submit the edit if you want.
>
> Thanks
> Ted
>
>


Re: consistent backups (was: thinking about dropping hinted handoff)

2010-02-25 Thread Jonathan Ellis
2010/2/25 Ted Zlatanov :
> I want a consistent backup.

You can get an "eventually consistent backup" by flushing all nodes
and snapshotting; no individual node's backup is guaranteed to be
consistent but if you restore from that snapshot then clients will get
eventually consistent behavior as usual.

Other than that there is no such thing as a "consistent view of the
data" in the strict sense, except in the trivial case of writes with
CL.ALL.

-Jonathan


Re: [VOTE] Release 0.5.1

2010-02-23 Thread Jonathan Ellis
+1

On Tue, Feb 23, 2010 at 5:27 PM, Eric Evans  wrote:
>
> There have been some important bug fixes[1] in the 0.5 branch since we
> released. Plus, I thought it would be cool to conduct two separate
> release votes at the time. :)
>
> I propose the following tag and artifacts for 0.5.1:
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.1
> 0.5.1 artifacts: http://people.apache.org/~eevans
>
> +1 from me.
>
>
> [1]
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.1/CHANGES.txt
>
> --
> Eric Evans
> eev...@rackspace.com
>
>
>
>


Re: thinking about dropping hinted handoff

2010-02-23 Thread Jonathan Ellis
2010/2/23 Ted Zlatanov :
> JE> because in a masterless environment there is no way to tell "when it's over"
>
> Would it work to use an external agent?  It can get the list of nodes,
> make them all read-only, then wait until every node reports no write
> activity through JMX.

At that point I'd say you're deeply into "cure worse than the disease"
territory :)


Re: thinking about dropping hinted handoff

2010-02-23 Thread Jonathan Ellis
2010/2/23 Ted Zlatanov :
> You're welcome.  I don't understand why it doesn't help reach
> consistency, though.  If you turn all the nodes in a cluster read-only
> at the API level, what can make them inconsistent besides inter-node
> traffic and scheduled writes?  I'd assume that activity will die down
> eventually; can Cassandra tell a monitoring agent through JMX when it is
> over?

because in a masterless environment there is no way to tell "when it's over"


Re: thinking about dropping hinted handoff

2010-02-23 Thread Jonathan Ellis
2010/2/23 Ted Zlatanov :
>>> Can a Cassandra node be made read-only (as far as clients know)?
>
> JE> no.
>
> Is there value (for reaching consistency) in adding that functionality?

No.

Thanks for the easy questions today. :)

-Jonathan


Re: Tests ... large-volume, long-running, failure-case

2010-02-23 Thread Jonathan Ellis
contrib/py_stress is our standard performance tool.

I think contrib/ is only in the source distro.

On Tue, Feb 23, 2010 at 2:47 PM, Masood Mortazavi wrote:
> Hi folks,
>
> Besides the regression tests described in "How to Contribute," are there
> performance, large-data-volume, long-running or failure-case
> (regression-type or otherwise) tests, either in the binary or source
> distribution?
>
> (For now, I'm not sure whether the regression tests described in 'How to
> Contribute" actually cover any of these other test areas.)
>
> Regards,
> - m.
>
> P.S.
> As a side question, now that Cassandra has been promoted to a full-fledged
> project, will this alias now move to cassandra-...@apache.org?
>


Re: [VOTE] Release 0.6.0-beta2

2010-02-23 Thread Jonathan Ellis
+1

On Tue, Feb 23, 2010 at 1:36 PM, Eric Evans  wrote:
>
> There has been a lot of cool work done since our last release[1] (0.5.0),
> and while there's still a bit more to be done[2], the dust is settling on
> all of the big stuff. Now seems like a good time for a beta release.
>
> I propose the following tag and artifacts for 0.6.0-beta2[3]:
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.6.0-beta2
> 0.6.0-beta2 artifacts: http://people.apache.org/~eevans
>
> +1 from me.
>
>
> [1] 
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.6.0-beta2/CHANGES.txt
> [2] https://issues.apache.org/jira/browse/CASSANDRA/fixforversion/12314361
> [3] beta1 was an aborted attempt so we're jumping straight to beta2,
> sorry about that.
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: thinking about dropping hinted handoff

2010-02-23 Thread Jonathan Ellis
no.

2010/2/23 Ted Zlatanov :
> On Mon, 22 Feb 2010 21:12:58 +0100 Peter Schüller  wrote:
>
> PS> In general, what are people's thoughts on the appropriate mechanism to
> PS> gain confidence that the cluster as a whole is reasonably consistent?
> PS> In particular in relation to performing maintenance that may require
> PS> popping nodes in and out in some kind of rolling fashion.
>
> Can a Cassandra node be made read-only (as far as clients know)?
>
> Ted
>
>


Re: thinking about dropping hinted handoff

2010-02-22 Thread Jonathan Ellis
On Mon, Feb 22, 2010 at 6:57 PM, Ryan King  wrote:
> I think I find it more compelling because we're currently experiencing
> pain related to HH. I'd be ok with keeping it as long as we can make
> the effects of a node down be less drastic.

Can you open a ticket and tag it 0.6?  I think I can implement #1
easily.  If I am wrong I will push to 0.7.

-Jonathan


Re: thinking about dropping hinted handoff

2010-02-22 Thread Jonathan Ellis
On Mon, Feb 22, 2010 at 6:56 PM, Ryan King  wrote:
> Maybe I mis-read the code, but I thought it was triggered for every compaction.

Full compactions is correct, at least that is the correct design. :)

-Jonathan


Re: thinking about dropping hinted handoff

2010-02-22 Thread Jonathan Ellis
On Mon, Feb 22, 2010 at 1:53 PM, Ryan King  wrote:
> So, after having some more experience with HH, I've reformed my
> opinion. I think we have 3 options:
>
> 1. Make the natural endpoints responsible for the hints.
> 2. Make a random node responsible for hints.
> 3. Get rid of HH.
>
> #1 reduces the "surprising effects in a small cluster" problem by
> adding a marginal amount of resource demands to nodes that already
> have the data we need.
>
> #2 will spread the load out. We had a node die last week and decided
> to leave it down so that we could learn about the effects of this
> situation. We eventually ended up killing the next node on the ring
> with all the hints (I think there are some improvements to this in 0.6,
> but I don't know if they'll be enough). So, even on a large cluster
> (ours is currently 45 nodes), HH can have surprising effects on nodes
> that neighbor a node that's down. Picking either a random node or
> using the coordinator node for the hint would spread the load out.
>
> #3 is, I think, the right answer. It make our system simpler and it
> makes the behavior in failure conditions more predictable and safe.

This is a good summary of the options.

Why do you find 3 more compelling than 1?  Yes, it's simpler, but 1
would not require a large change to the existing code, so perhaps we
need a better case than that to justify removing a feature that
already (mostly) works.

-Jonathan


http://wiki.apache.org/cassandra/Improving%20Initial%20User%20Experiance

2010-02-21 Thread Jonathan Ellis
shouldn't a list of to-dos be in jira?


Re: Building Cassandra from Behind a Proxy

2010-02-19 Thread Jonathan Ellis
Can you add this to http://wiki.apache.org/cassandra/HowToContribute ?

(that page looks like it could use a little refactoring)

On Thu, Feb 18, 2010 at 5:21 PM, Gary Dusbabek  wrote:
> I found this:
>
> set ANT_OPTS=-Dhttp.proxyHost=myproxy -Dhttp.proxyPort=3128
>
> from here:  http://www.jaya.free.fr/ivy/faq.html
>
> I have no idea how reliable it is, but please give it a try.
>
> Gary.
>
>
> On Thu, Feb 18, 2010 at 15:50, Masood Mortazavi wrote:
>> How does one build Cassandra from behind a proxy?
>>
>> I just downloaded Cassandra source code and tried to build it locally -- all
>> from behind a firewall.
>>
>> Download ("svn checkout") succeeded with a proper configuration of [global]
>> proxy settings in the "servers" file in .subversion directory.
>>
>> However, the build -- again from behind the proxy -- fails when it tries to
>> download ivy-2.1.0.jar.
>>
>> It is not clear from this fragment of build.xml:
>>
>>     <target name="..." unless="...">
>>       <echo>Downloading Ivy...</echo>
>>       <get src="..."
>>            dest="${build.dir}/ivy-${ivy.version}.jar" usetimestamp="true" />
>>     </target>
>>
>> whether a http proxy server / port number / username / password can be set
>> in the build script configuration, some place, for the purpose\ of usage
>> when attempting the download target. (I cannot imagine such a facility not
>> be available.)
>>
>> If there's no such configuration, I'm assuming I can download
>> ivy-2.1.0.jar "by hand" and put it in the build directory, and set
>> "ivy.jar.exists" to true.
>>
>> Thanks,
>> - m.
>>
>


Re: 0.6, 0.7, and the future

2010-02-18 Thread Jonathan Ellis
On Thu, Feb 18, 2010 at 2:49 PM, Ryan King  wrote:
> On Thu, Feb 18, 2010 at 11:45 AM, Anthony Molinaro wrote:
>> +1
>> (although I'm dreading the export from old sstables into new sstables;
>> any idea how fast that might be? And will a tool for doing this be
>> provided, or is it sstable2json and json2sstable?)
>
> I'm not entirely sure of the plans here, but I think the idea is that
> any release of cassandra going forward is going to have to be able to
> deal with multiple versions of SSTable formats.
>
> I propose that when we change the format in a release we don't
> proactively rewrite the tables; just keep the old tables as-is and
> only write with the new format when we're already going to be writing
> a table (flushing or compacting). That way you can have a smooth
> upgrade if you want, or if you want to force the upgrade, you just
> trigger a major compaction.

This is my preference as well.  We'll see if we can pull it off. :)

-Jonathan
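Ryan's lazy-upgrade proposal amounts to "keep a reader for every on-disk version, but only ever write the newest"; old tables then disappear naturally as flushes and compactions rewrite them. A toy sketch (Python; the per-version parsers and the comma/semicolon formats are invented purely for illustration):

```python
# Per-version readers: each on-disk format keeps a parser forever.
READERS = {
    1: lambda blob: blob.split(b","),   # hypothetical "version 1" row format
    2: lambda blob: blob.split(b";"),   # hypothetical "version 2" row format
}
CURRENT_VERSION = 2

def read_table(version, blob):
    """Dispatch to the reader matching the table's on-disk version."""
    return READERS[version](blob)

def compact(tables):
    """tables: list of (version, rows). Output is always the newest format,
    so a major compaction doubles as a format upgrade."""
    merged = sorted({row for _, rows in tables for row in rows})
    return CURRENT_VERSION, merged

rows = read_table(1, b"b,a")                      # old-format table still readable
version, merged = compact([(1, [b"b", b"a"]), (2, [b"c"])])
```

With this shape, a smooth upgrade needs no offline conversion step: operators who want to force it just trigger a major compaction, exactly as proposed above.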


Re: 0.6, 0.7, and the future

2010-02-17 Thread Jonathan Ellis
On Wed, Feb 17, 2010 at 5:26 PM, David Strauss  wrote:
> How dependent are the 0.8 high-level queries on the 0.7 internal changes?

It will definitely be affected by the String -> byte[] change with
everything else.  It also may benefit from being able to have more
levels of subcolumns than the 1 we offer now.  Hard to say since we
don't really know what it's going to look like yet.

-Jonathan


0.6, 0.7, and the future

2010-02-17 Thread Jonathan Ellis
We're looking at branching 0.6 today and starting 0.7 work.

0.6 shaped up to be a really nice follow-up to 0.5, where we improved
just about everything while keeping the upgrade path super easy.  (We
changed the network around again, but no disk changes, so it's just
going to be shutdown-and-restart.)  Client APIs are 100% compatible,
with the exception of get_key_range, which we had tagged as deprecated
in 0.5 for removal in the next release.

So to recap recent history, we went from 0.4 to 0.5 to 0.6 without
major client level API changes.  I think that is an excellent record
for where we were when we were releasing 0.4.

0.7 is probably going to be a little more painful, after which we hope
to have things stable for another few releases, at the least.

Tickets currently tagged 0.7 are here:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310865&fixfor=12314533

They are a good mix of
 - fundamental internals changes that we have been putting off so far
(#674, #16, and friends)
 - stuff that we really really want to make ops better (#44)
 - pie in the sky new features (#749)
 - incremental improvements to what we already have

The primary pain source from the client perspective is going to be the
internals changes, particularly moving row keys from String to byte[].
But it's a change we've known we need to make, and I think it's time
to bite the bullet.

Also, if we were to execute on all the tickets there, 0.7 would be
this huge monster release that would take like 6 months to get out.  I
think that's too long.  Shipping is feature #1 at this stage; I'm
really scared of biting off too much and losing weeks or months to
that.

So what I'd like to propose is making 0.7 primarily about the
internals changes and push for high-level queries in 0.8, where both
of those hit our usual ~3 month release cycle.  I don't think it makes
sense to do those the other way around; introducing new APIs that we
already know we need to break just seems mean. :)

-Jonathan


Re: loadbalance and different strategies

2010-02-09 Thread Jonathan Ellis
On Tue, Feb 9, 2010 at 10:22 PM, Jaakko  wrote:
> Yes, that is of course true. However, I don't think this modification
> would make the algorithm much less simple. We still consider the most
> loaded node only, but take into account which DC the node is in.
> Without that extra step, loadbalance only works for rack unaware. If
> we make this change, nothing would change for rack unaware, but for
> other strategies things would be better, I think.

Let's give it a try!

-Jonathan


Re: loadbalance and different strategies

2010-02-09 Thread Jonathan Ellis
On Tue, Feb 9, 2010 at 9:45 PM, Jonathan Ellis  wrote:
> That seems reasonable, although it feels a little weird for X to as G
> for a token and be given one that G isn't the primary for.

"for X to ask* G"


Re: loadbalance and different strategies

2010-02-09 Thread Jonathan Ellis
On Tue, Feb 9, 2010 at 6:12 PM, Jaakko  wrote:
> Let us suppose that all ranges are equal in size. In this case G's
> range is A-G. If X boots in G's DC, it should take a token in the
> middle of this range, which would be somewhere around D. If X boots
> behind D

Ah, I see, you are saying, "G has replicas from A-G, so really it
should take part of E's range instead of G's."

That seems reasonable, although it feels a little weird for X to as G
for a token and be given one that G isn't the primary for.

> Yes, alternating the nodes is certainly the best. However, two DCs
> don't always have the same number of nodes. Also, currently
> loadbalance is unusable in such environment.

You're always going to have situations where a simple algorithm does
the "wrong" thing though, which is why we leave the raw move command
exposed.

-Jonathan


Re: loadbalance and different strategies

2010-02-09 Thread Jonathan Ellis
On Tue, Feb 9, 2010 at 6:40 PM, Coe, Robin  wrote:
> Am I correct in assuming that a node given the flush command will not accept 
> new writes

No.  It's not designed to do that.


Re: loadbalance and different strategies

2010-02-09 Thread Jonathan Ellis
On Tue, Feb 9, 2010 at 3:13 AM, Jaakko  wrote:
> What they probably should do, is to just
> consider nodes in the DC they are booting to, and try to balance load
> evenly in that DC.

I'm not sure what problem that would solve.  It seems to me there are two goals:

 1. don't transfer data across data centers
 2. improve ring balance when you add nodes

(1) should always be the case no matter where on the ring the node is
since there will be at least one replica of each range in each DC.

(2) is where we get into trouble here no matter which DC we add to.
 (a) if we add to G's DC, X will get all the replicas G has, remaining
unbalanced
 (b) if we add to the other DC, G will still be hit from all the
replicas from the other DC

So ISTM that the only real solution is to do what we say in the
Operations page, and make sure that nodes on the ring alternate DCs.
I don't think only considering nodes in the same DC helps with that.

-Jonathan
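The "alternate DCs on the ring" recommendation from the Operations page can be sketched as follows. This is a minimal illustration, not Cassandra code; the token space (0 .. 2**127, as with RandomPartitioner) and node names are assumptions for the example.

```python
# Interleave nodes from two data centers at evenly spaced tokens so that
# ring neighbors are always in different DCs.

RING_SIZE = 2 ** 127

def alternating_tokens(dc1_nodes, dc2_nodes):
    """Assign evenly spaced tokens, alternating DC1 and DC2 nodes."""
    if len(dc1_nodes) != len(dc2_nodes):
        raise ValueError("simple alternation assumes equal-sized DCs")
    interleaved = [n for pair in zip(dc1_nodes, dc2_nodes) for n in pair]
    step = RING_SIZE // len(interleaved)
    return {node: i * step for i, node in enumerate(interleaved)}

tokens = alternating_tokens(["dc1-a", "dc1-b"], ["dc2-a", "dc2-b"])
# Ring order: dc1-a, dc2-a, dc1-b, dc2-b -- neighbors are always in
# different DCs, which is the balance property discussed above.
```

As the thread notes, a simple scheme like this only works when the DCs have equal node counts; uneven clusters fall back to the raw move command.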


Re: get_range_slice() tester

2010-02-08 Thread Jonathan Ellis
I'm seeing failures on 0.5 but success against trunk, is that also what you see?

-Jonathan

On Mon, Feb 8, 2010 at 4:42 PM, Jack Culpepper  wrote:
> On Mon, Feb 8, 2010 at 2:34 PM, Jonathan Ellis  wrote:
>> This is supposed to pass on a single node but fail on two, correct?
>
> Yep! At least, it does for me.
>
>> What are the tokens on your two nodes, in case that is relevant?
>> (nodeprobe ring will tell you.)
>
> Heh, unfortunately this also shows the fact that I accidentally
> blasted one of my data dirs. ;)
>
> $ sudo bin/nodeprobe -host localhost ring
> [sudo] password for jack:
> Address        Status  Load      Range             Ring
>                                  YQVhw0uDS4RMOASI
> 10.212.87.165  Up      8.18 KB   13DyIzn2EhRAHOq9  |<--|
> 10.212.230.176 Up      11.71 GB  YQVhw0uDS4RMOASI  |-->|
>
> J
>
>> -Jonathan
>>
>> On Mon, Feb 8, 2010 at 4:01 PM, Jack Culpepper  
>> wrote:
>>> Here's a tester program, for contrib. It generates 10 keys using uuid,
>>> inserts them both into the cassandra column family Keyspace1/Super1
>>> and a python dictionary. Then, it does a range scan using both methods
>>> and marks the keys that are returned. Finally, it goes through the
>>> python dictionary, makes sure a cassandra get() on each key works
>>> (it should throw an exception on failure), and complains about keys
>>> that were not found in the range scan.
>>>
>>> To run, put the contents in test_bug.py then run like this:
>>>
>>> python test_bug.py get_key_range
>>>
>>> (Nothing printed means it worked.)
>>>
>>> python test_bug.py get_range_slice
>>>
>>> (Keys that should have been found in a range scan, but were not, are 
>>> printed.)
>>>
>>> Best,
>>>
>>> Jack
>>>
>>>
>>>
>>> import sys
>>> import time
>>> import uuid
>>>
>>> from thrift import Thrift
>>> from thrift.transport import TTransport
>>> from thrift.transport import TSocket
>>> from thrift.protocol.TBinaryProtocol import TBinaryProtocolAccelerated
>>> from cassandra import Cassandra
>>> from cassandra.ttypes import *
>>>
>>> num_keys = 10
>>>
>>> socket = TSocket.TSocket("10.212.87.165", 9160)
>>> transport = TTransport.TBufferedTransport(socket)
>>> protocol = TBinaryProtocolAccelerated(transport)
>>> client = Cassandra.Client(protocol)
>>>
>>> ks = "Keyspace1"
>>> cf = "Super1"
>>> cl = ConsistencyLevel.ONE
>>>
>>> d = {}
>>>
>>> transport.open()
>>>
>>> if 1:
>>>    ## insert keys using the raw thrift interface
>>>    cpath = ColumnPath(cf, "foo", "is")
>>>    value = "cool"
>>>
>>>    for i in xrange(num_keys):
>>>        ts = time.time()
>>>        key = uuid.uuid4().hex
>>>        client.insert(ks, key, cpath, value, ts, cl)
>>>        d[key] = 1
>>>
>>> else:
>>>    ## insert keys using pycassa!
>>>    import pycassa
>>>
>>>    client = pycassa.connect(["10.212.87.165:9160"])
>>>    cf_test = pycassa.ColumnFamily(client, ks, cf, super=True)
>>>
>>>    for i in xrange(num_keys):
>>>        key = uuid.uuid4().hex
>>>        cf_test.insert(key, { 'params' : { 'is' : 'cool' }})
>>>        d[key] = 1
>>>
>>>
>>> cparent = ColumnParent(column_family=cf)
>>> slice_range = SliceRange(start="key", finish="key")
>>> p = SlicePredicate(slice_range=slice_range)
>>>
>>> done = False
>>> seg = 1000
>>> start = ""
>>>
>>> ## do a scan using either get_key_range() (deprecated) or get_range_slice()
>>> ## for every key returned that is in the dictionary, mark it as found
>>> while not done:
>>>    if sys.argv[1] == "get_key_range":
>>>        result = client.get_key_range(ks, cf, start, "", seg, cl)
>>>
>>>        if len(result) < seg: done = True
>>>        else: start = result[seg-1]
>>>
>>>        for r in result:
>>>            if d.has_key(r):
>>>                d[r] = 0
>>>
>>>    if sys.argv[1] == "get_range_slice":
>>>        result = client.get_range_slice(ks, cparent, p, start, "", seg, cl)
>>>
>>>        if len(result) < seg: done = True
>>>        else: start = result[seg-1].key
>>>
>>>        for r in result:
>>>            if d.has_key(r.key):
>>>                d[r.key] = 0
>>>
>>> cpath = ColumnPath(column_family=cf, super_column='foo')
>>>
>>> ## get, remove all the keys
>>> ## print all the keys that were not marked 0
>>> for k in d:
>>>    result = client.get(ks, k, cpath, cl)
>>>    #print result
>>>
>>>    if d[k] == 1:
>>>        print k, "not marked 0"
>>>    #else:
>>>    #    print k, "was marked 0!"
>>>
>>>    ts = time.time()
>>>    client.remove(ks, k, cpath, ts, cl)
>>>
>>
>
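One subtlety in the tester's paging loop: a start-inclusive range call like get_range_slice re-returns the start key on every page after the first, so naive paging double-counts it. A generic paging sketch (the fetch function here is a stand-in for the Thrift call, not real Cassandra API):

```python
# Paging over a start-inclusive range call: after the first page, the
# start key is the previous page's last key, so it comes back again and
# must be dropped to avoid double-counting.

def page_all_keys(fetch_page, page_size=1000):
    """fetch_page(start, count) -> sorted keys, start-inclusive."""
    keys, start = [], ""
    while True:
        page = fetch_page(start, page_size)
        done = len(page) < page_size          # short raw page: no more data
        if keys and page and page[0] == keys[-1]:
            page = page[1:]                   # drop re-returned start key
        keys.extend(page)
        if done:
            return keys
        start = keys[-1]

# Stand-in for the Thrift call, over an in-memory sorted key list:
data = sorted("key%02d" % i for i in range(25))
def fake_fetch(start, count):
    return [k for k in data if k >= start][:count]

assert page_all_keys(fake_fetch, page_size=10) == data
```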


Re: get_range_slice() tester

2010-02-08 Thread Jonathan Ellis
This is supposed to pass on a single node but fail on two, correct?

What are the tokens on your two nodes, in case that is relevant?
(nodeprobe ring will tell you.)

-Jonathan

On Mon, Feb 8, 2010 at 4:01 PM, Jack Culpepper  wrote:
> Here's a tester program, for contrib. It generates 10 keys using uuid,
> inserts them both into the cassandra column family Keyspace1/Super1
> and a python dictionary. Then, it does a range scan using both methods
> and marks the keys that are returned. Finally, it goes through the
> python dictionary, makes sure a cassandra get() on each key works
> (it should throw an exception on failure), and complains about keys
> that were not found in the range scan.
>
> To run, put the contents in test_bug.py then run like this:
>
> python test_bug.py get_key_range
>
> (Nothing printed means it worked.)
>
> python test_bug.py get_range_slice
>
> (Keys that should have been found in a range scan, but were not, are printed.)
>
> Best,
>
> Jack
>
>
>
> import sys
> import time
> import uuid
>
> from thrift import Thrift
> from thrift.transport import TTransport
> from thrift.transport import TSocket
> from thrift.protocol.TBinaryProtocol import TBinaryProtocolAccelerated
> from cassandra import Cassandra
> from cassandra.ttypes import *
>
> num_keys = 10
>
> socket = TSocket.TSocket("10.212.87.165", 9160)
> transport = TTransport.TBufferedTransport(socket)
> protocol = TBinaryProtocolAccelerated(transport)
> client = Cassandra.Client(protocol)
>
> ks = "Keyspace1"
> cf = "Super1"
> cl = ConsistencyLevel.ONE
>
> d = {}
>
> transport.open()
>
> if 1:
>    ## insert keys using the raw thrift interface
>    cpath = ColumnPath(cf, "foo", "is")
>    value = "cool"
>
>    for i in xrange(num_keys):
>        ts = time.time()
>        key = uuid.uuid4().hex
>        client.insert(ks, key, cpath, value, ts, cl)
>        d[key] = 1
>
> else:
>    ## insert keys using pycassa!
>    import pycassa
>
>    client = pycassa.connect(["10.212.87.165:9160"])
>    cf_test = pycassa.ColumnFamily(client, ks, cf, super=True)
>
>    for i in xrange(num_keys):
>        key = uuid.uuid4().hex
>        cf_test.insert(key, { 'params' : { 'is' : 'cool' }})
>        d[key] = 1
>
>
> cparent = ColumnParent(column_family=cf)
> slice_range = SliceRange(start="key", finish="key")
> p = SlicePredicate(slice_range=slice_range)
>
> done = False
> seg = 1000
> start = ""
>
> ## do a scan using either get_key_range() (deprecated) or get_range_slice()
> ## for every key returned that is in the dictionary, mark it as found
> while not done:
>    if sys.argv[1] == "get_key_range":
>        result = client.get_key_range(ks, cf, start, "", seg, cl)
>
>        if len(result) < seg: done = True
>        else: start = result[seg-1]
>
>        for r in result:
>            if d.has_key(r):
>                d[r] = 0
>
>    if sys.argv[1] == "get_range_slice":
>        result = client.get_range_slice(ks, cparent, p, start, "", seg, cl)
>
>        if len(result) < seg: done = True
>        else: start = result[seg-1].key
>
>        for r in result:
>            if d.has_key(r.key):
>                d[r.key] = 0
>
> cpath = ColumnPath(column_family=cf, super_column='foo')
>
> ## get, remove all the keys
> ## print all the keys that were not marked 0
> for k in d:
>    result = client.get(ks, k, cpath, cl)
>    #print result
>
>    if d[k] == 1:
>        print k, "not marked 0"
>    #else:
>    #    print k, "was marked 0!"
>
>    ts = time.time()
>    client.remove(ks, k, cpath, ts, cl)
>


Re: nose tests -- run recently?

2010-02-08 Thread Jonathan Ellis
On Mon, Feb 8, 2010 at 3:53 PM, Jack Culpepper  wrote:
> $ python --version
> Python 2.6.4

2.6.4 here, too.

> One more stupid question if you can stand it: when I get "Connection
> reset by peer" on the python side, where should I see the
> corresponding error on the java side?

Connection reset by peer usually means "the other side [here, the
server] closed the connection before it finished reading everything I
sent."  That isn't supposed to happen with Thrift, so I dunno.  Maybe
you are using a framed transport on the server?  The system tests
assume unframed.  But if that were the case I would expect everything
to error out much earlier.

-Jonathan
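The framed/unframed mismatch mentioned above is worth making concrete. A framed peer length-prefixes every message; an unframed peer sends raw bytes, so a mismatched pair misparses the stream and the connection gets dropped. A self-contained sketch of the framing itself (illustrative, not the Thrift implementation):

```python
# Why framed vs. unframed must match on both ends: framing prepends a
# 4-byte big-endian length to each message.  An unframed reader treats
# that prefix as payload (or vice versa) and the stream falls apart.
import struct

def frame(payload: bytes) -> bytes:
    return struct.pack(">i", len(payload)) + payload

def unframe(data: bytes) -> bytes:
    (length,) = struct.unpack(">i", data[:4])
    return data[4:4 + length]

msg = b"thrift-call"
assert unframe(frame(msg)) == msg
# An unframed reader of framed data sees the length prefix, not the call:
assert frame(msg)[:4] != msg[:4]
```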


Re: nose tests -- run recently?

2010-02-08 Thread Jonathan Ellis
On Mon, Feb 8, 2010 at 3:02 PM, Jack Culpepper  wrote:
> On Mon, Feb 8, 2010 at 12:38 PM, Jonathan Ellis  wrote:
>> On Mon, Feb 8, 2010 at 2:25 PM, Jack Culpepper  
>> wrote:
>>> Are you running on a platform that doesn't care about capitalization?
>>
>> Yes.  If you're building Thrift on windows you're only the second
>> person I know to have done so. :)
>>
>
> Your platform *cares* about capitalization, right? Maybe I'm missing
> something, but how can you "from Constants import VERSION"? The file
> containing VERSION is called "constants.py" in the tree.

Ah, I see.  Yes, my linux VM is accessing a NTFS fs here.  Fixed in r907797.

> No exceptions. I actually get a different error when I run the tests now.

Sorry, I have no useful suggestions here, other than passing the -x
flag to nosetests to get it to quit at first error.

-Jonathan


Re: nose tests -- run recently?

2010-02-08 Thread Jonathan Ellis
On Mon, Feb 8, 2010 at 2:25 PM, Jack Culpepper  wrote:
> Are you running on a platform that doesn't care about capitalization?

Yes.  If you're building Thrift on windows you're only the second
person I know to have done so. :)

> error: [Errno 104] Connection reset by peer

I've never seen Thrift do that before.  Did you check the server log
for exceptions?

> What if I create a separate set of tests to make sure a running
> cluster is behaving properly? The first test would make sure the
> proper key spaces exist. That could fail with a clear message.
> Subsequently, scan the keys, remove all of them, then scan again. Now,
> either the environment is clean or something failed. Next, do some
> insertions, deletions, scans, etc. This would help me verify that my
> cluster is set up properly.

Sounds more like an in-house testing tool to me but we could evaluate
putting something like that in contrib/.

-Jonathan


Re: nose tests -- run recently?

2010-02-08 Thread Jonathan Ellis
On Mon, Feb 8, 2010 at 3:17 AM, Jack Culpepper  wrote:
> ERROR: Failure: ImportError (No module named Cassandra)

fixed in r907705, btw.


Re: nose tests -- run recently?

2010-02-08 Thread Jonathan Ellis
On Mon, Feb 8, 2010 at 3:17 AM, Jack Culpepper  wrote:
> Ok. Fixed that, but then I run nosetests and get a bunch of other
> errors.. I must be doing something wrong. I just checked out the code
> like 20 mins ago.

I'm going to have to agree with your diagnosis -- everything passes
for me -- but it's hard to help w/o an example of the errors. :)

> Also, I see that the tests are run on a single node rather than a
> multi-node setup. It would be nice if the testing infrastructure
> connected to a running cluster, and the standard storage-conf.xml
> included the column families necessary for testing. We could do that
> by including the test keyspaces in
>
> test/conf/storage-conf.xml
>
> as a part of the default storage-conf.xml that gets installed with the
> system. Best,

It's not as easy as that -- like most tests, the system tests are
independent and need to start with a clean environment.  nose has to be
able to restart the servers between tests and clean out the data &
commitlog directories.

-Jonathan
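The clean-environment step described above can be sketched as a tiny fixture that wipes and recreates the data and commitlog directories between tests. Paths here are illustrative, not Cassandra defaults:

```python
# Wipe and recreate state directories so each test run starts from
# scratch, as the system-test harness must do between tests.
import os
import shutil
import tempfile

def reset_dirs(*dirs):
    """Delete and recreate each directory, ignoring missing ones."""
    for d in dirs:
        shutil.rmtree(d, ignore_errors=True)
        os.makedirs(d)

base = tempfile.mkdtemp()
data_dir = os.path.join(base, "data")
commitlog_dir = os.path.join(base, "commitlog")
reset_dirs(data_dir, commitlog_dir)
```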


Re: batch file error on vista

2010-02-04 Thread Jonathan Ellis
The existing batch file works fine for me on windows 7.  So does
Tom's, modified to not hardcode stuff it shouldn't.  (attached, w/
name mangling to make gmail happy).  Can anyone test this on XP?  If
the code we're removing to "shorten lib path for older platforms" is
required for XP we should probably not remove it.

On Thu, Feb 4, 2010 at 10:31 AM, Tom Borthwick  wrote:
> Hello,
>
> Not sure if this should go to the user or dev mailing list, but the
> cassandra.bat file gave me this error on Vista:
>
> C:\programs\apache-cassandra-incubating-0.5.0>bin\cassandra -f
> Invalid parameter - P:
> Starting Cassandra Server
> Listening for transport dt_socket at address: 
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/cassandra/service/CassandraDaemon
> Caused by: java.lang.ClassNotFoundException: org.apache.cassandra.service.CassandraDaemon
>        at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>        at java.security.AccessController.doPrivileged(Native Method)
>        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
>        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
>        at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
>        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
> Could not find the main class: org.apache.cassandra.service.CassandraDaemon.  Program will exit.
>
> The problem is in the handling of the classpath and the 'P:'
> substitution lines. I just set the classpath directly and it worked
> fine. My batch file looks like this now:


Re: bitmap slices

2010-02-04 Thread Jonathan Ellis
2010/2/4 Ted Zlatanov :
> JE> The mask check needs to be done in the Slice Filter, not SP.
>
> Sorry, I don't know what you mean.  Are you referring to
> o.a.c.db.filter.SliceQueryFilter?  So I'd just add an extra parameter to
> the constructor and change the matching logic?

Right, but make it optional.

> JE> Is this actually powerful enough to solve a real problem for you?
>
> Yes!  OR+AND are exactly what I need.
>
> One specific situation: a supercolumn holds byte[] keys representing
> network addresses (IPv4, IPv6, and Infiniband).  I want to do efficient
> queries across them by various netmasks; the netmasks are not trivial
> and need the OR+AND structure.  Right now I do it all on the client
> side.  I can't break things down by key or by supercolumn further
> because I already use the supercolumn as a time (Long) index and the key
> represents the specific colo.

All right, let's give it a try.
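Ted's client-side workaround for the OR+AND netmask queries can be sketched as follows: an address matches if, for any (mask, value) pair, `addr & mask == value`. Addresses are shown as 32-bit ints; the example rules are illustrative assumptions:

```python
# Client-side OR-of-ANDs netmask matching: OR over the rule list,
# AND (bitwise) within each rule.

def matches(addr, or_of_ands):
    return any((addr & mask) == value for mask, value in or_of_ands)

# "in 10.0.0.0/8 OR in 192.168.0.0/16"
rules = [(0xFF000000, 0x0A000000),   # 10.0.0.0/8
         (0xFFFF0000, 0xC0A80000)]   # 192.168.0.0/16

assert matches(0x0A010203, rules)        # 10.1.2.3
assert matches(0xC0A80101, rules)        # 192.168.1.1
assert not matches(0x08080808, rules)    # 8.8.8.8
```

Pushing this predicate into the slice filter, as discussed above, would move the same AND/OR test server-side.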


Re: bitmap slices

2010-02-03 Thread Jonathan Ellis
It seems to me that the bitmask is only really useful for the
SliceRange predicate.  Doing a predicate of "fetch these column names,
but only if they match this mask" seems strange.

The mask check needs to be done in the Slice Filter, not SP.

Is this actually powerful enough to solve a real problem for you?

-Jonathan

2010/2/3 Ted Zlatanov :
> On Mon, 1 Feb 2010 11:14:12 -0600 Jonathan Ellis  wrote:
>
> JE> 2010/2/1 Ted Zlatanov :
>>> On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis  wrote:
>>>
> JE> I don't think this is very useful for column names.  I could see it
> JE> being useful for values but if we're going to add predicate queries
> JE> then I'd rather do something more general.
>>>
>>> Do you have any ideas?
>
> JE> Not really, no.  I think we're best served developing feature X by
> JE> starting with problems that can only be solved with X and working from
> JE> there.  Going the other direction is asking for trouble.
>
> I looked at the filters, e.g. o.a.c.db.filter.SliceQueryFilter, and it
> seems like one place to put predicate logic is in that hierarchy.
> Perhaps there can be a PredicateQueryFilter.  Some thought has
> apparently already gone into flexible filters at the storage level.  I
> hope something happens in this direction but I won't push for it
> further since it's not what I need.
>
> The attached patch is how I propose to do bitmasks inside the
> SlicePredicate.  As you suggested, it solves the specific problem.  It's
> pretty simple and carries no performance penalty if bitmasks are not
> used.  It's untested, intended to show the interface and approach I am
> proposing.  I didn't open an issue since it's unclear that this is the
> way to go.
>
> Thanks
> Ted
>
>


Re: column sizes (was: online codes (?))

2010-02-03 Thread Jonathan Ellis
That's correct.

On Wed, Feb 3, 2010 at 4:49 PM, Michael Pearson  wrote:
> Thanks for the Gossip note, I'll keep reading up on the protocols.
> For key/column/disk I meant in terms of the Cassandra limitation -
>
> "The main limitation on column and supercolumn size is that all data
> for a single key and column must fit (on disk) on a single machine in
> the cluster."
>
> Is it right to think an entire supercolumn (so, possibly a very wide
> supercolumn of large object columns, depending on the application's data
> model) needs to fit on a single node, or am I off the mark?
>
> -Michael
>
> On Thu, Feb 4, 2010 at 8:28 AM, Jonathan Ellis  wrote:
>> On Wed, Feb 3, 2010 at 4:25 PM, Michael Pearson  wrote:
>>> I'd imagine the gossip overhead and key/column per disk limitation is
>>> too open for abuse to recommend storing lob columns with any level of
>>> predictability, particularly if frequent updates are involved.
>>
>> Gossip overhead is constant for a given cluster size.  What do you
>> mean by key/column per disk limitation?
>>
>> -Jonathan
>>
>


Re: column sizes (was: online codes (?))

2010-02-03 Thread Jonathan Ellis
On Wed, Feb 3, 2010 at 4:25 PM, Michael Pearson  wrote:
> I'd imagine the gossip overhead and key/column per disk limitation is
> too open for abuse to recommend storing lob columns with any level of
> predictability, particularly if frequent updates are involved.

Gossip overhead is constant for a given cluster size.  What do you
mean by key/column per disk limitation?

-Jonathan


Re: column sizes

2010-02-03 Thread Jonathan Ellis
it puts a limit on result set size no matter what kind of columns are
in the result

2010/2/3 Ted Zlatanov :
> On Wed, 3 Feb 2010 07:23:06 -0600 Jonathan Ellis  wrote:
>
> JE> At least one person is putting in chunks of up to 64MB, so at some
> JE> level it "works" but it's not what it's designed for.
>
> 64MB is pretty decent for my needs actually.  I can segment the data
> into multiple columns and don't necessarily want all of it loaded at
> once anyhow.  The result set from a ReadCommand is all in memory though
> IIUC, so does that put a limit on the total result set size within a
> SuperColumn?
>
> Ted
>
>


Re: column sizes (was: online codes (?))

2010-02-03 Thread Jonathan Ellis
At least one person is putting in chunks of up to 64MB, so at some
level it "works" but it's not what it's designed for.

2010/2/3 Ted Zlatanov :
> On Tue, 2 Feb 2010 23:05:04 -0600 Jonathan Ellis  wrote:
>
> JE> The "atom" in cassandra is a single column.  These are almost always
> JE> under 1KB.
>
> Is there any point to storing large objects (over 100MB) in Cassandra
> columns?  I'm considering it but it seems like a bad idea based on my
> reading of the source and experience so far.  If I could do it it would
> eliminate my need for shared resources (NAS, web server, etc.) to serve
> those objects.
>
> Ted
>
>


Re: online codes (?)

2010-02-02 Thread Jonathan Ellis
On Tue, Feb 2, 2010 at 9:05 PM, Anthony Di Franco
 wrote:
> Taking the discussion below to the dev list.
>
> Continuing the discussion, it seems to me that objects in Cassandra
> might be quite large from this passage:

You've misunderstood.  The "atom" in cassandra is a single column.
These are almost always under 1KB.

> So there is (apparently) lots of room for objects to become large with
> respect to the size of the storage overhead of a single piece, at
> which time using an online code could provide significant space
> savings for a given level of resiliency.

Trading space for latency is the wrong trade for us, even if erasure
codes were a good fit for us architecturally, which they are not.
Please read the links I referenced in the ticket for more background.

-Jonathan


Re: predicate queries (was: bitmap slices)

2010-02-01 Thread Jonathan Ellis
2010/2/1 Ted Zlatanov :
> My list of things I need for predicate queries across column and
> supercolumn names:
>
> - bitmask (OR AND1 AND2 AND3 ...).  This would make my life easier and
>  take load off our Cassandra servers.  Currently I have to scan the
>  result sets on the client side to find the things I need.

here's the thing, though, and this is a main reason why cassandra
keeps things simple: doing work on the client typically results in
*less* load on the server.

> FWIW I'd like an entirely text-based query language like SQL

I think this is a non-starter.  It's pretty clear that the way forward
is towards more programmatic apis, not clients translating their
requests to strings which the server then parses.

-Jonathan


Re: bitmap slices

2010-02-01 Thread Jonathan Ellis
2010/2/1 Ted Zlatanov :
> On Mon, 1 Feb 2010 10:41:28 -0600 Jonathan Ellis  wrote:
>
> JE> I don't think this is very useful for column names.  I could see it
> JE> being useful for values but if we're going to add predicate queries
> JE> then I'd rather do something more general.
>
> Do you have any ideas?

Not really, no.  I think we're best served developing feature X by
starting with problems that can only be solved with X and working from
there.  Going the other direction is asking for trouble.

-Jonathan


Re: bitmap slices

2010-02-01 Thread Jonathan Ellis
I don't think this is very useful for column names.  I could see it
being useful for values but if we're going to add predicate queries
then I'd rather do something more general.

2010/2/1 Ted Zlatanov :
> On Mon, 1 Feb 2010 09:42:16 -0600 Jonathan Ellis  wrote:
>
> JE> 2010/2/1 Ted Zlatanov :
>>> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov  wrote:
>>>
> TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis  
> wrote:
> JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
> JE>  wrote:
>>>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my
>>>>>>   database size would scale up tremendously.
>>>
> JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
> JE> other things fast.
>>>
> TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>>>
> TZ> I searched the bug tracker and wiki and didn't find anything on the
> TZ> topic of tag storage and search, so I don't think Cassandra supports
> TZ> tags without data duplication.
>>>
> TZ> Would it be possible to implement an optional byte[] bitmap field in
> TZ> SliceRange?  If you can specify the bitmap as an optional field it would
> TZ> not break current clients.  Then the search can return only the subset
> TZ> of the range that matches the bitmap.  This would make sense for
> TZ> BytesType and LongType, at least.
>>>
>>> I looked at the source code and it seems that
>>> StorageProxy::getSliceRange() is the focal point for reads and bitmap
>>> matching should be implemented there.  The bitmap could be applied as a
>>> filter before the other SliceRange parameters, especially the max number
>>> of return results.  It may be worth the effort to send the bitmap down
>>> to the ReadCommand/ColumnFamily level to reduce the number of potential
>>> matches.
>>>
>>> If this is not feasible for technical reasons I'd like to know.
>>> Otherwise I'll put it on my TODO list and produce a proposal (unless
>>> someone more knowledgeable is interested, of course).
>
> JE> how would this be different then the byte[] column name you can
> JE> already match on?
>
> Given byte columns
>
> A 0110
> B 0111
> C 0101
>
> the bitmask approach would let you specify a bitmask of "0011" and get
> only B.  It's just an AND that looks for a non-zero value.  So you can
> say "0111" and get A, B, and C.  Or "0010" to get A and B.  "1000" gets
> nothing.
>
> Cassandra could support OR-ed multiples for better queries, so you could
> ask for (0001,0010) to get A, B, and C.
>
> Ted
>
>
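A minimal sketch of the non-zero-AND semantics Ted describes, using his example columns: a column matches a mask when `(name & mask) != 0`, and multiple masks are OR-ed together.

```python
# Non-zero-AND bitmask filtering over column names, with OR-ed masks.

cols = {"A": 0b0110, "B": 0b0111, "C": 0b0101}

def select(columns, masks):
    """Return names whose value has a non-zero AND with any mask."""
    return sorted(n for n, v in columns.items()
                  if any(v & m for m in masks))

assert select(cols, [0b0010]) == ["A", "B"]               # "0010 gets A and B"
assert select(cols, [0b1000]) == []                       # "1000 gets nothing"
assert select(cols, [0b0001, 0b0010]) == ["A", "B", "C"]  # OR-ed masks
```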


Re: bitmap slices

2010-02-01 Thread Jonathan Ellis
how would this be different then the byte[] column name you can
already match on?

2010/2/1 Ted Zlatanov :
> On Fri, 29 Jan 2010 15:07:01 -0600 Ted Zlatanov  wrote:
>
> TZ> On Fri, 29 Jan 2010 12:06:28 -0600 Jonathan Ellis  
> wrote:
> JE> On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
> JE>  wrote:
>>>>   1. This would lead to an enormous amount of duplication of data; in short,
>>>>   if I now want to view the data from the IS_PUBLISHED dimension then my
>>>>   database size would scale up tremendously.
>
> JE> Yes.  But disk space is so cheap it's worth using a lot of it to make
> JE> other things fast.
>
> TZ> IIUC, Mehar would be duplicating the article data for every article tag.
>
> TZ> I searched the bug tracker and wiki and didn't find anything on the
> TZ> topic of tag storage and search, so I don't think Cassandra supports
> TZ> tags without data duplication.
>
> TZ> Would it be possible to implement an optional byte[] bitmap field in
> TZ> SliceRange?  If you can specify the bitmap as an optional field it would
> TZ> not break current clients.  Then the search can return only the subset
> TZ> of the range that matches the bitmap.  This would make sense for
> TZ> BytesType and LongType, at least.
>
> I looked at the source code and it seems that
> StorageProxy::getSliceRange() is the focal point for reads and bitmap
> matching should be implemented there.  The bitmap could be applied as a
> filter before the other SliceRange parameters, especially the max number
> of return results.  It may be worth the effort to send the bitmap down
> to the ReadCommand/ColumnFamily level to reduce the number of potential
> matches.
>
> If this is not feasible for technical reasons I'd like to know.
> Otherwise I'll put it on my TODO list and produce a proposal (unless
> someone more knowledgeable is interested, of course).
>
> Ted
>
>


Re: Understand how to provision nodes and use cassandra in the production

2010-01-30 Thread Jonathan Ellis
you want to get hitrate to 0.9 or so, i.e. 90% of index lookups don't
have to hit disk.  play with KCF and see what happens.  and use
jconsole to see how close you are getting to your 3GB limit (hit the
GC button to see how much memory is "really" being used, and then add
25% or so for a reasonable padding).

On Sat, Jan 30, 2010 at 5:46 PM, Suhail Doshi  wrote:
> According jconsole on the main table I am having issues with:
>
> Capacity: 1164790
> HitRate: .54
> Size: 99753
>
> Right now my KeysCachedFraction is 0.2. The current memory allocated is 3G.
> What's a suggested KeysCachedFraction value?
>
> Suhail
>
> On Sat, Jan 30, 2010 at 5:58 AM, Jonathan Ellis  wrote:
>
>> the thing that will help most in 0.5 is to increase your
>> KeysCachedFraction to 0.2 or even more, depending on your workload.
>>
>> On Sat, Jan 30, 2010 at 5:23 AM, Suhail Doshi 
>> wrote:
>> > An issue I've been seeing is it's really hard to scale Cassandra with
>> reads.
>> > I've run top, vmstat, iostat. vmstat shows no swapping but iostat shows
>> > heavy saturation of %util and await times over 90ms with max rMB/s of
>> 7-8.
>> >
>> > I have over 7G of memory dedicated across two nodes. I am wondering what
>> the
>> > issue might be and how to solve this? I felt like 7 G would be enough.
>> >
>> > Suhail
>> >
>> > On Thu, Jan 28, 2010 at 7:32 PM, Ray Slakinski  wrote:
>> >
>> >> Cassandra auto shards, so you just need to point at your cluster and
>> >> cassandra does the rest. You should read up on different partitioners
>> though
>> >> before you go live in production, because its not too easy to switch
>> once
>> >> you make that decision.
>> >>
>> >> http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner
>> >>
>> >> Ray Slakinski
>> >> On 2010-01-28, at 7:29 PM, Suhail Doshi wrote:
>> >>
>> >> > Another piece I am interested in is how cassandra distributes the data
>> >> > automatically. In MySQL you need to shard and you'd pick the shard to
>> >> > request info from--how does that translate in cassandra?
>> >> >
>> >> > On Thu, Jan 28, 2010 at 7:23 PM, Suhail Doshi 
>> >> wrote:
>> >> >
>> >> >> We've started to use Cassandra in production and just have one node
>> >> right
>> >> >> now. Here's one of our ColumnFamilys:
>> >> >>
>> >> >> 16G Jan 28 22:28 SomeIndex-5467-Index.db
>> >> >> 196M Jan 28 22:32 SomeIndex-5487-Index.db
>> >> >>
>> >> >> The first bottleneck you encounter is reads--writes are extremely
>> fast
>> >> even with one node.
>> >> >>
>> >> >> My question is, is the size of the *-Index.db files the amount of RAM
>> >> you need available for Cassandra to do reads fast?
>> >> >>
>> >> >> What are some configuration options you would need to tweak besides
>> the
>> JVM's max memory size being larger. Are there any default configurations
>> >> commonly missed?
>> >> >>
>> >> >> Next, if you provision more nodes will Cassandra distribute the data
>> in
>> >> memory so I don't need a single 16 GB node? Is there anything I need to
>> >> build in my application logic to make this work correctly. Ideally, if I
>> had
>> >> a 16 GB index, I'd want it spread across 4 4GB nodes. Can any client
>> connect
>> >> to any one node request info and it will get the info back from a node
>> that
>> >> has that part of the index in memory?
>> >> >>
>> >> >> What's the best way to do efficient reads?
>> >> >>
>> >> >> Suhail
>> >> >>
>> >> >>
>> >>
>> >>
>> >
>> >
>> > --
>> > http://mixpanel.com
>> > Blog: http://blog.mixpanel.com
>> >
>>
>
>
>
> --
> http://mixpanel.com
> Blog: http://blog.mixpanel.com
>


Re: Understand how to provision nodes and use cassandra in the production

2010-01-30 Thread Jonathan Ellis
the thing that will help most in 0.5 is to increase your
KeysCachedFraction to 0.2 or even more, depending on your workload.

On Sat, Jan 30, 2010 at 5:23 AM, Suhail Doshi  wrote:
> An issue I've been seeing is it's really hard to scale Cassandra with reads.
> I've run top, vmstat, iostat. vmstat shows no swapping but iostat shows
> heavy saturation of %util and await times over 90ms with max rMB/s of 7-8.
>
> I have over 7G of memory dedicated across two nodes. I am wondering what the
> issue might be and how to solve this? I felt like 7 G would be enough.
>
> Suhail
>
> On Thu, Jan 28, 2010 at 7:32 PM, Ray Slakinski  wrote:
>
>> Cassandra auto shards, so you just need to point at your cluster and
>> cassandra does the rest. You should read up on different partitioners though
>> before you go live in production, because it's not too easy to switch once
>> you make that decision.
>>
>> http://wiki.apache.org/cassandra/StorageConfiguration#Partitioner
>>
>> Ray Slakinski
>> On 2010-01-28, at 7:29 PM, Suhail Doshi wrote:
>>
>> > Another piece I am interested in is how cassandra distributes the data
>> > automatically. In MySQL you need to shard and you'd pick the shard to
>> > request info from--how does that translate in cassandra?
>> >
>> > On Thu, Jan 28, 2010 at 7:23 PM, Suhail Doshi 
>> wrote:
>> >
>> >> We've started to use Cassandra in production and just have one node
>> right
>> >> now. Here's one of our ColumnFamilys:
>> >>
>> >> 16G Jan 28 22:28 SomeIndex-5467-Index.db
>> >> 196M Jan 28 22:32 SomeIndex-5487-Index.db
>> >>
>> >> The first bottleneck you encounter is reads--writes are extremely fast
>> even with one node.
>> >>
>> >> My question is, is the size of the *-Index.db files the amount of RAM
>> you need available for Cassandra to do reads fast?
>> >>
>> >> What are some configuration options you would need to tweak besides the
>> JVM's max memory size being larger. Is there any default configurations
>> commonly missed?
>> >>
>> >> Next, if you provision more nodes will Cassandra distribute the data in
>> memory so I don't need a single 16 GB node? Is there anything I need to
>> build in my application logic to make this work correctly. Ideally, if I had
>> a 16 GB index, I'd want it spread across 4 4GB nodes. Can any client connect
>> to any one node request info and it will get the info back from a node that
>> has that part of the index in memory?
>> >>
>> >> What's the best way to do efficient reads?
>> >>
>> >> Suhail
>> >>
>> >>
>>
>>
>
>
> --
> http://mixpanel.com
> Blog: http://blog.mixpanel.com
>


Re: Is this possible with cassandra

2010-01-29 Thread Jonathan Ellis
On Fri, Jan 29, 2010 at 9:09 AM, Mehar Chaitanya
 wrote:
>   1. This would lead to an enormous amount of duplication of data; in short,
>   if I now want to view the data from the IS_PUBLISHED dimension, then my database
>   size would scale up tremendously.

Yes.  But disk space is so cheap it's worth using a lot of it to make
other things fast.

>   2. The above way of representing the data would suffice if I want to retrieve
>   something like: get me all the articles whose category is WORLDNEWS. But
>   what if I want to find out something like: get me all the articles whose
>   Section is BASEBALL and Category is WORLDNEWS? For addressing queries that
>   depend on multiple parameters, how do we do it? Hope I am clear with my
>   problem statement :(

You have to do the intersection client-side (or use something like
http://github.com/tjake/Lucandra to do it for you).
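Doing the intersection client-side amounts to reading the row of matching article keys from each index columnfamily and intersecting the key sets in the application. A minimal sketch in plain Python, with dicts standing in for the results of two hypothetical index-row reads (the row contents are illustrative):

```python
# Hypothetical results of two index lookups: each index row maps an
# indexed value (e.g. Section=BASEBALL, Category=WORLDNEWS) to the keys
# of the matching articles, stored as column names with empty values.
section_index = {"article:17": "", "article:23": "", "article:42": ""}
category_index = {"article:23": "", "article:42": "", "article:99": ""}

# Client-side intersection: articles present in BOTH index rows.
matches = sorted(set(section_index) & set(category_index))
print(matches)  # ['article:23', 'article:42']
```

For a handful of indexes this is cheap; the cost grows with the size of the smaller index row, which is why delegating to something like Lucandra becomes attractive as queries get more complex.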

-Jonathan


Re: Is this possible with cassandra

2010-01-29 Thread Jonathan Ellis
Cassandra does not support ad-hoc queries the way SQL does.  If you
want to ask "what rows have a column X containing value Y" then you
need to create a columnfamily whose keys are the values of X, and
whose columns are the keys of your original CF.

Read http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model if
you haven't yet.
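The inverted-index pattern described above -- a second columnfamily whose row keys are the values of X and whose column names are the keys of the original CF -- can be sketched in plain Python, with dicts standing in for columnfamilies (all names are illustrative):

```python
# Original CF: article key -> columns.
articles = {
    "article:1": {"category": "WORLDNEWS", "title": "..."},
    "article:2": {"category": "SPORTS",    "title": "..."},
    "article:3": {"category": "WORLDNEWS", "title": "..."},
}

# Build the index CF: category value -> {article key: ""}.
# In Cassandra this would be a second CF the application keeps
# in sync on every write.
by_category = {}
for key, cols in articles.items():
    by_category.setdefault(cols["category"], {})[key] = ""

# "What rows have a column 'category' containing 'WORLDNEWS'?"
# becomes a single-row read on the index CF:
print(sorted(by_category["WORLDNEWS"]))  # ['article:1', 'article:3']
```

The trade-off is the one discussed elsewhere in this thread: extra disk space and write work, in exchange for turning an ad-hoc scan into a single key lookup.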

On Fri, Jan 29, 2010 at 6:16 AM, Mehar Chaitanya
 wrote:
> Hi All
>
> I am a J2EE programmer; my knowledge is mostly of SQL queries, which I use
> to find the results I need.
>
> How can I use Cassandra for my requirement? Is it possible?
>
> Below is my scenario
>
>   - I have a table which contains columns like
>   Category_name, Section_name, article, is_published_by, with multiple records
>   in the table.
>   - I want to retrieve data based on a condition, like belonging to some
>   category_name 'X'.
>   - The same applies to the other columns: conditions based on Section and
>   is_published_by.
>
>
> Please let me know if it would be possible.
>
> Thanks&Regards,
> Mehar Chaitanya Bandaru,
> Software Engineer,
> S cubes IT Solutions India Pvt. Ltd.,
> http://www.scubian.com
> (W) +91 4040307821,
> (Cell) +91 9440 999 262,
> #4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.
>


Re: Understand how to provision nodes and use cassandra in the production

2010-01-28 Thread Jonathan Ellis
On Thu, Jan 28, 2010 at 9:23 PM, Suhail Doshi  wrote:
> We've started to use Cassandra in production and just have one node right
> now. Here's one of our ColumnFamilys:
>
>
> 16G Jan 28 22:28 SomeIndex-5467-Index.db
> 196M Jan 28 22:32 SomeIndex-5487-Index.db
>
> The first bottleneck you encounter is reads--writes are extremely
> fast even with one node.
>
> My question is, is the size of the *-Index.db files the amount of RAM
> you need available for Cassandra to do reads fast?

No.  It depends on how much of your data is "hot."  IIRC you are running
trunk -- look at the key cache hit rate with various KeyCacheFractions
and see how large it has to be to get an 80% hit rate or so.
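In 0.5 this knob lives in storage-conf.xml; a sketch of the change being suggested (the exact placement of the element within your storage-conf.xml is an assumption to verify against your build):

```xml
<!-- storage-conf.xml: cache the on-disk index positions for this
     fraction of keys; raising it from the default trades memory
     for fewer index seeks on reads. -->
<KeysCachedFraction>0.2</KeysCachedFraction>
```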

> Next, if you provision more nodes will Cassandra distribute the data
> in memory so I don't need a single 16 GB node?

Yes.  See the Ring Management section here:
http://wiki.apache.org/cassandra/Operations

> Can any client connect to any one node request info and it will
> get the info back from a node that has that part of the index in
> memory?

Yes.


Re: How to write insert query in Cassandra

2010-01-28 Thread Jonathan Ellis
i believe cassandra_browser in contrib/ can do inserts with a GUI, but
it's nowhere near as mature as what you would see for MySQL.

you will also want to read http://wiki.apache.org/cassandra/API and
http://wiki.apache.org/cassandra/ClientExamples and probably
http://wiki.apache.org/cassandra/ThriftInterface

On Thu, Jan 28, 2010 at 7:24 AM, Mehar Chaitanya
 wrote:
>
> Hi All
>
> I have configured Cassandra on my PC.
>
> I want to insert some data into it. Is there any graphical interface for
> doing this, like we do in MySQL with the help of GUI tools, etc.?
>
> Suppose I want to insert data like the below. How can I do that?
>
>
> UserList = {
>         John: {
>         username: "john",
>         email: "j...@blah.com"
>             },
>         Smith: {
>         username: "smith",
>         email: "sm...@blah.com"
>             }
>     }
>
>
> Can any one help me
>
> --
> The difference between possible and impossible lies in a person's
> determination.
>
> Thanks&Regards,
> Mehar Chaitanya Bandaru,
> Software Engineer,
> S cubes IT Solutions India Pvt. Ltd.,
> http://www.scubian.com
> (W) +91 4040307821,
> (Cell) +91 9440 999 262,
> #4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.
>
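In Cassandra's data model, the nested UserList structure above maps to one row per user, with each field stored as a (column name, value, timestamp) triple. A plain-Python sketch of that flattening -- the actual Thrift insert call is omitted, only the shape of the data is shown:

```python
import time

user_list = {
    "John":  {"username": "john",  "email": "j...@blah.com"},
    "Smith": {"username": "smith", "email": "sm...@blah.com"},
}

# Flatten to the (row key, column name, value, timestamp) tuples
# that one insert call per column would take.
now = int(time.time() * 1_000_000)  # microsecond timestamps
inserts = [(row_key, col, val, now)
           for row_key, cols in user_list.items()
           for col, val in cols.items()]

for row_key, col, val, ts in inserts:
    print(row_key, col, val)
```

Each tuple becomes one column write; the row key ("John", "Smith") plays the role MySQL's primary key would, and there is no schema to declare for the columns themselves.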


Re: EOFException after upgrading to 0.5.0

2010-01-28 Thread Jonathan Ellis
please read NEWS.txt; both of your problems are covered there (flush
your commitlog, and don't mix 0.4 and 0.5 nodes in the same cluster)

On Thu, Jan 28, 2010 at 6:44 AM, B R  wrote:
> Hi guys,
>
> We are in the process of upgrading from Cassandra 0.4.2 to 0.5.0 The first
> issue I faced was :
>
> java.lang.OutOfMemoryError: Java heap space
>  at org.apache.cassandra.db.CommitLog.recover(CommitLog.java:318)
>  at
> org.apache.cassandra.db.RecoveryManager.doRecovery(RecoveryManager.java:65)
>  at
> org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:90)
>  at
> org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:166)
>
> After searching the mailing lists, I found that it was because of an issue with
> commit log format. After creating a new file, I could proceed to run a test
> sample to insert and read back data. There are no problems with the
> application, but in the system log, I have come across hundreds of entries
> of the following type.
>
> WARN [UDP Selector Manager] 2010-01-28 18:02:02,828 UdpConnection.java (line
> 152) Exception was generated at : 01/28/2010 18:02:02 on thread UDP Selector
> Manager
>
> java.io.EOFException
>    at java.io.DataInputStream.readFully(DataInputStream.java:180)
>    at java.io.DataInputStream.readUTF(DataInputStream.java:592)
>    at java.io.DataInputStream.readUTF(DataInputStream.java:547)
>    at
> org.apache.cassandra.net.HeaderSerializer.deserialize(Header.java:160)
>    at
> org.apache.cassandra.net.HeaderSerializer.deserialize(Header.java:133)
>    at
> org.apache.cassandra.net.MessageSerializer.deserialize(Message.java:156)
>    at
> org.apache.cassandra.net.MessageSerializer.deserialize(Message.java:144)
>    at org.apache.cassandra.net.UdpConnection.read(UdpConnection.java:143)
>    at
> org.apache.cassandra.net.SelectorManager.doProcess(SelectorManager.java:149)
>    at
> org.apache.cassandra.net.SelectorManager.run(SelectorManager.java:107)
>
> Please suggest what could be the reason for this?
>
> Thanks,
>
> ~B
>


Re: thinking about dropping hinted handoff

2010-01-27 Thread Jonathan Ellis
On Wed, Jan 27, 2010 at 1:48 PM, Stu Hood  wrote:
>> The HH code currently tries to send the hints to nodes other than the
>> natural endpoints. If small-scale performance is a problem, we could
>> make the natural endpoints be responsible for the hints. This reduces
>> durability a bit, but might be a decent tradeoff.
> The other interesting benefit is that the hint would not need to store the 
> actual content of the change, since the natural endpoints will already be 
> storing copies. The hints would just need to encode the fact that a given 
> (key,name1[,name2]) changed.

Right, I think that's what Ryan was getting at.


Re: thinking about dropping hinted handoff

2010-01-27 Thread Jonathan Ellis
On Wed, Jan 27, 2010 at 1:12 PM, Ryan King  wrote:
> On Wed, Jan 27, 2010 at 8:34 AM, Jonathan Ellis  wrote:
>> Being able to write (with CL.ZERO or new-in-0.6 ANY) even if all
>> the real write targets are down is cool, but since your goal in real
>> life is to keep enough replicas alive that you can actually do reads,
>> I'm not sure how useful it is.  HH also has a measurable performance
>> problem in small clusters (that is, where cluster size is not much
>> larger than replication factor) since having a node go down means you
>> will increase the write load on the remaining nodes a non-negligible
>> amount to write the hints, which can be a nasty surprise if you
>> weren't planning for it.
>
> The HH code currently tries to send the hints to nodes other than the
> natural endpoints. If small-scale performance is a problem, we could
> make the natural endpoints be responsible for the hints. This reduces
> durability a bit, but might be a decent tradeoff.

That is a good idea, I think we should make that change if we want to keep HH.

-Jonathan


Re: Google SoC

2010-01-27 Thread Jonathan Ellis
I hadn't thought about that, but it's a great idea.

I imagine the ASF will be a qualified organization once again with no
further work necessary on our part in that area, so all we'd need to
do would be come up with projects of appropriate scope.

Any ideas there?

On Wed, Jan 27, 2010 at 10:51 AM, Krishna Sankar  wrote:
> Folks,
>        As many of you might have seen, the Google SoC 2010 is approaching[1]. 
> Would it be a good idea to start collecting a few ideas and explore SoC
> possibilities?
> 
> [1] 
> http://groups.google.com/group/google-summer-of-code-discuss/browse_thread/thread/d839c0b02ac15b3f


Re: cassandra : How to handle joins

2010-01-27 Thread Jonathan Ellis
Have you read http://arin.me/code/wtf-is-a-supercolumn-cassandra-data-model  ?

On Wed, Jan 27, 2010 at 10:29 AM, Mehar Chaitanya
 wrote:
> Hi Jonathan
>
> Thanks for your reply.
>
> I was wrong in my last posting asking about replication of changes.
>
> What I actually want to know is how Cassandra works in a distributed
> environment, and how we can migrate from MySQL to Cassandra.
>
> For the last 3 days I have been searching through blogs and archives about
> Cassandra, and I have configured Cassandra on my system.
>
> I want to create a schema where some users will register and update their
> blog posts, personal data, etc.
>
> Here I want to query Cassandra to get some huge data via joins.
>
> For this, how should I design my schema, tables, etc.?
>
> Awaiting your reply.
>
> --
> Thanks&Regards,
> Mehar Chaitanya Bandaru,
> Software Engineer,
> S cubes IT Solutions India Pvt. Ltd.,
> http://www.scubian.com
> (W) +91 4040307821,
> (Cell) +91 9440 999 262,
> #4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.
>


thinking about dropping hinted handoff

2010-01-27 Thread Jonathan Ellis
Being able to write (with CL.ZERO or new-in-0.6 ANY) even if all
the real write targets are down is cool, but since your goal in real
life is to keep enough replicas alive that you can actually do reads,
I'm not sure how useful it is.  HH also has a measurable performance
problem in small clusters (that is, where cluster size is not much
larger than replication factor) since having a node go down means you
will increase the write load on the remaining nodes a non-negligible
amount to write the hints, which can be a nasty surprise if you
weren't planning for it.
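The small-cluster overhead can be put in rough numbers. With N nodes and replication factor RF, a down node is a natural endpoint for roughly RF/N of all writes, and each such write generates a hint stored on one of the N-1 survivors. A back-of-the-envelope sketch (the uniform-write-distribution assumption is mine):

```python
def extra_write_load(n_nodes, rf):
    """Approximate fraction of extra writes per surviving node when one
    node is down and hints are written for it, assuming writes are
    uniformly distributed across the ring."""
    hinted_fraction = rf / n_nodes          # writes needing a hint
    return hinted_fraction / (n_nodes - 1)  # spread over survivors

# Small cluster: 4 nodes, RF=3 -> 25% extra writes per node.
print(round(extra_write_load(4, 3), 2))   # 0.25
# Larger cluster: 20 nodes, RF=3 -> under 1% extra.
print(round(extra_write_load(20, 3), 3))  # 0.008
```

This is why the problem is framed as one of cluster size relative to replication factor: the same RF that is negligible at twenty nodes is a 25% surprise at four.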

As for HH's consistency-improving characteristics, remember that HH is
not reliable (it's possible for a node to be down for several seconds
before HH gets turned on for it; it's also possible that the node with
the hints itself goes down before the target node comes back up),
which is why we needed the anti-entropy repair code.  So I think you
could make the case that now that we have anti-entropy, read repair
will be sufficient to handle inconsistency on "hot" keys, with
anti-entropy to handle infrequently accessed ones.  (Remembering of
course that if you wanted strong consistency in the first place, you
need to be doing quorum reads and writes and HH doesn't really matter.
 So we are talking about how to reduce inconsistency, when the client
has explicitly told us they're okay with seeing a little.)

Finally, I note that Cliff Moon, the author of Dynomite (probably the
most advanced pure Dynamo clone), deliberately left HH out for I
believe substantially these reasons.  (CC'd in case he wants to chime
in. :)

-Jonathan


Re: Hint storage format.

2010-01-27 Thread Jonathan Ellis
On Wed, Jan 27, 2010 at 10:03 AM, Gary Dusbabek  wrote:
> The context of this discussion comes from CASSANDRA-293.
>
> Since it relies on keys, the current hinted handoff scheme isn't going to
> work when a range-remove operation needs to be hinted for a downed
> node.  The idea I'm playing with now is to use a store-and-forward
> mechanism for hints where the entire message is stored and later sent
> to the destination when the host comes back up.

The main problem with this approach is that since we have one hint row
per keyspace, it will make the hint rows huge very quickly, since
you are potentially storing entire rows as a single column (the hint
message).  So this would definitely require CASSANDRA-16 (the
very-large-rows ticket).

Currently we are not fully leveraging our data model -- the HH columns
use the column name as hinted row key, w/ empty value.  If we used a
UUID for column name instead, we could have value be a tuple of (hint
type, hint value), i.e., either (KEY, key string) or (RANGEDELETE,
serialized deletion message).
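The proposed encoding -- a UUID for the column name, with a typed (hint type, hint value) tuple as the column value -- might look like this (a sketch; the serialization shown is illustrative, not the actual patch):

```python
import uuid

def make_key_hint(hinted_row_key):
    # Inverts the current scheme: a time-based UUID as the column
    # name, and the payload moved into the column value.
    return (uuid.uuid1().bytes, ("KEY", hinted_row_key))

def make_range_delete_hint(serialized_deletion):
    # Same shape, different hint type: the value carries the
    # serialized deletion message for replay.
    return (uuid.uuid1().bytes, ("RANGEDELETE", serialized_deletion))

name, (kind, payload) = make_key_hint("some-row-key")
print(kind, payload)  # KEY some-row-key
```

Because the column name is now a UUID rather than the hinted key, two different hint types can coexist in the same row without colliding, which is what makes range-delete hints representable at all.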

A third option would be to drop HH entirely.  I'll describe that in
more detail in another message.

-Jonathan


Re: Help regarding cassandra

2010-01-27 Thread Jonathan Ellis
Cassandra supports clusters spanning multiple data centers (see
RackAwareStrategy and contrib/property_snitch), but not replication
between distinct clusters.

On Wed, Jan 27, 2010 at 9:36 AM, Mehar Chaitanya
 wrote:
> Hi All
>
> I am done with installing Cassandra and inserting some data into a keyspace.
>
> My problem is with insertion and deletion in a distributed environment, and
> also with maintaining replicas.
>
> My requirement is:
>
> I have some data stored in keySpace1 on System A.
>
> I want to create a replica of keySpace1 on System B.
>
> Any updates or deletions done on one system should reflect on the other.
>
> How can I do this?
>
> Waiting for an early response. Thanks in advance.
>
> --
> Thanks&Regards,
> Mehar Chaitanya Bandaru,
> Software Engineer,
> S cubes IT Solutions India Pvt. Ltd.,
> http://www.scubian.com
> (W) +91 4040307821,
> (Cell) +91 9440 999 262,
> #4-1-319, 2nd Floor, Abids Road, Hyderabad - 01.
>


Re: error in apache-cassandra-incubating-0.5.0-src.tar.gz

2010-01-25 Thread Jonathan Ellis
1, 2: this is because you need to run ant to generate the thrift code
3: this is a warning, not an error

2010/1/25 Lu Ming :
> I downloaded apache-cassandra-incubating-0.5.0-src.tar.gz, imported the source
> files into Eclipse, and found three errors:
>
> 1) in org.apache.cassandra.cli.CliClient, lines 53, 56, etc.:
> CliParser cannot be resolved
> 2) in org.apache.cassandra.cli.CliCompiler, line 66:
> CliLexer and CliParser cannot be resolved
> 3) in org.apache.cassandra.service.StorageProxy, lines 542 and 543:
> Type mismatch: cannot convert from Object to DecoratedKey;
> this error is caused by Collections.max and Collections.min
>
>


Re: [VOTE] Release 0.5.0 (final)

2010-01-18 Thread Jonathan Ellis
No, but we will definitely take a look at it for 0.5.1.

0.5.0 will not be perfect but it is a huge improvement over 0.4.2,
which people are still using because that's the official "stable"
release.  We need to fix that. :)

-Jonathan

On Mon, Jan 18, 2010 at 6:39 PM, Ryan Daum  wrote:
> Any chance of https://issues.apache.org/jira/browse/CASSANDRA-713 getting
> looked at and closed before a 0.5.0 final?
>
> On Mon, Jan 18, 2010 at 7:32 PM, Pablo A. Delgado  wrote:
>
>> +1
>>
>> On Tue, Jan 19, 2010 at 1:19 AM, Jaakko  wrote:
>>
>> > +1
>> >
>> >
>> > On Tue, Jan 19, 2010 at 5:28 AM, Eric Evans 
>> wrote:
>> > >
>> > > There have been a few changes[1] in the 0.5 branch since RC3. In a
>> > perfect
>> > > world, we'd probably push those into another release candidate, but I
>> > > feel pretty good about this one, and any remaining issues can always
>> > > be added to 0.5.1.
>> > >
>> > > I propose the following tag and artifacts for 0.5.0:
>> > >
>> > > SVN Tag:
>> > >
>> >
>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0
>> > > 0.5.0 artifacts: http://people.apache.org/~eevans
>> > >
>> > > +1 from me.
>> > >
>> > >
>> > > [1]
>> > >
>> >
>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0/CHANGES.txt
>> > >
>> > > --
>> > > Eric Evans
>> > > eev...@rackspace.com
>> > >
>> > >
>> >
>>
>


Re: [VOTE] Release 0.5.0 (final)

2010-01-18 Thread Jonathan Ellis
+1

On Mon, Jan 18, 2010 at 2:28 PM, Eric Evans  wrote:
>
> There have been a few changes[1] in the 0.5 branch since RC3. In a perfect
> world, we'd probably push those into another release candidate, but I
> feel pretty good about this one, and any remaining issues can always
> be added to 0.5.1.
>
> I propose the following tag and artifacts for 0.5.0:
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0
> 0.5.0 artifacts: http://people.apache.org/~eevans
>
> +1 from me.
>
>
> [1]
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0/CHANGES.txt
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: the release after 0.5

2010-01-18 Thread Jonathan Ellis
I've moved the already-committed-as-0.9 issues to 0.6, and created a
new 0.7 version for tickets that do not fit the goal of a quick
release fully compatible with 0.5.

-Jonathan

On Fri, Jan 8, 2010 at 2:06 PM, Jonathan Ellis  wrote:
> In the month since 0.5 was branched, we've already made some
> significant progress, particularly in performance.  I can't find a way
> to easily link the full list in Jira, but these include
>
>  408+669 (mmapping sstables for reads instead of using buffered I/O):
> ~50% speed improvement
>  658 (better write concurrency): ~1000% improvement when cluster is
> in degraded state
>  675 (faster communication between nodes): ~100% improvement of reads
> and writes
>  678 (row level caching): up to 120% improvement of reads (workload dependent)
>
> We also have some other interesting tickets done:
>  336: add the ability to insert data to multiple rows at once -- like
> multiget, but for writes
>  535: expose StorageProxy to use as a fat client (with
> StorageService.initClient)
>  599: give some visibility of what Compaction is busy doing (can be a
> major source of "why is it slow?")
>
> We have a lot of other issues tagged 0.9
> (https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310865&fixfor=12314361),
> but the above issues and the others done so far are already useful
> enough to release, both because it helps our existing users of 0.5
> [note that none of these introduce compatibility issues of any sort,
> which makes upgrading especially easy], and because improving
> performance by that much makes us look better, which helps grow the
> community. :)
>
> On the other hand, creating and stabilizing and testing a new release
> (that is not bugfix-only) is a non-negligible amount of overhead, and
> I would give extra weight to Eric Evans's opinion here as release
> manager.
>
> -Jonathan
>


Re: multiget_slice

2010-01-14 Thread Jonathan Ellis
it sounds like you just don't have enough ram for the OS to cache your
"hot" data set so you are getting killed on disk seeks.  iostat -x 5
(for example) during load should verify this.
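The iostat check above can be eyeballed or scripted. A small sketch of the heuristic being applied in this thread -- near-saturated %util combined with long await times indicates a seek-bound disk (the thresholds echo the numbers reported earlier: %util over 90, await over 90ms):

```python
def seek_bound(util_pct, await_ms):
    """Heuristic from this thread: a near-saturated device (%util)
    with long per-request await times means reads are being killed
    by disk seeks rather than limited by throughput."""
    return util_pct > 90.0 and await_ms > 50.0

# Numbers like those reported earlier in the thread:
print(seek_bound(util_pct=97.5, await_ms=92.0))  # True
# A healthy, cache-served workload for comparison:
print(seek_bound(util_pct=20.0, await_ms=4.0))   # False
```

If the check comes back positive while vmstat shows plenty of memory used for cache, the hot set is simply larger than RAM, and the fix is more memory or more nodes, not tuning.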

On Thu, Jan 14, 2010 at 11:19 AM, Suhail Doshi  wrote:
> Looking at my data directory: 14 G. Just Index.db based files: 4.5 G.
>
> Yes only one node so far.
>
> vmstat -n 1 -S m
> procs ---memory-- ---swap-- -io -system--
> cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id
> wa
>  0  0     22    585     32   2557    0    0    71    31    5    5  1  0 94
>  3
>
> On Thu, Jan 14, 2010 at 10:11 AM, Jonathan Ellis  wrote:
>
>> how much data do you have on disk?  (only on one node?)  how large are
>> the columns you are reading?  how much ram does vmstat say is being
>> used for cache?
>>
>> On Thu, Jan 14, 2010 at 11:06 AM, Suhail Doshi 
>> wrote:
>> > Right now it's ~5-10 keys, with 5 columns per key.
>> >
>> > Later it will be 64 keys (max) with 200 columns per key worst case.
>> >
>> > Suhail
>> >
>> > On Thu, Jan 14, 2010 at 9:45 AM, Jonathan Ellis 
>> wrote:
>> >
>> >> how many keys are you fetching?  how many columns for each key?
>> >>
>> >> On Thu, Jan 14, 2010 at 1:49 AM, Suhail Doshi 
>> wrote:
>> >> > I've been seeing multiget_slice take an extremely long time:
>> >> >
>> >> > 2010-01-14 07:44:00,513 INFO -- Cassandra, delay:
>> >> > 3.64020800591 ---
>> >> > 2010-01-14 07:44:00,513 INFO method: multiget_slice
>> >> > 2010-01-14 07:44:00,513 INFO {'keys':
>> >> >
>> >>
>> [u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a93ec971e867b23664d990336ce481e0:7516fd43adaa5e0b8a65a672c39845d2',
>> >> >
>> >>
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a',
>> >> >
>> >>
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9',
>> >> >
>> >>
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:1240f61999709d41292f759e500ad5be:69691c7bdcc3ce6d5d8a1361f22d04ac',
>> >> >
>> >>
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a6d5b5c3d715b79b59caf7aed18301ac:b53b3a3d6ab90ce0268229151c9bde11'],
>> >> > 'column_parent': ColumnParent(column_family='DistinctIndex',
>> >> > super_column=None), 'predicate': SlicePredicate(column_names=None,
>> >> > slice_range=SliceRange(count=14000, start='date_2009-07-01',
>> >> reversed=False,
>> >> > finish='date_2010-01-14'))}
>> >> >
>> >> > 2010-01-14 07:44:00,513 INFO result:
>> >> >
>> >> > 2010-01-14 07:44:00,513 INFO
>> >> >
>> >>
>> {'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a':
>> >> > [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
>> >> > name='date_2010-01-11', value='1'), super_column=None),
>> >> > ColumnOrSuperColumn(column=Column(timestamp=126256,
>> >> > name='date_2010-01-12', value='1'), super_column=None),
>> >> > ColumnOrSuperColumn(column=Column(timestamp=1263418556,
>> >> > name='date_2010-01-13', value='1'), super_column=None),
>> >> > ColumnOrSuperColumn(column=Column(timestamp=1263451804,
>> >> > name='date_2010-01-14', value='1'), super_column=None)],
>> >> >
>> >>
>> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9':
>> >> > [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
>> >> > name='date_2010-01-11', value='1'), super_column=None),
>> >> > ColumnOrSuperColumn(column=Column(timestamp=126256,
>> >> > name='date_2010-01-12', value='1'), super_column=No

Re: multiget_slice

2010-01-14 Thread Jonathan Ellis
how much data do you have on disk?  (only on one node?)  how large are
the columns you are reading?  how much ram does vmstat say is being
used for cache?

On Thu, Jan 14, 2010 at 11:06 AM, Suhail Doshi  wrote:
> Right now it's ~5-10 keys, with 5 columns per key.
>
> Later it will be 64 keys (max) with 200 columns per key worst case.
>
> Suhail
>
> On Thu, Jan 14, 2010 at 9:45 AM, Jonathan Ellis  wrote:
>
>> how many keys are you fetching?  how many columns for each key?
>>
>> On Thu, Jan 14, 2010 at 1:49 AM, Suhail Doshi  wrote:
>> > I've been seeing multiget_slice take an extremely long time:
>> >
>> > 2010-01-14 07:44:00,513 INFO -- Cassandra, delay:
>> > 3.64020800591 ---
>> > 2010-01-14 07:44:00,513 INFO method: multiget_slice
>> > 2010-01-14 07:44:00,513 INFO {'keys':
>> >
>> [u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a93ec971e867b23664d990336ce481e0:7516fd43adaa5e0b8a65a672c39845d2',
>> >
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a',
>> >
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9',
>> >
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:1240f61999709d41292f759e500ad5be:69691c7bdcc3ce6d5d8a1361f22d04ac',
>> >
>> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a6d5b5c3d715b79b59caf7aed18301ac:b53b3a3d6ab90ce0268229151c9bde11'],
>> > 'column_parent': ColumnParent(column_family='DistinctIndex',
>> > super_column=None), 'predicate': SlicePredicate(column_names=None,
>> > slice_range=SliceRange(count=14000, start='date_2009-07-01',
>> reversed=False,
>> > finish='date_2010-01-14'))}
>> >
>> > 2010-01-14 07:44:00,513 INFO result:
>> >
>> > 2010-01-14 07:44:00,513 INFO
>> >
>> {'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a':
>> > [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
>> > name='date_2010-01-11', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=126256,
>> > name='date_2010-01-12', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263418556,
>> > name='date_2010-01-13', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263451804,
>> > name='date_2010-01-14', value='1'), super_column=None)],
>> >
>> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9':
>> > [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
>> > name='date_2010-01-11', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=126256,
>> > name='date_2010-01-12', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263418556,
>> > name='date_2010-01-13', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263451804,
>> > name='date_2010-01-14', value='1'), super_column=None)],
>> >
>> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a6d5b5c3d715b79b59caf7aed18301ac:b53b3a3d6ab90ce0268229151c9bde11':
>> > [ColumnOrSuperColumn(column=Column(timestamp=126256,
>> > name='date_2010-01-12', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263418556,
>> > name='date_2010-01-13', value='1'), super_column=None),
>> > ColumnOrSuperColumn(column=Column(timestamp=1263451804,
>> > name='date_2010-01-14', value='1'), super_column=None)],
>> >
>> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a93ec971e867b23664d990336ce481e0:7516fd43adaa5e0b8a65a672c39845d2':
>> > [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
>> > name='date_2010-01-11', value='1'), super_column=None),
>> > 

Re: multiget_slice

2010-01-14 Thread Jonathan Ellis
how many keys are you fetching?  how many columns for each key?

On Thu, Jan 14, 2010 at 1:49 AM, Suhail Doshi  wrote:
> I've been seeing multiget_slice take an extremely long time:
>
> 2010-01-14 07:44:00,513 INFO -- Cassandra, delay:
> 3.64020800591 ---
> 2010-01-14 07:44:00,513 INFO method: multiget_slice
> 2010-01-14 07:44:00,513 INFO {'keys':
> [u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a93ec971e867b23664d990336ce481e0:7516fd43adaa5e0b8a65a672c39845d2',
> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a',
> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9',
> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:1240f61999709d41292f759e500ad5be:69691c7bdcc3ce6d5d8a1361f22d04ac',
> u'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a6d5b5c3d715b79b59caf7aed18301ac:b53b3a3d6ab90ce0268229151c9bde11'],
> 'column_parent': ColumnParent(column_family='DistinctIndex',
> super_column=None), 'predicate': SlicePredicate(column_names=None,
> slice_range=SliceRange(count=14000, start='date_2009-07-01', reversed=False,
> finish='date_2010-01-14'))}
>
> 2010-01-14 07:44:00,513 INFO result:
>
> 2010-01-14 07:44:00,513 INFO
> {'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:fe33779b0db3213f7e354c8e22ad9939:4df200d45716195e86c09a94a54a0c7a':
> [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
> name='date_2010-01-11', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=126256,
> name='date_2010-01-12', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263418556,
> name='date_2010-01-13', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263451804,
> name='date_2010-01-14', value='1'), super_column=None)],
> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:71860c77c6745379b0d44304d66b6a13:e37f0136aa3ffaf149b351f6a4c948e9':
> [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
> name='date_2010-01-11', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=126256,
> name='date_2010-01-12', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263418556,
> name='date_2010-01-13', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263451804,
> name='date_2010-01-14', value='1'), super_column=None)],
> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a6d5b5c3d715b79b59caf7aed18301ac:b53b3a3d6ab90ce0268229151c9bde11':
> [ColumnOrSuperColumn(column=Column(timestamp=126256,
> name='date_2010-01-12', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263418556,
> name='date_2010-01-13', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263451804,
> name='date_2010-01-14', value='1'), super_column=None)],
> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:a93ec971e867b23664d990336ce481e0:7516fd43adaa5e0b8a65a672c39845d2':
> [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
> name='date_2010-01-11', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=126256,
> name='date_2010-01-12', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263418556,
> name='date_2010-01-13', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263451804,
> name='date_2010-01-14', value='1'), super_column=None)],
> 'property:1558:1f0351b7f85b4aa070548e5fd5e08ddf:fce1eab4411d5df240d93ff334f15385:1240f61999709d41292f759e500ad5be:69691c7bdcc3ce6d5d8a1361f22d04ac':
> [ColumnOrSuperColumn(column=Column(timestamp=1263231323,
> name='date_2010-01-11', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=126256,
> name='date_2010-01-12', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263418556,
> name='date_2010-01-13', value='1'), super_column=None),
> ColumnOrSuperColumn(column=Column(timestamp=1263451804,
> name='date_2010-01-14', value='1'), super_column=None)]}
>
>
> The delay is the time it took to run the query and return a result. The
> box has 4GB of RAM and the *JVM_MAX_MEM (-Xmx) is set at 3G*. If
> you're curious how I am running it:
>
> /usr/bin/jsvc -home /usr/lib/jvm/java-6-openjdk/jre -pidfile
> /var/run/cassandra.pid -errfile &1 -outfile /var/log/cassandra/output.log
> -cp
> /usr/share/cassandra/antlr-3.1.3.jar:/usr/share/cassandra/apache-cassandra-incubating-0.5.0-rc1.jar:/usr/share/cassandra/apache-cassandra-incubating.jar:/usr/share/cassandra/clhm-production.jar:/usr/share/cassandra/commo

Re: release policy

2010-01-11 Thread Jonathan Ellis
On Mon, Jan 11, 2010 at 12:03 PM, Ryan King  wrote:
> Both of the above are fine. I think we could even tolerate having to
> run an upgrade tool with a node, as long as we can do it one at a time
> and as long as...
>
>>  (3) network compatibility (from one cluster node to another): may
>> change.  If it does, we will notify you in NEWS.txt that you need to
>> upgrade the whole cluster at once, as was the case for 0.4 -> 0.5.
>
> ...the network compat doesn't change at the same time. If both the
> disk format and network protocol change in the same release, we can't
> easily do a rolling restart/upgrade.
>
> In general, doing full cluster upgrades at once is going to be
> prohibitively difficult for us. In addition to the disruption it would
> cause for clients, we wouldn't want to throw away all of our
> in-process caches at once.

I'd be comfortable committing to not breaking net compatibility in the
same release as one that needs an upgrade tool to run on the data.  I
don't think we can freeze the network protocol entirely yet.

> If we want the flexibility of changing the internal network protocol,
> we should move towards an rpc framework that can tolerate upgrades.

Thrift is supposed to tolerate upgrades... *cough* :)

-Jonathan


the release after 0.5

2010-01-08 Thread Jonathan Ellis
In the month since 0.5 was branched, we've already made some
significant progress, particularly in performance.  I can't find a way
to easily link the full list in Jira, but these include

  408+669 (mmapping sstables for reads instead of using buffered I/O):
~50% speed improvement
  658 (better write concurrency): ~1000% improvement when cluster is
in degraded state
  675 (faster communication between nodes): ~100% improvement of reads
and writes
  678 (row level caching): up to 120% improvement of reads (workload dependent)

We also have some other interesting tickets done:
  336: add the ability to insert data to multiple rows at once -- like
multiget, but for writes
  535: expose StorageProxy to use as a fat client (with
StorageService.initClient)
  599: give some visibility of what Compaction is busy doing (can be a
major source of "why is it slow?")

We have a lot of other issues tagged 0.9
(https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&mode=hide&sorter/order=DESC&sorter/field=priority&resolution=-1&pid=12310865&fixfor=12314361),
but the above issues and the others done so far are already useful
enough to release, both because it helps our existing users of 0.5
[note that none of these introduce compatibility issues of any sort,
which makes upgrading especially easy], and because improving
performance by that much makes us look better, which helps grow the
community. :)

On the other hand, creating and stabilizing and testing a new release
(that is not bugfix-only) is a non-negligible amount of overhead, and
I would give extra weight to Eric Evans's opinion here as release
manager.

-Jonathan


release policy

2010-01-08 Thread Jonathan Ellis
I think we have enough people using Cassandra in production now that
it would be useful to be explicit about the kinds of changes we will
make between major and minor releases.  Here is one possibility:

Minor releases (e.g. 0.4.0 -> 0.4.1): minor releases will contain bug
fixes only -- no new functionality, full disk and network-level
compatibility.  If bugs cannot be fixed without adding new
functionality or breaking compatibility, then it should probably not
be fixed in the stable branch unless it is very severe.  This is
similar to postgresql's minor release policy, and I really like it for
two reasons:
  (1) it makes a clear line of what people can expect in minor releases
  (2) making changes breaks stuff.  That's just how software is, so
the less we change in a stable branch the more likely it is that we
won't introduce regressions.  With our community's most enthusiastic
testers also the most interested in trunk rather than older releases,
it behooves us to be careful.

Major releases (e.g. 0.5 -> 0.6): the compatibility story is going to
be more nuanced:
  (1) sstable compatibility: may change.  If it does, we will notify
you in NEWS.txt and provide some method of upgrading without
dump-and-reload (and hopefully without downtime while a conversion
tool grinds away from old to new format, but I don't think we can
promise that with 100% certainty at this point).
  (2) commitlog compatibility: may change.  If it does, we will notify
you in NEWS.txt that you need to "nodeprobe flush" before upgrading,
as was the case for 0.4 -> 0.5.
  (3) network compatibility (from one cluster node to another): may
change.  If it does, we will notify you in NEWS.txt that you need to
upgrade the whole cluster at once, as was the case for 0.4 -> 0.5.
  (4) thrift api: We will make our best efforts to keep deprecated
methods available for one major release before removing them.
(e.g., get_key_range is deprecated in 0.5; it will be removed in the
next).  I do not anticipate having to make the kind of changes we made
from 0.3 to 0.4 (where we redid basically all the thrift structures,
and there was really no sane way to provide backwards compatibility),
but if we do we will notify you in NEWS.txt.
  (5) StorageProxy "fat client" api: This is still considered "mostly
an internal api," to be used if you are brave and need the extra
performance badly enough. :)  We probably won't render it
unrecognizably different since it's a fairly one-to-one mapping to the
thrift api, which has the above compatibility policy, but no promises.

Thoughts?  (Separate threads to follow re "the next major release
after 0.5 specifically," and "1.0.")

-Jonathan


Re: API versioning

2010-01-06 Thread Jonathan Ellis
On Wed, Jan 6, 2010 at 2:47 PM, Eric Evans  wrote:
> On Wed, 2010-01-06 at 14:29 -0600, Jonathan Ellis wrote:
>> The 0.5 api is a superset of the 0.4 one in method names and
>> arguments, but the exceptions declared are different, so client code
>> in compiled languages with checked exceptions (only Java?) probably
>> needed some light editing to upgrade.
>
> That still breaks existing code though. Would you not bump the major for
> such a case?

I guess that is fine, although it still feels like this introduces
confusion for people whose code is NOT affected (i.e. most non-java
languages).

I guess ultimately there is no substitute for reading release notes.

-Jonathan


Re: API versioning

2010-01-06 Thread Jonathan Ellis
+1, the release version is only tenuously related to the API version
and tracking the latter separately would be much more useful to
clients for the reasons you gave.

One question: do we need a 3-tuple?

The 0.5 api is a superset of the 0.4 one in method names and
arguments, but the exceptions declared are different, so client code
in compiled languages with checked exceptions (only Java?) probably
needed some light editing to upgrade.

I'm also happy to just ignore that until someone actually complains though. :)

-Jonathan

On Wed, Jan 6, 2010 at 2:23 PM, Eric Evans  wrote:
>
> I'd like to propose a change to the way we version our API.
>
> Currently, we publish a version string via the thrift method
> get_string_property("version"). This version string always moves in
> lock-step with the current release, i.e. 0.4.0-beta2, 0.5.0-rc3, etc.
>
> There is no useful correlation that can be made like this. If a client
> API worked with 0.5.0-beta1, it might or might not work with
> 0.5.0-beta2. I think we can do better.
>
> I propose that we return a string composed of joining two integers with
> a ".", where the integers represent a major and minor respectively. The
> rules for incrementing these would be simple:
>
> 1. If it is absolutely breaking, then the major is incremented by one.
> For example, changing the number or disposition of required arguments.
>
> 2. If it will result in an API that is backward-compatible with the
> previous version, then the minor is incremented. For example, if a new
> method is added.
>
> This would provide client API authors the tools necessary to ensure
> compatibility at runtime, and to better manage the life-cycle of their
> projects.
>
> What does everyone think?
>
>
> --
> Eric Evans
> eev...@rackspace.com
>
>
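
The two-integer rule proposed above translates into a simple client-side
compatibility check. Here is a minimal sketch of that idea (class and method
names are hypothetical, not from Cassandra or any real client library):

```java
public class ApiVersionCheck {
    /**
     * Applies the proposed rules: a major bump is breaking, so the majors
     * must match exactly; a minor bump is backward-compatible, so the
     * server's minor must be at least the one the client was written
     * against.
     */
    static boolean isCompatible(String serverVersion, int clientMajor, int clientMinor) {
        String[] parts = serverVersion.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = Integer.parseInt(parts[1]);
        return major == clientMajor && minor >= clientMinor;
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("2.3", 2, 1)); // newer minor: compatible (true)
        System.out.println(isCompatible("3.0", 2, 1)); // major bump: breaking (false)
        System.out.println(isCompatible("2.0", 2, 1)); // server older than client needs (false)
    }
}
```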


Re: [VOTE] Release 0.5.0-rc3

2010-01-06 Thread Jonathan Ellis
+1

On Wed, Jan 6, 2010 at 9:49 AM, Eric Evans  wrote:
>
> Ok let's try this again, this time with the actual fix for #663. :)
>
> I propose the following tag and artifacts for 0.5.0-rc3:
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc3
> 0.5.0-rc3 artifacts: http://people.apache.org/~eevans
>
> +1 from me.
>
>
> [1]
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc3/CHANGES.txt
>
> --
> Eric Evans
> eev...@rackspace.com
>
>
>
>
>


Re: [VOTE] Release 0.5.0-rc2

2010-01-05 Thread Jonathan Ellis
oops, -1: I committed the wrong version of the fix for 663.  Correct
fix has been committed to 0.5 branch now.

-Jonathan

On Tue, Jan 5, 2010 at 4:22 PM, Jonathan Ellis  wrote:
> +1
>
> On Tue, Jan 5, 2010 at 4:07 PM, Eric Evans  wrote:
>>
>> There were some issues in rc1 that warrant us taking another stab at
>> this, (see [1]).
>>
>> I propose the following tag and artifacts for 0.5.0-rc2:
>>
>> SVN Tag:
>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc2
>> 0.5.0-rc2 artifacts: http://people.apache.org/~eevans
>>
>> +1 from me.
>>
>>
>> [1]
>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc2/CHANGES.txt
>>
>> --
>> Eric Evans
>> eev...@rackspace.com
>>
>>
>


Re: Chair

2010-01-05 Thread Jonathan Ellis
I can volunteer.

On Tue, Jan 5, 2010 at 4:27 PM, Eric Evans  wrote:
>
> When we graduate, we need to have someone who will act as the Chair of
> the Project Management Committee, or "Chair". The Chair reports to the
> ASF Board on behalf of the project, maintains information on the PMC
> disposition, provides write access to new committers, etc. So besides
> sounding kind of prestigious, the Chair is basically a glorified
> administrative assistant. :)
>
> http://www.apache.org/dev/pmc.html#chair
>
> My understanding is that the Chair a) needs to be on the PMC[1], and b)
> must volunteer for the job. So, do we have any volunteers? Jonathan
> (hint, hint :)?
>
>
> [1] http://incubator.apache.org/projects/cassandra.html#Project+info
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: [VOTE] Release 0.5.0-rc2

2010-01-05 Thread Jonathan Ellis
+1

On Tue, Jan 5, 2010 at 4:07 PM, Eric Evans  wrote:
>
> There were some issues in rc1 that warrant us taking another stab at
> this, (see [1]).
>
> I propose the following tag and artifacts for 0.5.0-rc2:
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc2
> 0.5.0-rc2 artifacts: http://people.apache.org/~eevans
>
> +1 from me.
>
>
> [1]
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc2/CHANGES.txt
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: Memtable Performance Problem

2009-12-30 Thread Jonathan Ellis
I went one better, all the way to column-level synchronization. :)
Patches attached to
https://issues.apache.org/jira/browse/CASSANDRA-658.

review / testing appreciated.

-Jonathan

On Fri, Dec 11, 2009 at 3:53 AM, Stu Hood  wrote:
> Seems like we could just synchronize on the 'oldCf' object, since it can't be 
> replaced once it is in the Memtable. Much higher granularity.


Welcome new committers Gary Dusbabek and Jaakko Laine!

2009-12-30 Thread Jonathan Ellis
Thanks for the help, guys!

-Jonathan


Re: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and sometimes kills traffic to a node if you restart it.

2009-12-28 Thread Jonathan Ellis
If you want this to be part of the jira record, you need to add it as
a comment on the issue; jira is not configured to turn emails into
comments automatically.

On Sun, Dec 27, 2009 at 11:07 PM, Michael Lee
 wrote:
> I confirmed this issue with the following tests.
> Suppose a cluster containing 8 nodes, with about 1 rows (key
> range from 1 to 1):
> Address       Status     Load          Range                                  
>     Ring
>                                       170141183460469231731687303715884105728
> 10.237.4.85   Up         757.13 MB     21267647932558653966460912964485513216 
>     |<--|
> 10.237.1.135  Up         761.54 MB     42535295865117307932921825928971026432 
>     |   ^
> 10.237.1.137  Up         748.02 MB     63802943797675961899382738893456539648 
>     v   |
> 10.237.1.139  Up         732.36 MB     85070591730234615865843651857942052864 
>     |   ^
> 10.237.1.140  Up         725.6 MB      
> 106338239662793269832304564822427566080    v   |
> 10.237.1.141  Up         726.59 MB     
> 127605887595351923798765477786913079296    |   ^
> 10.237.1.143  Up         728.16 MB     
> 148873535527910577765226390751398592512    v   |
> 10.237.1.144  Up         745.69 MB     
> 170141183460469231731687303715884105728    |-->|
>
> (1)     Read keys range [1-1], all keys read out ok ( client send read 
> request directly to 10.237.4.85, 10.237.1.137, 10.237.1.140, 10.237.1.143 )
> (2)     Turn off 10.237.1.135 while keeping the pressure on; some read
> requests will time out.
> After all nodes know 10.237.1.135 is down (about 10 s later), all read
> requests become ok again. That's fine.
> (3)     After turning 10.237.1.135 (and the cassandra service) back on,
> some read requests will time out again, and this remains FOREVER even
> after all nodes know 10.237.1.135 is up.
> That's a PROBLEM!
> (4)     Reboot 10.237.1.135; the problem remains.
> (5)     If we stop the pressure and reboot the whole cluster, then perform
> step 1, everything is fine again.
>
> All read requests use the Quorum policy. The version of Cassandra is
> apache-cassandra-incubating-0.5.0-beta2, and I have also tested
> apache-cassandra-incubating-0.5.0-RC1; the problem remains.
>
> After reading system.log, I found that after 10.237.1.135 goes down and
> comes up again, the other nodes never re-establish a tcp connection to it
> (on tcp port 7000)! And read requests sent to 10.237.1.135 (placed into
> Pending-Writes because the socket channel is closed) are never sent to the
> network (observed with tcpdump).
>
> It seems that when 10.237.1.135 goes down in step 2, some socket channels
> are reset; after 10.237.1.135 comes back, these socket channels remain
> closed, forever
> -END--
>
>
> -Original Message-
> From: Jonathan Ellis (JIRA) [mailto:j...@apache.org]
> Sent: Thursday, December 24, 2009 10:47 AM
> To: cassandra-comm...@incubator.apache.org
> Subject: [jira] Updated: (CASSANDRA-651) cassandra 0.5 version throttles and 
> sometimes kills traffic to a node if you restart it.
>
>
>     [ 
> https://issues.apache.org/jira/browse/CASSANDRA-651?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>  ]
>
> Jonathan Ellis updated CASSANDRA-651:
> -
>
>    Fix Version/s: 0.5
>         Assignee: Jaakko Laine
>
>> cassandra 0.5 version throttles and sometimes kills traffic to a node if you 
>> restart it.
>> 
>>
>>                 Key: CASSANDRA-651
>>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-651
>>             Project: Cassandra
>>          Issue Type: Bug
>>          Components: Core
>>    Affects Versions: 0.5
>>         Environment: latest in 0.5 branch
>>            Reporter: Ramzi Rabah
>>            Assignee: Jaakko Laine
>>             Fix For: 0.5
>>
>>
>> From the cassandra user message board:
>> "I just recently upgraded to latest in 0.5 branch, and I am running
>> into a serious issue. I have a cluster with 4 nodes, rackunaware
>> strategy, and using my own tokens distributed evenly over the hash
>> space. I am writing/reading equally to them at an equal rate of about
>> 230 reads/writes per second(and cfstats shows that). The first 3 nodes
>> are seeds, the last one isn't. When I start all the nodes together at
>> the same time, they all receive equal amounts of reads/writes (about
>> 230).
>> When I bring node 4 down and bring it back up again, node 4's load
>> fluctuates between the 230 it used to get to sometimes no traffic at
>> all. The other 3 still have the same amount of traffic. And no errors
>> what so ever seen in logs. "
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>


Re: [VOTE] Release 0.5.0-rc1

2009-12-23 Thread Jonathan Ellis
I have no problem addressing this during the RC period.

On Wed, Dec 23, 2009 at 7:11 PM, Ramzi Rabah  wrote:
> Did you guys see https://issues.apache.org/jira/browse/CASSANDRA-651.
> That looks like a show stopper to me, when you can't restart a node.
>
> On Wed, Dec 23, 2009 at 1:43 PM, Chris Goffinet  
> wrote:
>> +1
>>
>> On Wed, Dec 23, 2009 at 4:36 PM, Jonathan Ellis  wrote:
>>
>>> +1
>>>
>>> On Wed, Dec 23, 2009 at 3:28 PM, Eric Evans  wrote:
>>> >
>>> > All of the 0.5 showstoppers are out of the way and things are looking
>>> > pretty solid. Shall we push out a release candidate?
>>> >
>>> > I propose the following tag and artifacts for 0.5.0-rc1
>>> >
>>> > SVN Tag:
>>> >
>>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc1
>>> > 0.5.0-rc1 artifacts: 
> >>> > http://people.apache.org/~eevans
>>> >
>>> > If it's not obvious, +1 from me. :)
>>> >
>>> > [1]
>>> >
>>> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc1/CHANGES.txt
>>> >
>>> > --
>>> > Eric Evans
>>> > eev...@rackspace.com
>>> >
>>> >
>>>
>>
>


Re: [VOTE] Release 0.5.0-rc1

2009-12-23 Thread Jonathan Ellis
+1

On Wed, Dec 23, 2009 at 3:28 PM, Eric Evans  wrote:
>
> All of the 0.5 showstoppers are out of the way and things are looking
> pretty solid. Shall we push out a release candidate?
>
> I propose the following tag and artifacts for 0.5.0-rc1
>
> SVN Tag:
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc1
> 0.5.0-rc1 artifacts: http://people.apache.org/~eevans
>
> If it's not obvious, +1 from me. :)
>
> [1]
> https://svn.apache.org/repos/asf/incubator/cassandra/tags/cassandra-0.5.0-rc1/CHANGES.txt
>
> --
> Eric Evans
> eev...@rackspace.com
>
>


Re: build fails with "ant clean gen-thrift-java build"

2009-12-22 Thread Jonathan Ellis
2009/12/22 Ted Zlatanov :
> But would you (as Gary IIRC mentioned
> earlier) prefer the old constructors back instead to minimize changes to
> Cassandra?

In a perfect world, Thrift wouldn't go breaking stuff that wasn't
causing problems, or if they did they would admit it and roll things
back.  It's already too late for the first, and I don't give the
second good odds either.

But this is not something to burn a lot of cycles on.  If your heart
is set on using thrift 0.2, then patch Cassandra so it works.
Otherwise just build the new structures with the old thrift compiler,
and keep using that version in the libthrift filename.

-Jonathan


Re: build fails with "ant clean gen-thrift-java build"

2009-12-22 Thread Jonathan Ellis
2009/12/22 Ted Zlatanov :
> Looks like this is not getting changed and Cassandra must cope with
> Thrift's new constructors instead.  Will the updated code make it into
> SVN so I can do my auth patch against it?

As soon as such a patch is contributed, sure.

-Jonathan


Re: Small HintedHanoffManager improvement?

2009-12-22 Thread Jonathan Ellis
On Mon, Dec 21, 2009 at 12:46 PM, Ramzi Rabah  wrote:
> It seems that we still send a row mutation even if the cf of the row
> is null.

It should never happen, and if it does it's harmless, so adding a
special case for it is counterproductive.

> On a different note, I have a few questions about HHOM design.
> HintedHandOffManager seems to send the whole CF for a key, even if the
> only thing that changed was a value in 1 single column.

Only the changed parts are written to hint nodes in the first place.

> And one last question about HHOM, since the node that is handling the
> Hint might not be (and is probably not) one of the replicas if I
> understood the code correctly, Will the data written to it ever be
> cleaned, if I issue a delete later on down the line?

Hinted data is removed by cleanup operations.

-Jonathan


Re: Memtable Performance Problem

2009-12-10 Thread Jonathan Ellis
even with CHM, resolve is not threadsafe and needs to be synchronized.
removing the synchronized could cause data loss.  don't do that. :)

On Thu, Dec 10, 2009 at 9:42 PM, 张洁  wrote:
> While running a write stress test, I found that the throughput is not
> stable, ranging from 200 requests/s to 12000 requests/s with a single
> thread writing continuously. Finally I found the problem is the
> NonBlockHashMap in Memtable: when I use ConcurrentHashMap instead of
> NonBlockHashMap and remove the synchronized block below, the stress
> test is stable.
>
> /resolve function in memtable
> synchronized (keyLocks[Math.abs(key.hashCode() % keyLocks.length)])
>     {
>     int oldSize = oldCf.size();
>     int oldObjectCount = oldCf.getColumnCount();
>     oldCf.resolve(columnFamily);
>     int newSize = oldCf.size();
>     int newObjectCount = oldCf.getColumnCount();
>     resolveSize(oldSize, newSize);
>     resolveCount(oldObjectCount, newObjectCount);
>     }
>
>
> below is my test code,my test server has 2*4=8CPU and 32G Memory
>
> // Decompiled by Jad v1.5.8e. Copyright 2001 Pavel Kouznetsov.
> // Jad home page: http://www.geocities.com/kpdus/jad.html
> // Decompiler options: packimports(3)
> // Source File Name:   RowApplyTest.java
>
> import java.io.IOException;
> import java.io.PrintStream;
> import java.util.concurrent.atomic.AtomicLong;
> import org.apache.cassandra.db.*;
>
> public class RowApplyTest
> {
>
>     public RowApplyTest()
>     {
>     }
>
>     public static Column column(String name, String value, long timestamp)
>     {
>     return new Column(name.getBytes(), value.getBytes(), timestamp);
>     }
>
>     private static void printer()
>     {
>     Thread t = new Thread(new Runnable() {
>
>     public void run()
>     {
>     do
>     {
>     long current = RowApplyTest._count.get();
>     System.out.println((new StringBuilder("Rate: ")).append(current - _last).append(" req/s").toString());
>     _last = current;
>     try
>     {
>     Thread.sleep(1000L);
>     }
>     catch(InterruptedException e)
>     {
>     e.printStackTrace();
>     }
>     } while(true);
>     }
>
>     private long _last =0L;
>     });
>     t.start();
>     }
>
>     public static void main(String args[])
>     throws IOException
>     {
>     printer();
>     Table table = Table.open("Keyspace1");
>     ColumnFamilyStore cfStore = table.getColumnFamilyStore("Standard1");
>     String value = "Agile
> testing(\u654F\u6377\u6D4B\u8BD5)\u57FA\u672C\u4E0A\u662F\u4F34\u968F\u7740\u654F\u6377\u5F00\u53D1\u7684\u6982\u5FF5\u6210\u957F\u8D77\u6765\u7684\uFF0C\u4F46\u5728\u53D7\u5173\u6CE8\u7A0B\u5EA6\u4E0A\uFF0C\u8FDC\u8FDC\u4E0D\u53CA\u654F\u6377\u5F00\u53D1\u672C\u8EAB\u3002\u81EA\u7136\uFF0C\u5F00\u53D1\u961F\u4F0D\u4ECE\u6570\u91CF\u548C\u6D3B\u8DC3\u5EA6\u4E0A\u6765\u8BB2\u5927\u4E8E\u6D4B\u8BD5\u961F\u4F0D\uFF0C\u662F\u5176\u4E2D\u7684\u4E00\u4E2A\u539F\u56E0\uFF1B\u9664\u4E86\u8FD9\u4E2A\u539F\u56E0\u4E4B\u5916\uFF0C\u201C\u654F\u6377\u6D4B\u8BD5\u7A76\u7ADF\u5982\u4F55\u5728\u9879\u76EE\u4E2D\u53D1\u6325\u4F5C\u7528\u201D\u8FD9\u4E2A\u95EE\u9898\u53EF\u80FD\u4E5F\u662F\u5BFC\u81F4\u654F\u6377\u6D4B\u8BD5\u6982\u5FF5\u7684\u6D41\u884C\u5EA6\u8FDC\u8FDC\u4E0D\u5982\u654F\u6377\u5F00\u53D1\u7684\u539F\u56E0\u4E4B\u4E00\u3002\u5173\u4E8E\u654F\u6377\u6D4B\u8BD5\uFF0C\u6211\u80FD\u627E\u5230\u7684\u8F83\u65E9\u7684...";
>     do
>     {
>     long i = _count.incrementAndGet();
>     String key = (new StringBuilder("test")).append(i).toString();
>     RowMutation rm = new RowMutation("Keyspace1", key);
>     ColumnFamily cf = ColumnFamily.create("Keyspace1", "Standard1");
>     cf.addColumn(column("name", value, 1L));
>     rm.add(cf);
>     rm.apply();
>     } while(true);
>     }
>
>     private static AtomicLong _count = new AtomicLong(0L);
>
>
> }
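
The thread-safety point above is worth spelling out: a ConcurrentHashMap makes
each individual operation atomic, but a compound read-modify-write like
Memtable's resolve() is not, so dropping the synchronized block can silently
lose updates. A minimal sketch of that failure mode (hypothetical demo class,
not Cassandra code):

```java
import java.util.concurrent.ConcurrentHashMap;

public class LostUpdateDemo {
    // Two threads each increment the same key n times. The unsafe variant
    // does a separate get() and put(), which can interleave and lose updates
    // even though each individual CHM operation is thread-safe. The safe
    // variant uses compute(), which performs the read-modify-write atomically
    // per key -- analogous to holding a per-key lock around resolve().
    static int run(boolean atomic, int n) {
        ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<>();
        map.put("k", 0);
        Runnable task = () -> {
            for (int i = 0; i < n; i++) {
                if (atomic) {
                    map.compute("k", (key, v) -> v + 1); // read+write as one atomic step
                } else {
                    map.put("k", map.get("k") + 1);      // another thread can run between get and put
                }
            }
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return map.get("k");
    }

    public static void main(String[] args) {
        System.out.println("atomic total: " + run(true, 100_000));  // always 200000
        System.out.println("unsafe total: " + run(false, 100_000)); // typically less than 200000
    }
}
```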


Re: dead code: org.apache.cassandra.net.sink.SinkManager

2009-12-09 Thread Jonathan Ellis
The authors of that class said (in a presentation?) that it's used to
introduce artificial errors for testing.  So I'm in no hurry to delete
it.

On Wed, Dec 9, 2009 at 8:53 PM, Kelvin Kakugawa  wrote:
> I've been looking around the codebase, and I was wondering if these
> classes are dead code:
> org.apache.cassandra.net.sink.IMessageSink
> org.apache.cassandra.net.sink.SinkManager
>
> There are a couple places where SinkManager is used, however,
> addMessageSink() is never called.  So, the calls to
> process*MessageSink() appear to be no-ops.
>
> -Kelvin
>


Re: [VOTE] Release 0.5.0-beta2

2009-12-09 Thread Jonathan Ellis
+1


Re: Questions about weak reads

2009-12-09 Thread Jonathan Ellis
On Wed, Dec 9, 2009 at 11:01 AM, Sylvain Lebresne  wrote:
> And wouldn't it be possible/reasonable to do something like a strong read, but
> with a modified quorumResponseHandler that return from get() as soon as it
> gets an answer and do the responseResolver/read repair in the background.
> Wouldn't it speed up some reads when a node other that the suitableEndpoint
> has less latency ?

That's reasonable, but of course the penalty is increased network
traffic since you are sending full responses from each replica instead
of just digests.

-Jonathan
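
The traffic trade-off described here comes from digest reads: in a quorum
read, one replica returns the full row while the others return only a hash of
it, and the coordinator compares the two. A minimal sketch of the idea
(illustrative only; not Cassandra's actual read path or wire format):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

public class DigestReadSketch {
    static byte[] digest(byte[] row) {
        try {
            return MessageDigest.getInstance("MD5").digest(row);
        } catch (NoSuchAlgorithmException e) {
            throw new RuntimeException(e); // MD5 is always available in the JDK
        }
    }

    public static void main(String[] args) {
        byte[] fullResponse = "row-contents".getBytes();          // full data from one replica
        byte[] otherReplica = digest("row-contents".getBytes());  // only a hash from another
        // The coordinator hashes the full response and compares; a mismatch
        // would trigger read repair instead of returning stale data.
        boolean consistent = Arrays.equals(digest(fullResponse), otherReplica);
        System.out.println(consistent ? "digests match" : "mismatch: read repair needed");
    }
}
```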


Re: Questions about weak reads

2009-12-09 Thread Jonathan Ellis
On Wed, Dec 9, 2009 at 10:33 AM, Sylvain Lebresne  wrote:
> Well, I just checkout from svn
> (svn checkout https://svn.apache.org/repos/asf/incubator/cassandra/trunk
> cassandra)

Thanks, updated comments.

> In any case, at least for readRemote, why when the "suitableEndpoint"
> timeout another
> node is not tried ?

Because
 (1) clients need to be able to handle failures either way, so better
to bail early since
 (2) each retry costs potentially RPC_TIMEOUT ms

-Jonathan


Re: Questions about weak reads

2009-12-09 Thread Jonathan Ellis
On Wed, Dec 9, 2009 at 9:37 AM, Sylvain Lebresne  wrote:
> But otherwise, the discrepancy between code and comments suggests that the
> code was changed. If so, what was the rational behind the change ?

I'm guessing you're reading the 0.4 source?  This has been cleaned up
in trunk.  At least I'm pretty sure it has, because I remember the
comments you're referring to, and I don't see them anymore. :)

-Jonathan


Re: what are utils.Fast* for?

2009-12-05 Thread Jonathan Ellis
removed

On Sat, Dec 5, 2009 at 1:55 PM, gabriele renzi  wrote:
> On Sat, Dec 5, 2009 at 3:23 PM, Jonathan Ellis  wrote:
>> Dead code.  Deleted.
>
> thanks, I love when this happens :)
> I believe the same could be done for PrimeFinder.java
>
>> BTW, if you're looking through the codebase you should probably start
>> here: http://wiki.apache.org/cassandra/ArchitectureInternals
>
> thanks, I had been pointed to that on IRC, and it is indeed useful.
>


Re: what are utils.Fast* for?

2009-12-05 Thread Jonathan Ellis
Dead code.  Deleted.

BTW, if you're looking through the codebase you should probably start
here: http://wiki.apache.org/cassandra/ArchitectureInternals

On Sat, Dec 5, 2009 at 5:14 AM, gabriele renzi  wrote:
> Hi everyone,
>
> while trying to get a grasp of the codebase in cassandra, I run
> FindBugs on it and got a complaint about this code in
> org.apache.cassandra.utils.FastHashMap.containsValue(Object)
>
> """
>        // special case null values so that we don't have to
>        // perform null checks before every call to equals()
>        if (null == val)
>        {
>            for (int i = vals.length; i-- > 0;)
>            {
>                if ((set[i] != FREE && set[i] != REMOVED) && val == vals[i])
>                {
>                    return true;
>                }
>            }
>        }
> """
>
> as Findbugs would prefer the last condition to be spelled as
>  null == vals[i]
>
> I wanted to look around to check this was not a bug, but then I
> wondered what is this class for. Apparently all of FastHashMap,
> FastObjechHash, FastLinkedHashMap, and FastHash are never used
> anywhere in the code:
>
> $ ack 'Fast(Linked|Hash|Object)'  -l
> src/java/org/apache/cassandra/utils/FastHash.java
> src/java/org/apache/cassandra/utils/FastHashMap.java
> src/java/org/apache/cassandra/utils/FastLinkedHashMap.java
> src/java/org/apache/cassandra/utils/FastObjectHash.java
> $
>
>
> Does anyone have an explanation for this  (e.g. they are loaded by
> reflection by concatenating strings) or are they just remnants of old
> code?
>
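
For what it's worth, the FindBugs warning in the quoted snippet is stylistic
rather than a correctness bug: inside the `null == val` branch, `val` is known
to be null, so `val == vals[i]` and `null == vals[i]` are the same reference
comparison; FindBugs simply prefers the explicit form. A quick demonstration
(hypothetical demo class):

```java
public class NullCompareDemo {
    public static void main(String[] args) {
        Object val = null; // we are inside the `if (null == val)` branch
        Object[] vals = {null, "x", null};
        for (Object v : vals) {
            // With val known to be null, both forms are the same reference
            // comparison against null; `null == vals[i]` just says so
            // explicitly, which is all FindBugs is asking for.
            if ((val == v) != (null == v)) {
                throw new AssertionError("the two forms disagreed");
            }
        }
        System.out.println("the two forms are equivalent");
    }
}
```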

