Cassandra data model misconceptions, and their sources

2009-08-17 Thread Evan Weaver
Ok, here are the common Cassandra misconceptions, and their sources,
gleaned from experience and talking to various people.

Not listed in any particular order.

1. A key is global, and data in different column families must be related.
  - BigTable paper
  - key precedence in Thrift API

2. Table is like a row-oriented table
  - the name
  - somewhat fixed by changing to keyspace

3. Keyspace is not like a database (in SQL/CouchDB/MongoDB)
  - because it's not called that

4. Columns are literally columnar
  - the name
  - column sets are stored per key, not per column family (unlike
    relational DBs)
  - column name as a piece of data is unusual (esp. in relational DBs)

5. Columns are versioned
  - BigTable paper

6. Super columns are magical
  - Name has no precedent anywhere
  - Super columns do not have timestamps unlike columns
  - Other MVAs are not fully recursive; just have values

7. Difference between column family, column, and super column is not clear
  - Everything has column in the name
  - super, family, and  are not well-understood

8. Cassandra uses Paxos
  - BigTable paper

9. Cassandra uses client-side conflict resolution
  - Dynamo paper

A lot of things to get wrong, right off the bat.

Maybe this makes it clear why the BigTable references were not helpful
to us? For a new user, it provides as many wrong assumptions as
correct assumptions.
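
As a rough illustration of items 1, 4, and 5, the model can be pictured as nested maps. This is only a Python sketch, not Cassandra's API: the `insert` and `get` helpers, the "Users" and "Statuses" column families, and the rule that a newer timestamp simply replaces the older value are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str       # the column name is itself a piece of data
    value: str
    timestamp: int  # supplied by the writer


def insert(keyspace, column_family, key, column):
    """Store a column, keeping one value per name (no version history)."""
    row = keyspace.setdefault(column_family, {}).setdefault(key, {})
    current = row.get(column.name)
    # Assumption for this sketch: a strictly newer timestamp replaces the old value.
    if current is None or column.timestamp > current.timestamp:
        row[column.name] = column


def get(keyspace, column_family, key, name):
    """Follow the path column family -> key -> column name and return the value."""
    return keyspace[column_family][key][name].value


# Hypothetical keyspace: column family -> key -> column name -> Column
keyspace = {}
insert(keyspace, "Users", "evan", Column("email", "evan@example.com", 1))
insert(keyspace, "Users", "evan", Column("email", "evan@example.org", 2))
insert(keyspace, "Users", "mark", Column("city", "San Francisco", 1))
insert(keyspace, "Statuses", "evan", Column("text", "hello", 1))
```

The key "evan" appears in both column families with unrelated data, each row carries its own set of column names, and the older email value is replaced rather than kept as a version.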

Evan

-- 
Evan Weaver


Re: [VOTE] Release cassandra 0.4.0-beta1

2009-08-17 Thread Ian Holsman

+1 from me.

On Aug 18, 2009, at 2:12 AM, ant elder wrote:

On Mon, Aug 17, 2009 at 4:55 PM, Eric Evans <eev...@rackspace.com> wrote:

On Mon, 2009-08-17 at 13:00 +0100, sebb wrote:

 Given what's being said in the Thrift release
 legal issues thread, I think it should be OK to have the 3rd party
 licenses separate,


I disagree. It must be possible to find all the LICENSE files starting
at the initial LICENSE file. At the very least, the initial LICENSE
file should have pointers to the other license files.


the NOTICE file looks acceptable to me too.


AIUI, the NOTICE file needs to give attributions to all 3rd party code
included in the proposed release.


When preparing for the 0.3.0[0] release I spent a great deal of time
trying to get all of this right. I looked at list threads for both
successful and failed podling release votes, I looked at what top-level
projects were doing, and I read through what documentation I could find.

This wasn't as helpful as I'd have liked because the documents are
non-normative and the application is inconsistent (and occasionally
contradictory). So I did the best I could.

The conclusion I came to with respect to NOTICE.txt was that it existed
for purposes of attribution, and was specifically in response to section
4(d) of the Apache License. As a result, the NOTICE.txt in the
(approved) 0.3.0 artifacts and the proposed 0.4.0 contains two
attribution statements, one for the Apache licensed Groovy, and one for
software developed by The Apache Software Foundation, which should
cover everything else that is Apache licensed.

The conclusion I came to for LICENSE.txt was that it was for including
the full license text applicable to the project itself.

Both of the above conclusions seemed consistent with at least some
successful podling releases, and with some ASF top-level projects, and
(to the best of my knowledge), all of the license requirements for our
third-party dependencies are being met.

However, I'd be happy to go back and correct any shortcomings and
re-roll the artifacts if that will get us the votes we need to make a
release. I just wish things were more consistent and that the process
required a little less groping around in the dark.


[0] http://www.mail-archive.com/gene...@incubator.apache.org/msg21853.html


--
Eric Evans
eev...@rackspace.com




Ok, +1 from me to release. Happy to reconsider if anyone can find an
actual link to some evidence of specific policy that says how this
release is done is not ok.

  ...ant


--
Ian Holsman
i...@holsman.net





Re: Cassandra data model misconceptions, and their sources

2009-08-17 Thread Adam Rosien
I find the diagrams of Evan and folks
(http://blog.evanweaver.com/files/cassandra/twitter.jpg) much easier
to grok than any particular naming scheme. Annotating that diagram
with specific implementations or constraints, like your wiki page, is
a great addition.

.. Adam

On Mon, Aug 17, 2009 at 3:32 PM, Mark McBride <mark.mcbr...@gmail.com> wrote:
 My first attempt at a revamped data model wiki page is up here

 http://wiki.apache.org/cassandra/DataModel2

 This one follows phatduckk's approach of describing the data model
 bottom up, which I found more intuitive.  I'm interested to hear if

 1) It corrects some of the misconceptions people have run into
 2) The bottom up approach is more approachable than top down.
 3) I got everything covered and everything right :)

   ---Mark

 On Mon, Aug 17, 2009 at 11:31 AM, Edward Ribeiro <edward.ribe...@gmail.com> wrote:
 Right on target, Evan!

 When I first downloaded Cassandra, three months ago, I tried to make
 the analogy with BigTable, whose paper I'd already read, but the
 differences between Cassandra and BigTable made it quite hard to grasp
 some Cassandra concepts.

 Imho, since Table was renamed to Keyspace, Column should be the next
 concept to be renamed, as shown by items 4, 5, 6, and 7 of your
 list. I would suggest renaming Column to Attribute (with the
 corresponding AttributeFamily or AttributeSet). It's not the best
 name, but it's what I can suggest right off the bat.

 Edward




Re: Cassandra data model misconceptions, and their sources

2009-08-17 Thread Curt Micol
I've been thinking about this for a number of days, and again, while I am not a
developer I thought I might toss in a proposal if that's okay.

Since putting together a schema diagram and having a number of people review
it, I think a change is warranted. Too many people are coming from the RDBMS
world and the terms used by Cassandra are conflicting with those terms they
are already familiar with.

The TLDR version is as follows:

Object (Column)
ObjectFamily (ColumnFamily)
Directory (Row)
ObjectContainer (SuperColumn)
Namespace (Keyspace)

The long version...

Object (Column)
As Evan has stated repeatedly, column is a bit misleading especially when
compared to other types of database systems.  I think this is probably the
most important change to the data model names, and exactly where I started
since this is the 'core' of Cassandra.  Object gives the impression that this
is a piece of data, it's relatively structured but the name gives no
impression how strict that structure is. 'Objects' have names that have values
and timestamps. Simple and to the point. 'Object' doesn't come with the
preconceived notions that 'column' comes with and leaves room for Cassandra to
define what an 'object' is without any conflict to preexisting data
structures.

By changing this, we can move up the ladder to other data types and
easily rename them to something that 'contains objects' or 'accesses objects'.
This allows us to describe the data model in the name structure without
having to get too deep into the definition.

Directory (Row)
'row' is currently unnamed, but still a structure that exists in the model.
It's not specifically data itself, but more of a mapping of how to get to
objects (using keys). 'Directory' fills this void quite well. It is easily
explained as a path to get to data and not data itself.

ObjectFamily (ColumnFamily)
There's no argument that the one direct link to the BigTable paper is 'column
families'. It's perhaps the only structure that is virtually the same in both
pieces of software.  Considering this, I think we need to avoid too drastic a
change.  With that said, I think a change is necessary due to the differences
in columns between the two databases. 'object family' is descriptive of the
relation between objects and removes any reference to tabular structures while
keeping a loose relationship to 'column family' in the BigTable paper.

ObjectContainer (SuperColumn)
I could see this being shortened to 'container' in every day conversation.
However, 'objectcontainer' fits nicely with the rest of the data model names
and is descriptive of its purpose and use. Ultimately a 'supercolumn' is
nothing more than a named container of columns (and I've seen on at least 3
different occasions the word container used to describe supercolumns).
'supercolumn' had no real connection to what exactly it was defining, but with
'object container' we have a clear understanding that we are naming the
structure that holds objects. Or as I explained it to a friend, we are naming
the 'jar' and not the 'honey'. :)
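
To make the 'jar' versus 'honey' distinction a little more concrete, here is a minimal Python sketch of a supercolumn as a named container of columns. The class names, the `add` helper, and the "address" example are hypothetical and are not taken from the Cassandra source. The container deliberately has no timestamp of its own, matching Evan's note that super columns do not have timestamps unlike columns.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Column:
    """The 'honey': a name with a value and a timestamp."""
    name: str
    value: str
    timestamp: int


@dataclass
class SuperColumn:
    """The 'jar': a name plus the columns it holds, with no timestamp of its own."""
    name: str
    columns: dict = field(default_factory=dict)  # column name -> Column

    def add(self, column):
        # Columns are looked up by name inside the container.
        self.columns[column.name] = column


# Hypothetical example data
address = SuperColumn("address")
address.add(Column("city", "Austin", 1))
address.add(Column("zip", "78701", 1))
```

Under the proposed names, `SuperColumn` would be the ObjectContainer and `Column` the Object; only the labels change, not the shape.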

Namespace (Keyspace)
This one I go back and forth on. I know it's been changed from 'Table' to
'keyspace' and Evan proposed 'database', but I think that 'namespace' is
really what it is we are talking about. Wikipedia has this as the first line
to describe 'namespace':

A namespace is an abstract container or environment created to hold a
logical grouping of unique identifiers or symbols (i.e., names).

Originally I thought 'objectspace' would fit better, but I think 'namespace'
comes with a better history and is clearer about what this structure really is.
Especially when you relate the name namespace to how it is used in Ruby, Python
and Java. Ultimately though, I think I prefer 'keyspace' over 'table'
or 'database'.

The only issue I see with all of these names is the potential conflict with
programming languages and their objects. I know next to nothing about Java so
I don't know if there would be a conflict here. I've run the following Google
search 'reserved words in *' where '*' is Ruby, Python, Java and C++ and
received no mention of 'object' being a reserved word in any of those
languages.

I also grep'd through the current source code and there don't seem to be any
real conflicts that couldn't be renamed so as not to clash
with this naming structure.

In the end, I think it's a good idea to look at this and work out a solution.
Documentation and tutorials are going to help, but I think people are so
entrenched in the RDBMS world that there is somewhat of a barrier to
understanding Cassandra's data model.

Thanks for your time,

-- 
# Curt Micol