Re: Regarding Cassandra Scalability
Hi Paul, I am not under any pressure to build software with Cassandra right now; I am just studying and exploring it, which is why I am so curious about it. OK, I will continue my study and wait for better documentation. Dir. On Mon, Apr 19, 2010 at 1:44 PM, Paul Prescod pres...@gmail.com wrote: On Sun, Apr 18, 2010 at 9:14 AM, dir dir sikerasa...@gmail.com wrote: Hi Gary, The main reason is that the compaction operation (removing deleted values) currently requires that an entire row be read into memory. Thank you for your explanation. But I still do not understand what you mean. Do you have a pressing need to use Cassandra right now, before version 1.0 is even available? That limitation will go away before 1.0, so you could simply wait and not worry about it. Documentation will also be much more complete in the future. Paul Prescod
cassandra monitoring
Hi, What is the preferred way of monitoring Cassandra clusters? Is Cassandra integrated with Ganglia? Thank you very much! Best regards, Daniel.
0.6 insert performance .... Re: [RELEASE] 0.6.1
I wonder if anyone can use the new logging of GC activity (CASSANDRA-813) to confirm this: http://www.slideshare.net/schubertzhang/cassandra-060-insert-throughput - m. On Sun, Apr 18, 2010 at 6:58 PM, Eric Evans eev...@rackspace.com wrote: Hot on the heels of 0.6.0 comes our latest, 0.6.1. This stable point release contains a number of important bugfixes[1] and is a painless upgrade from 0.6.0. Enjoy! [1]: http://bit.ly/9NqwAb (changelog) -- Eric Evans eev...@rackspace.com
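Independently of CASSANDRA-813, one way to confirm GC involvement is to enable the JVM's own GC logging. A minimal sketch, assuming the stock cassandra.in.sh; these are standard HotSpot flags, and the log path is purely illustrative:

    # Append to JVM_OPTS in cassandra.in.sh; point the log anywhere the cassandra user can write.
    JVM_OPTS="$JVM_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -Xloggc:/var/log/cassandra/gc.log"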
RE: 0.6 insert performance .... Re: [RELEASE] 0.6.1
I'm seeing some issues like this as well; in fact, I think seeing your graphs has helped me understand the dynamics of my cluster better. Using some ballpark figures for inserting single-column objects of ~500 bytes onto individual nodes (not when combined as a cluster):

Node1: Inserts 12000/s
Node2: Inserts 12000/s
Node3: Inserts 9000/s
Node4: Inserts 6000/s

When combined as a cluster, inserts are around 7000/s (replication factor of 2). When GC kicks in anywhere in the cluster, quorum writes slow down for everyone associated with that node. And with 4 nodes, garbage collection will be going on somewhere almost all the time. So while I should be able to write more than 12,000/second, the slowest node in the cluster seems to drag the faster nodes down. I'm still running tests of various combinations to see where things work out. From: Masood Mortazavi [mailto:masoodmortaz...@gmail.com] Sent: Monday, April 19, 2010 6:15 AM To: user@cassandra.apache.org; d...@cassandra.apache.org Subject: 0.6 insert performance Re: [RELEASE] 0.6.1 I wonder if anyone can use the new logging of GC activity (CASSANDRA-813) to confirm this: http://www.slideshare.net/schubertzhang/cassandra-060-insert-throughput - m. On Sun, Apr 18, 2010 at 6:58 PM, Eric Evans eev...@rackspace.com wrote: Hot on the heels of 0.6.0 comes our latest, 0.6.1. This stable point release contains a number of important bugfixes[1] and is a painless upgrade from 0.6.0. Enjoy! [1]: http://bit.ly/9NqwAb (changelog) -- Eric Evans eev...@rackspace.com
Re: Regarding Cassandra Scalability
On Sun, Apr 18, 2010 at 11:14, dir dir sikerasa...@gmail.com wrote: Hi Gary, The main reason is that the compaction operation (removing deleted values) currently requires that an entire row be read into memory. Thank you for your explanation. But I still do not understand what you mean. When you delete a column in Cassandra, the data is not really deleted. Instead, a flag is turned on indicating the column is no longer valid (we call it a 'tombstone'). During compaction the column family is scanned and the tombstones are truly deleted. In my opinion, the row contents must actually fit in available memory; if they do not, the software will raise an out-of-memory exception. Since it is true that the row contents must fit in available memory, why did you say that is a problem which Cassandra cannot solve? It was not correct of me to say it is a problem that Cassandra cannot solve. Memory-efficient compactions will be addressed. You say the compaction operation requires that an entire row be read into memory; is this an out-of-memory problem? When do we need to perform a compaction operation, and in what situation shall we perform it? You will need to address the large rows yourself (consider breaking them up). You can identify these rows during compaction by setting RowWarningThresholdInMB in storage-conf.xml. When a big enough row comes along, it is logged so you can go back later and address the problem. Regards, Gary. Thank You. Dir.
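For reference, the setting Gary mentions is a single element in storage-conf.xml; a minimal sketch, where the 64 MB threshold is only an illustrative value:

    <!-- Log a warning when compaction encounters a row larger than this many megabytes. -->
    <RowWarningThresholdInMB>64</RowWarningThresholdInMB>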
RE: Cassandra Java Client
May I take this chance to share this link here: http://code.google.com/p/jassandra/ It is currently based on the Cassandra 0.6 Thrift APIs. The classes ThriftCriteria and ThriftColumnFamily use the Thrift API directly. Also, the site itself has test code, which works on the Jassandra abstraction. Dop From: Nirmala Agadgar [mailto:nirmala...@gmail.com] Sent: Friday, April 16, 2010 5:56 PM To: user@cassandra.apache.org Subject: Cassandra Java Client Hi, Can anyone tell how to implement a client that can insert data into Cassandra in Java? Any code or guidelines would be helpful. - Nirmala
Re: Cassandra Java Client
How is Jassandra different from http://github.com/rantav/hector ? On Mon, Apr 19, 2010 at 9:21 AM, Dop Sun su...@dopsun.com wrote: May I take this chance to share this link here: http://code.google.com/p/jassandra/ It is currently based on the Cassandra 0.6 Thrift APIs. The classes ThriftCriteria and ThriftColumnFamily use the Thrift API directly. Also, the site itself has test code, which works on the Jassandra abstraction. Dop From: Nirmala Agadgar [mailto:nirmala...@gmail.com] Sent: Friday, April 16, 2010 5:56 PM To: user@cassandra.apache.org Subject: Cassandra Java Client Hi, Can anyone tell how to implement a client that can insert data into Cassandra in Java? Any code or guidelines would be helpful. - Nirmala
RE: Cassandra Java Client
Well, there are a couple of points behind why Jassandra was created: 1. First of all, I wanted to create something like this because I come from a JDBC background and am familiar with the Hibernate API. The ICriteria interface (created for querying) is inspired by Hibernate's Criteria API. Actually, maybe because of this background, it cost me a lot of effort to understand Cassandra in the beginning, and the Thrift API also takes time to learn. 2. Jassandra creates a layer which removes the direct link to the underlying Thrift API (including the exceptions, the ConsistencyLevel enumeration, etc.). I highlight this point because I believe clients of Jassandra will benefit from implementation changes in the future: for example, if Cassandra provides a better Thrift API for selecting the columns for a list of keys or SCFs, or deprecates some structures or exceptions, the client may not need to change. Of course, if Jassandra fails to prove itself, this is not actually an advantage. :) 3. Jassandra is designed to be a JDBC-like API, no less, no more. It strives to use the best API to do the querying (with token, key, SCF/CF) and the CRUD, but no more than that. For example, it does not cover anything like object mapping. But it should cover all the API functionality Thrift provides. These 3 points are different from Hector (I should be honest that I have not tried it before; the feeling of difference comes from the sample code Hector provides). So, the API Jassandra abstracts looks something like this:

    // 1. Get a connection like this
    IConnection connection = DriverManager.getConnection("thrift://localhost:9160", info);
    try {
      // 2. Get a KeySpace by name
      IKeySpace keySpace = connection.getKeySpace("Keyspace1");
      // 3. Get a ColumnFamily by name
      IColumnFamily cf = keySpace.getColumnFamily("Standard2");
      // 4. Insert like this
      long now = System.currentTimeMillis();
      ByteArray nameFirst = ByteArray.ofASCII("first");
      ByteArray nameLast = ByteArray.ofASCII("last");
      ByteArray nameAge = ByteArray.ofASCII("age");
      ByteArray valueLast = ByteArray.ofUTF8("Smith");
      IColumn colFirst = new Column(nameFirst, ByteArray.ofUTF8("John"), now);
      cf.insert(userName, colFirst);
      IColumn colLast = new Column(nameLast, valueLast, now);
      cf.insert(userName, colLast);
      IColumn colAge = new Column(nameAge, ByteArray.ofLong(42), now);
      cf.insert(userName, colAge);
      // 5. Select like this
      ICriteria criteria = cf.createCriteria();
      criteria.keyList(Lists.newArrayList(userName)).columnRange(nameAge, nameLast, 10);
      Map<String, List<IColumn>> map = criteria.select();
      List<IColumn> list = map.get(userName);
      Assert.assertEquals(3, list.size());
      Assert.assertEquals(valueLast, list.get(2).getValue());
      // 6. Delete like this
      cf.delete(userName, colFirst);
      map = criteria.select();
      Assert.assertEquals(2, map.get(userName).size());
      // 7. Get count like this
      criteria = cf.createCriteria();
      criteria.keyList(Lists.newArrayList(userName));
      int count = criteria.count();
      Assert.assertEquals(2, count);
    } finally {
      // 8. Don't forget to close the connection.
      connection.close();
    }

-Original Message- From: Jonathan Ellis [mailto:jbel...@gmail.com] Sent: Monday, April 19, 2010 10:35 PM To: user@cassandra.apache.org Subject: Re: Cassandra Java Client How is Jassandra different from http://github.com/rantav/hector ? On Mon, Apr 19, 2010 at 9:21 AM, Dop Sun su...@dopsun.com wrote: May I take this chance to share this link here: http://code.google.com/p/jassandra/ It is currently based on the Cassandra 0.6 Thrift APIs. The classes ThriftCriteria and ThriftColumnFamily use the Thrift API directly. Also, the site itself has test code, which works on the Jassandra abstraction. Dop From: Nirmala Agadgar [mailto:nirmala...@gmail.com] Sent: Friday, April 16, 2010 5:56 PM To: user@cassandra.apache.org Subject: Cassandra Java Client Hi, Can anyone tell how to implement a client that can insert data into Cassandra in Java? Any code or guidelines would be helpful. - Nirmala
tcp CLOSE_WAIT bug
Hi all, We have observed several connections between nodes in CLOSE_WAIT after several hours of operation. At node 87: netstat -tn | grep 7000

tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:57625  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:51541  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:58447  ESTABLISHED
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:51313  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:52065  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:58218  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:54986  :::192.168.2.88:7000   ESTABLISHED
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:48272  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:55433  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:59138  :::192.168.2.88:7000   ESTABLISHED
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:39074  ESTABLISHED
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:59088  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:34012  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:55806  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:42472  CLOSE_WAIT
tcp  0  0  :::192.168.2.87:7000   :::192.168.2.88:45033  CLOSE_WAIT

At the other node, 88: netstat -tn | grep 7000

tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:59138  ESTABLISHED
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:46143  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:38202  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:55852  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:39208  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:55378  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:51061  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:44911  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:58447  :::192.168.2.87:7000   ESTABLISHED
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:59614  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:35033  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:39074  :::192.168.2.87:7000   ESTABLISHED
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:54986  ESTABLISHED
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:54772  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:39925  CLOSE_WAIT
tcp  0  0  :::192.168.2.88:7000   :::192.168.2.87:38124  CLOSE_WAIT

The setup uses only two nodes, replication factor = 2, with the latest JDK 6u20 and Cassandra 0.6.0. AFAIK, CLOSE_WAIT indicates open sockets that were not closed properly. Has anyone experienced a similar problem? How do I find the root cause? Any help is appreciated.
Re: tcp CLOSE_WAIT bug
Thanks for the information. We do use connection pools with the Thrift client, and ThriftAddress is on port 9160. The problematic connections we found are all on port 7000, which is the internal communications port between nodes. I guess this is related to StreamingService. On Mon, Apr 19, 2010 at 23:46, Brandon Williams dri...@gmail.com wrote: On Mon, Apr 19, 2010 at 10:27 AM, Ingram Chen ingramc...@gmail.com wrote: Hi all, We have observed several connections between nodes in CLOSE_WAIT after several hours of operation: This is symptomatic of not pooling your client connections correctly. Be sure you're using one connection per thread, not one connection per operation. -Brandon -- Ingram Chen online share order: http://dinbendon.net blog: http://www.javaworld.com.tw/roller/page/ingramchen
RE: 0.6 insert performance .... Re: [RELEASE] 0.6.1
We see this behavior as well with 0.6; heap usage graphs look almost identical. The GC is a noticeable bottleneck; we've tried the JDK 6u19 and JRockit VMs. It basically kills any kind of soft real-time behavior. From: Masood Mortazavi [mailto:masoodmortaz...@gmail.com] Sent: Monday, April 19, 2010 4:15 AM To: user@cassandra.apache.org; d...@cassandra.apache.org Subject: 0.6 insert performance Re: [RELEASE] 0.6.1 I wonder if anyone can use the new logging of GC activity (CASSANDRA-813) to confirm this: http://www.slideshare.net/schubertzhang/cassandra-060-insert-throughput - m. On Sun, Apr 18, 2010 at 6:58 PM, Eric Evans eev...@rackspace.com wrote: Hot on the heels of 0.6.0 comes our latest, 0.6.1. This stable point release contains a number of important bugfixes[1] and is a painless upgrade from 0.6.0. Enjoy! [1]: http://bit.ly/9NqwAb (changelog) -- Eric Evans eev...@rackspace.com
Map/Reduce Cassandra Output
Unlike the wordcount example, my input source is a directory, and I have a split class and record reader defined. Also unlike wordcount, during reduce I need to insert into Cassandra. I notice the wordcount input retrieves a handle on a Cassandra client like this:

    TSocket socket = new TSocket(DatabaseDescriptor.getSeeds().iterator().next().getHostAddress(),
            DatabaseDescriptor.getThriftPort());
    TBinaryProtocol binaryProtocol = new TBinaryProtocol(socket, false, false);
    Cassandra.Client client = new Cassandra.Client(binaryProtocol);

Would all Hadoop nodes go to the same seed if I use this code to insert data, without balancing it? Has this been done somewhere in the Cassandra code already?
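For context, a minimal sketch of how such a handle is typically used to write a column with the 0.6 Thrift API; the keyspace, column family, key, and host names here are illustrative, not taken from the thread:

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnPath;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class InsertSketch {
        public static void main(String[] args) throws Exception {
            TSocket socket = new TSocket("localhost", 9160); // a single node, for illustration
            socket.open();
            Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(socket, false, false));
            // Write one column: row "key1", column "name", column family "Standard1".
            ColumnPath path = new ColumnPath("Standard1");
            path.setColumn("name".getBytes("UTF-8"));
            client.insert("Keyspace1", "key1", path, "value".getBytes("UTF-8"),
                    System.currentTimeMillis(), ConsistencyLevel.ONE);
            socket.close();
        }
    }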
Re: [RELEASE] 0.6.0
On Wed, 14 Apr 2010 13:09:13 -0500 Ted Zlatanov t...@lifelogs.com wrote: TZ On Wed, 14 Apr 2010 12:23:19 -0500 Eric Evans eev...@rackspace.com wrote: EE On Wed, 2010-04-14 at 10:16 -0500, Ted Zlatanov wrote: Can it support a non-root user through /etc/default/cassandra? I've been patching the init script myself but was hoping this would be standard. EE It's the first item on debian/TODO, but, you know, patches welcome and EE all that. TZ The appended patch has been sufficient for me. Eric, do you need me to open a ticket for this, too, or is what I posted sufficient? Thanks Ted
Modelling assets and user permissions
Suppose I have a CF that holds some sort of assets that some users of my program have access to, and that some do not. In SQL-ish terms it would look something like this:

TABLE Assets (
    asset_id serial primary key,
    ...
);
TABLE Users (
    user_id serial primary key,
    user_name text
);
TABLE Permissions (
    asset_id integer references(Assets),
    user_id integer references(Users)
);

Now, I can generate UUIDs for my asset keys without any trouble, so the serial that I have in my pseudo-SQL Assets table isn't a problem. My problem is that I can't see a good way to model the relationship between user ids and assets. I see one way to do this, which has problems, and I think I sort of see a second way. The obvious way to do it is to have the Assets CF contain a SuperColumn that somehow enumerates the users allowed to see it, so when retrieving a specific Asset I can retrieve the user list and ensure that the user making the request is allowed to see it. This has quite a few problems. The foremost is that Cassandra doesn't appear to have much for conflict resolution (at least I can't find any docs on it), so if two processes try to add permissions to the same Asset, it looks like one process will win and I have no idea what happens to the loser. Another problem is that Cassandra's SuperColumns don't appear to be ideal for storing lists of things; they store maps, which isn't a terrible problem, but it feels like a bit of a mismatch in my design. A SuperColumn mapping from user_ids to an empty byte array seems like it should work pretty efficiently for checking whether a user has permissions on an Asset, but it also seems pretty evil. The other idea that I have is a separate CF for AssetPermissions that somehow stores pairs of asset_ids and user_names. I don't know what I'd use for a key in that situation, so I haven't really gotten too far in seeing what else is broken with that idea. I think it would get around the race condition, but I don't know how to do it, and I'm not sure how efficient it could be. What do people normally use in this situation? I assume it's a pretty common problem, but I haven't seen it in the various data modelling examples on the Wiki.
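To make the second idea concrete: with a plain (non-super) AssetPermissions CF keyed by asset id, each granted user becomes a column whose name is the user id and whose value is empty, and the permission check is a single-column get. A minimal sketch with the 0.6 Thrift API; all names ("Keyspace1", "AssetPermissions", the ids) are illustrative, and the client is assumed to be connected as in the other threads in this digest:

    import org.apache.cassandra.thrift.*;

    public final class PermissionSketch {
        // Grant user "user42" access to asset "asset-uuid-1": an empty-valued
        // column whose name is the user id, in a row keyed by the asset id.
        static void grant(Cassandra.Client client) throws Exception {
            ColumnPath perm = new ColumnPath("AssetPermissions");
            perm.setColumn("user42".getBytes("UTF-8"));
            client.insert("Keyspace1", "asset-uuid-1", perm, new byte[0],
                    System.currentTimeMillis(), ConsistencyLevel.QUORUM);
        }

        // Check: the single-column get either finds the column or throws NotFoundException.
        static boolean allowed(Cassandra.Client client) throws Exception {
            ColumnPath perm = new ColumnPath("AssetPermissions");
            perm.setColumn("user42".getBytes("UTF-8"));
            try {
                client.get("Keyspace1", "asset-uuid-1", perm, ConsistencyLevel.QUORUM);
                return true;
            } catch (NotFoundException e) {
                return false;
            }
        }
    }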
Re: Cassandra Java Client
Hi Dop, you may want to look at Hector as a low-level Cassandra client on which you build Jassandra, adding Hibernate-style magic etc. like other people have done with ORM layers on top of it. Hector's main features include extensive JMX counters, failover, and connection pooling. It's available for all recent versions, including 0.5.0, 0.5.1, 0.6.0 and 0.6.1. On Mon, Apr 19, 2010 at 5:58 PM, Dop Sun su...@dopsun.com wrote: Well, there are a couple of points behind why Jassandra was created: [snip: full message quoted earlier in this thread]
Re: [RELEASE] 0.6.0
On Mon, 2010-04-19 at 12:02 -0500, Ted Zlatanov wrote: EE It's the first item on debian/TODO, but, you know, patches welcome and EE all that. TZ The appended patch has been sufficient for me. Eric, do you need me to open a ticket for this, too, or is what I posted sufficient? Feel free to open a ticket, that never hurts. I had planned to use the maintainer scripts to create a system user (in an idempotent way), with a default configuration that used this new user. I had also planned to ensure that permissions were updated accordingly when upgrading from a previous version. -- Eric Evans eev...@rackspace.com
PropertyFileEndPointSnitch
When building the PropertyFileEndPointSnitch into the jar cassandra-propsnitch.jar, the files in the jar end up at src/java/org/apache/cassandra/locator/PropertyFileEndPointSnitch.class instead of org/apache/cassandra/locator/PropertyFileEndPointSnitch.class. Am I doing something wrong, is this intended behavior, or is it a bug? -- Regards Erik
RE: Map/Reduce Cassandra Output
If you used that snippet of code, all connections would go through the same seed: the input code does additional work to determine which nodes are holding particular key ranges, and then connects directly. For outputting from Hadoop to Cassandra, you may want to consider using a Java client like Hector, which will handle the load balancing for you. http://github.com/rantav/hector Thanks, Stu -Original Message- From: Sonny Heer sonnyh...@gmail.com Sent: Monday, April 19, 2010 11:29am To: cassandra-u...@incubator.apache.org Subject: Map/Reduce Cassandra Output Unlike the wordcount example, my input source is a directory, and I have a split class and record reader defined. Also unlike wordcount, during reduce I need to insert into Cassandra. I notice the wordcount input retrieves a handle on a Cassandra client like this:

    TSocket socket = new TSocket(DatabaseDescriptor.getSeeds().iterator().next().getHostAddress(),
            DatabaseDescriptor.getThriftPort());
    TBinaryProtocol binaryProtocol = new TBinaryProtocol(socket, false, false);
    Cassandra.Client client = new Cassandra.Client(binaryProtocol);

Would all Hadoop nodes go to the same seed if I use this code to insert data, without balancing it? Has this been done somewhere in the Cassandra code already?
restore with snapshot
I am working on finalizing our backup and restore procedures for a Cassandra cluster running on EC2. I understand, based on the wiki, that in order to replace a single node I don't actually need to put data on that node: I just need to bootstrap the new node into the cluster and it will get data from the other nodes. However, would it speed up the process if that node already has the data from the node it is replacing? Also, what do I do if the entire cluster goes down? I am planning to snapshot the data each night for each node. Should I save the system keyspace snapshots? Is it problematic to bring the cluster back up with new IPs on each node, but the same tokens as before? Lee Parker
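For reference, nightly per-node snapshots like the ones described here are usually taken with nodetool; a minimal sketch, where the host and snapshot tag are illustrative:

    # Take a snapshot on one node; hard links appear under each keyspace's snapshots/ directory.
    nodetool -host 10.0.0.1 snapshot nightly-2010-04-19
    # Remove old snapshots once they have been copied off the node.
    nodetool -host 10.0.0.1 clearsnapshot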
Re: Data model question - column names sort
On Thu, Apr 15, 2010 at 6:01 PM, Sonny Heer sonnyh...@gmail.com wrote: Need a way to have two different types of indexes.

Key: aTextKey
ColumnName: aTextColumnName:55
Value:

Key: aTextKey
ColumnName: 55:aTextColumnName
Value:

All the valuable information is stored in the column name itself. The above two can be in different column families. Queries: Given a key, page me a list of numerical values sorted on aTextColumnName. Given a key, page me a list of text values sorted on a numerical value. This approach would require left-padding the numeric value for the second index so Cassandra can sort on column names correctly. Don't do that; pack the numeric value into a fixed-length byte array instead. Then you don't have to do any expensive string operations in the comparator. -Jonathan
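A minimal sketch of the packing Jonathan suggests, assuming non-negative values (a big-endian two's-complement encoding only byte-compares in numeric order for non-negative longs); a column name encoded this way can also be sorted by Cassandra's LongType comparator:

    import java.nio.ByteBuffer;

    public final class LongPacker {
        // Encode a long as a fixed-length, big-endian 8-byte array so that
        // byte-wise comparison of column names matches numeric order (for v >= 0).
        public static byte[] pack(long v) {
            return ByteBuffer.allocate(8).putLong(v).array();
        }

        public static long unpack(byte[] b) {
            return ByteBuffer.wrap(b).getLong();
        }
    }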
RE: Cassandra Java Client
Hi Ran: Yep, it looks like there is a possibility that I can add dependencies on Hector and build Jassandra's functionality on top of it. I would take this chance to extend the discussion about an “xxx client for Cassandra” a little bit. In short, Cassandra may need a kind of sub-project to define the “xxx-client for Cassandra” for the most popular platforms (like Python, Java, .NET), or to define a framework (standard, guideline, or whatever) and let the community port/implement it in different languages and platforms. I believe the Cassandra product itself needs the flexibility to change the Thrift API at any given time, including deprecating old APIs which may have bad performance, or adding new APIs to cover new functionality. But applications built on Cassandra and deployed in production generally cannot bear this (generally, meaning everyone except companies like Facebook or Digg, who have huge teams to follow the changes in Cassandra). If these products depend on the Thrift API directly, the deployments will eventually be left behind on old versions. This problem happened in the RDBMS world, and it was eventually resolved with the xDBC APIs: when a new database comes out, it provides the general features plus its special features, so applications built on database technologies can, in most cases, move smoothly to the latest database server versions. With another layer of abstraction, the performance of the application does not necessarily get slower, since the developer of the newly introduced API can use the latest version of the Thrift API to give the application's requests the best performance. An immediate example: the querying API in Thrift (get_xxx, multiget, and get_key_slice) has been enhanced a lot. Before an “xxx-client for Cassandra” is updated, an application can keep working with a new version of Cassandra using the old API; once the “xxx-client” is updated, the application immediately uses the new features without changing its code. The reason I believe this “xxx-client for Cassandra” is best done as a sub-project is that, since the API of Cassandra is still changing at this stage, the design of such an API needs a lot of inside details and guidance. It is very difficult for people from outside, like me, to define an API based on guessing that will eventually be flexible enough to support future Cassandra. I can see that several “xxx client for Cassandra” projects have been abandoned, and my guess is that this is why. Maybe one day Jassandra also cannot be extended to meet the API of a future version of Cassandra, and I may abandon it as well. :) Cheers~~~ Dop From: Ran Tavory [mailto:ran...@gmail.com] Sent: Tuesday, April 20, 2010 1:36 AM To: user@cassandra.apache.org Subject: Re: Cassandra Java Client Hi Dop, you may want to look at Hector as a low-level Cassandra client on which you build Jassandra, adding Hibernate-style magic etc. like other people have done with ORM layers on top of it. Hector's main features include extensive JMX counters, failover, and connection pooling. It's available for all recent versions, including 0.5.0, 0.5.1, 0.6.0 and 0.6.1. On Mon, Apr 19, 2010 at 5:58 PM, Dop Sun su...@dopsun.com wrote: Well, there are a couple of points behind why Jassandra was created: [snip: full message quoted earlier in this thread]
Re: Clarification on Ring operations in Cassandra 0.5.1
On Thu, Apr 15, 2010 at 6:10 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote:

1) shut down cassandra on the instance I want to replace
2) create a new instance, start cassandra with AutoBootstrap = true
3) run nodeprobe removetoken against the token of the instance I am replacing

Then, according to the 'Handling failure' section, the new instance will find the appropriate position automatically. However, it's not clear to me if this means it will take the same range as the shut-down node or not, because normally AutoBootstrap == true means it will take half the keys from the node with the most disk space used (from the 'Bootstrap' section). So will the process I describe above result in what I want, a new node replacing an old one? As you noted, it does not exactly replace the old one. If you require the token to be the same as the dead one's, then you should manually move the new node after removing the dead one. How does removetoken know which instance to remove; does it remove the Down instance? Tokens are unique per node. (Those are the values you see in nodetool ring.) Another hopefully minor question: if I bring up a new node with AutoBootstrap = false, what happens? Does it join the ring but without data? Yes. And without a token range? No. (This is why you should not do that.) Can I then 'nodeprobe move' to the token for the range I want to take over, and achieve the same as step 2 above? You can't have two nodes with the same token in the ring at once. So, you can removetoken the old node first, then bootstrap the new one (just specify InitialToken in the config to avoid having it guess one), or you can make it a 3-step process (bootstrap, remove, move) to avoid transferring so much data around. -Jonathan
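A minimal sketch of the remove-then-move sequence Jonathan describes; the hosts and the token value are illustrative:

    # Remove the dead node's token from the ring (run against any live node).
    nodeprobe -host 10.0.0.1 removetoken 85070591730234615865843651857942052864
    # After the replacement node has bootstrapped, move it onto the old token.
    nodeprobe -host 10.0.0.5 move 85070591730234615865843651857942052864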
Re: effective modeling for fixed limit columns
Limiting by number of columns in a row will perform very poorly. Limiting by the time a column has existed can perform quite well, and was added by Sylvain for 0.7 in https://issues.apache.org/jira/browse/CASSANDRA-699 On Fri, Apr 16, 2010 at 1:50 PM, Chris Shorrock ch...@shorrockin.com wrote: I'm attempting to come up with a technique for limiting the number of columns a single key (or super column - it doesn't matter too much for the context of this conversation) may contain at any one time. My actual use-case is a little too meaty to describe, so an alternate use-case for this mechanism could be: construct a twitter-esque feed which maintains a list of N tweets. Tweets (in this system - and in reality, I suppose) occur at such a rate that you want to limit a given user's feed to N items. You do not have the ability to store an infinite number of tweets due to the physical constraints of your hardware. My first-idea answer is: when a tweet is inserted into the feed of a given person, you then do a count and delete of any outstanding tweets. In reality you could first count, then (if count >= N) do a batch mutate for the insertion of the new entry and the removal of the old. My issue with this approach is that after a certain point, every new entry into the system will incur the removal of an old entry; the count, once a feed has reached N, will always be >= N on any subsequent queries. Depending on how you index the tweets, you may need to actually do a read instead of a count to get the row identifiers. My second approach was to utilize a slot system where you have a record stored somewhere that indicates the next slot for insertion. This can be thought of as a fixed-length array where you store the next insertion point in some other column family. When a new tweet occurs, you retrieve the current slot metadata, insert into that index, then update the metadata for the next insertion. My concerns with this relate to synchronization and losing entries due to concurrent operations, and I'd rather not have to use something like ZooKeeper to synchronize in the application cluster. I have some other ideas but I'm mostly just spit-balling at this point, so I thought I'd reach out to the collective intelligence of the group to see if anyone has implemented something similar. Thanks in advance.
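A minimal sketch of the count-and-delete idea with the 0.6 Thrift API; the keyspace and CF names are illustrative, the client is assumed to be already connected, and note that it has exactly the race the thread worries about, since the count and the removes are not atomic:

    import java.util.List;
    import org.apache.cassandra.thrift.*;

    public final class FeedTrimmer {
        static final int MAX_COLUMNS = 100; // the "N" from the thread; illustrative

        // Trim one user's feed row down to MAX_COLUMNS by deleting the oldest columns.
        // Assumes column names sort oldest-first (e.g. TimeUUID) under the CF's comparator.
        static void trim(Cassandra.Client client, String key) throws Exception {
            ColumnParent feed = new ColumnParent("Feed");
            int n = client.get_count("Keyspace1", key, feed, ConsistencyLevel.QUORUM);
            if (n <= MAX_COLUMNS)
                return;
            // Slice the oldest (n - MAX_COLUMNS) columns from the start of the row.
            SlicePredicate oldest = new SlicePredicate();
            oldest.setSlice_range(new SliceRange(new byte[0], new byte[0], false, n - MAX_COLUMNS));
            List<ColumnOrSuperColumn> victims =
                    client.get_slice("Keyspace1", key, feed, oldest, ConsistencyLevel.QUORUM);
            for (ColumnOrSuperColumn cosc : victims) {
                ColumnPath path = new ColumnPath("Feed");
                path.setColumn(cosc.column.name);
                client.remove("Keyspace1", key, path, System.currentTimeMillis(), ConsistencyLevel.QUORUM);
            }
        }
    }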
Re: why read operation use so much of memory?
(Moving to users@ list.) Like any Java server, Cassandra will use as much memory in its heap as you allow it to. You can request a GC from jconsole to see what its approximate real working set is. http://wiki.apache.org/cassandra/SSTableMemtable explains why reads are slower than writes. You can tune this by using the key cache, row cache, or by using range queries instead of requesting rows one at a time. contrib/py_stress is a better starting place for a benchmark than rolling your own, btw; we see about 8000 reads/s with that on a 4-core server. On Sun, Apr 18, 2010 at 8:40 PM, Bingbing Liu rucb...@gmail.com wrote: Hi all, I have a cluster of 5 nodes; each node has a 4-core CPU and 8 GB of memory. I am using Cassandra 0.6-beta3 for testing. First, I inserted 6,000,000 rows, each of which is 1 KB, and the write speed was very exciting. But then, when I read them back one row at a time from two clients simultaneously, one of the clients was very slow and took a long time, and I found that on each node the Cassandra process occupies about 7 GB of memory (using the top command), which puzzled me. Why does the read operation use so much memory? Maybe I missed something? Thx. 2010-04-18 Bingbing Liu
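For reference, a read benchmark with contrib/py_stress could look like this; the flags mirror the insert invocation quoted later in this digest, and the hosts and counts are illustrative:

    # Read 1,000,000 keys with 50 threads against two nodes.
    python stress.py -o read -n 1000000 -t 50 -d 10.0.0.1,10.0.0.2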
Re: cassandra monitoring
Anything that can consume JMX. On Mon, Apr 19, 2010 at 5:34 AM, Simeonov, Daniel daniel.simeo...@sap.com wrote: Hi, What is the preferred way of monitoring Cassandra clusters? Is Cassandra integrated with Ganglia? Thank you very much! Best regards, Daniel.
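For instance, a plain JMX client can attach to a node and read both the standard JVM MBeans and the Cassandra-specific ones (registered under domains like org.apache.cassandra.service), and feed them to Ganglia or anything similar. A minimal sketch, assuming the default JMX port 8080 of the 0.6 packaging; the host and port are illustrative:

    import java.lang.management.MemoryMXBean;
    import javax.management.JMX;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxPeek {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://10.0.0.1:8080/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Standard JVM MBean: current heap usage.
                MemoryMXBean memory = JMX.newMXBeanProxy(mbs,
                        new ObjectName("java.lang:type=Memory"), MemoryMXBean.class);
                System.out.println("heap: " + memory.getHeapMemoryUsage());
                // Cassandra-specific MBeans live under org.apache.cassandra.*;
                // list them and poll whichever attributes you care about.
                for (ObjectName name : mbs.queryNames(new ObjectName("org.apache.cassandra.*:*"), null))
                    System.out.println(name);
            } finally {
                connector.close();
            }
        }
    }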
Re: tcp CLOSE_WAIT bug
Is this after doing a bootstrap or other streaming operation? Or did a node go down? The internal sockets are supposed to remain open otherwise. On Mon, Apr 19, 2010 at 10:56 AM, Ingram Chen ingramc...@gmail.com wrote: Thanks for the information. We do use connection pools with the Thrift client, and ThriftAddress is on port 9160. The problematic connections we found are all on port 7000, which is the internal communications port between nodes. I guess this is related to StreamingService. On Mon, Apr 19, 2010 at 23:46, Brandon Williams dri...@gmail.com wrote: On Mon, Apr 19, 2010 at 10:27 AM, Ingram Chen ingramc...@gmail.com wrote: Hi all, We have observed several connections between nodes in CLOSE_WAIT after several hours of operation: This is symptomatic of not pooling your client connections correctly. Be sure you're using one connection per thread, not one connection per operation. -Brandon -- Ingram Chen online share order: http://dinbendon.net blog: http://www.javaworld.com.tw/roller/page/ingramchen
Re: 0.6 insert performance .... Re: [RELEASE] 0.6.1
It's hard to tell from those slides, but it looks like the slowdown doesn't hit until after several GCs. Perhaps this is compaction kicking in, not GC? The extra I/O + CPU load from compaction will definitely cause a drop in throughput. On Mon, Apr 19, 2010 at 6:14 AM, Masood Mortazavi masoodmortaz...@gmail.com wrote: I wonder if anyone can use the new logging of GC activity (CASSANDRA-813) to confirm this: http://www.slideshare.net/schubertzhang/cassandra-060-insert-throughput - m. On Sun, Apr 18, 2010 at 6:58 PM, Eric Evans eev...@rackspace.com wrote: Hot on the heels of 0.6.0 comes our latest, 0.6.1. This stable point release contains a number of important bugfixes[1] and is a painless upgrade from 0.6.0. Enjoy! [1]: http://bit.ly/9NqwAb (changelog) -- Eric Evans eev...@rackspace.com
Re: Map/Reduce Cassandra Output
Thanks Stu. I will take a look at Hector. Do you know where the input code does the additional work? On Mon, Apr 19, 2010 at 11:20 AM, Stu Hood stu.h...@rackspace.com wrote: If you used that snippet of code, all connections would go through the same seed: the input code does additional work to determine which nodes are holding particular key ranges, and then connects directly. [snip: full message quoted earlier in this thread]
Re: busy thread on IncomingStreamReader ?
On 4/17/10 6:47 PM, Ingram Chen wrote: after upgrading the JDK from 1.6.0_16 to 1.6.0_20, the problem was solved. FYI, this sounds like it might be: https://issues.apache.org/jira/browse/CASSANDRA-896 http://bugs.sun.com/view_bug.do?bug_id=6805775 where garbage collection issues in JVM/JDKs before 7.b70 lead to GC storming, which hoses performance. =Rob
get_range_slices in hector
Is there a version of hector that has an interface to get_range_slices? Or should I provide a patch? Cheers, Chris Dean
Re: Help with MapReduce
Most likely this means that the count() operation is taking too long for the configured RPCTimeout. Counts get unreliable after a certain number of columns under a key, in my experience. jesse -- jesse mcconnell jesse.mcconn...@gmail.com On Mon, Apr 19, 2010 at 19:12, Joost Ouwerkerk jo...@openplaces.org wrote: I'm slowly getting somewhere with Cassandra... I have successfully imported 1.5 million rows using MapReduce. This took about 8 minutes on an 8-node cluster, which is comparable to the time it takes with HBase. Now I'm having trouble scanning this data. I've created a simple MapReduce job that counts rows in my ColumnFamily. The job fails with most tasks throwing the following exception. Anyone have any ideas what's going wrong?

java.lang.RuntimeException: TimedOutException()
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97)
    at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135)
    at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91)
    at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423)
    at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623)
    at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142)
    ... 11 more

On Sun, Apr 18, 2010 at 6:01 PM, Stu Hood stu.h...@rackspace.com wrote: In 0.6.0 and trunk, it is located at src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java You might be using a pre-release version of 0.6 if you are seeing a fat-client-based InputFormat. -Original Message- From: Joost Ouwerkerk jo...@openplaces.org Sent: Sunday, April 18, 2010 4:53pm To: user@cassandra.apache.org Subject: Re: Help with MapReduce Where is the ColumnFamilyInputFormat that uses Thrift? I don't actually have a preference about client; I just want to be consistent with ColumnFamilyInputFormat. On Sun, Apr 18, 2010 at 5:37 PM, Stu Hood stu.h...@rackspace.com wrote: ColumnFamilyInputFormat no longer uses the fat client API, and instead uses Thrift. There are still some significant problems with the fat client, so it shouldn't be used without a good understanding of those problems. If you still want to use it, check out contrib/bmt_example, but I'd recommend that you use Thrift for now. -Original Message- From: Joost Ouwerkerk jo...@openplaces.org Sent: Sunday, April 18, 2010 2:59pm To: user@cassandra.apache.org Subject: Help with MapReduce I'm a Cassandra noob trying to validate Cassandra as a viable alternative to HBase (which we've been using for over a year) for our application. So far, I've had no success getting Cassandra working with MapReduce. My first step is inserting data into Cassandra. I've created a MapRed job using the fat client API. I'm using the fat client (StorageProxy) because that's what ColumnFamilyInputFormat uses and I want to use the same API for both read and write jobs. When I call StorageProxy.mutate(), nothing happens: the job completes as if it had done something, but in fact nothing has changed in the cluster. When I call StorageProxy.mutateBlocking(), I get an IOException complaining that there is no connection to the cluster. I've concluded with the debugger that StorageService is not connecting to the cluster, even though I've specified the correct seed and ListenAddress (I'm using the exact same storage-conf.xml as the nodes in the cluster). I'm sure I'm missing something obvious in the configuration or my setup, but since I'm new to Cassandra, I can't see what it is. Any help appreciated, Joost
Re: Help with MapReduce
Err, not count in your case, but same symptom: Cassandra can't return the answer to your query within the configured RPCTimeout. cheers, jesse -- jesse mcconnell jesse.mcconn...@gmail.com On Mon, Apr 19, 2010 at 19:40, Jesse McConnell jesse.mcconn...@gmail.com wrote: Most likely this means that the count() operation is taking too long for the configured RPCTimeout. Counts get unreliable after a certain number of columns under a key, in my experience. jesse On Mon, Apr 19, 2010 at 19:12, Joost Ouwerkerk jo...@openplaces.org wrote: I'm slowly getting somewhere with Cassandra... I have successfully imported 1.5 million rows using MapReduce. [snip: rest of message and stack trace quoted earlier in this thread]
Re: Help with MapReduce
hmm, might be too much data. In the case of a supercolumn, how do I specify which sub-columns to retrieve? Or can I only retrieve entire supercolumns? On Mon, Apr 19, 2010 at 8:47 PM, Jonathan Ellis jbel...@gmail.com wrote: Possibly you are asking it to retrieve too many columns per row. Possibly there is something else causing poor performance, like swapping. On Mon, Apr 19, 2010 at 7:12 PM, Joost Ouwerkerk jo...@openplaces.org wrote: I'm slowly getting somewhere with Cassandra... I have successfully imported 1.5 million rows using MapReduce. [snip: rest of message and stack trace quoted earlier in this thread]
Re: Help with MapReduce
the latter, if you are retrieving multiple supercolumns. On Mon, Apr 19, 2010 at 8:10 PM, Joost Ouwerkerk jo...@openplaces.org wrote: hmm, might be too much data. In the case of a supercolumn, how do I specify which sub-columns to retrieve? Or can I only retrieve entire supercolumns? On Mon, Apr 19, 2010 at 8:47 PM, Jonathan Ellis jbel...@gmail.com wrote: Possibly you are asking it to retrieve too many columns per row. Possibly there is something else causing poor performance, like swapping. On Mon, Apr 19, 2010 at 7:12 PM, Joost Ouwerkerk jo...@openplaces.org wrote: I'm slowly getting somewhere with Cassandra... I have successfully imported 1.5 million rows using MapReduce. This took about 8 minutes on an 8-node cluster, which is comparable to the time it takes with HBase. Now I'm having trouble scanning this data. I've created a simple MapReduce job that counts rows in my ColumnFamily. The Job fails with most tasks throwing the following Exception. Anyone have any ideas what's going wrong? java.lang.RuntimeException: TimedOutException() at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:165) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:215) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.computeNext(ColumnFamilyRecordReader.java:97) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader.nextKeyValue(ColumnFamilyRecordReader.java:91) at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:423) at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) at org.apache.hadoop.mapred.Child.main(Child.java:170) Caused by: TimedOutException() at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:11015) at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:623) at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:597) at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$RowIterator.maybeInit(ColumnFamilyRecordReader.java:142) ... 11 more On Sun, Apr 18, 2010 at 6:01 PM, Stu Hood stu.h...@rackspace.com wrote: In 0.6.0 and trunk, it is located at src/java/org/apache/cassandra/hadoop/ColumnFamilyInputFormat.java You might be using a pre-release version of 0.6 if you are seeing a fat client based InputFormat. -Original Message- From: Joost Ouwerkerk jo...@openplaces.org Sent: Sunday, April 18, 2010 4:53pm To: user@cassandra.apache.org Subject: Re: Help with MapReduce Where is the ColumnFamilyInputFormat that uses Thrift? I don't actually have a preference about client, I just want to be consistent with ColumnInputFormat. On Sun, Apr 18, 2010 at 5:37 PM, Stu Hood stu.h...@rackspace.com wrote: ColumnFamilyInputFormat no longer uses the fat client API, and instead uses Thrift. There are still some significant problems with the fat client, so it shouldn't be used without a good understanding of those problems. If you still want to use it, check out contrib/bmt_example, but I'd recommend that you use thrift for now. 
Re: 0.6.1 insert 1B rows, crashed when using py_stress
Please also post your jvm-heap and GC options, i.e. the setting in cassandra.in.sh And what about your node hardware? On Tue, Apr 20, 2010 at 9:22 AM, Ken Sandney bluefl...@gmail.com wrote: Hi I am doing an insert test with 9 nodes, the command: stress.py -n 10 -t 1000 -c 10 -o insert -i 5 -d 10.0.0.1,10.0.0.2. and 5 of the 9 nodes crashed; only about 6'500'000 rows were inserted. I checked the system.log and it seems the reason is 'out of memory'. I don't know if this had something to do with my settings. Any idea about this? Thank you, and the following are the errors from system.log ERROR [CACHETABLE-TIMER-1] 2010-04-19 20:43:14,013 CassandraDaemon.java (line 78) Fatal exception in thread Thread[CACHETABLE-TIMER-1,5,main] java.lang.OutOfMemoryError: Java heap space at org.apache.cassandra.utils.ExpiringMap$CacheMonitor.run(ExpiringMap.java:76) at java.util.TimerThread.mainLoop(Timer.java:512) at java.util.TimerThread.run(Timer.java:462) ERROR [ROW-MUTATION-STAGE:9] 2010-04-19 20:43:27,932 CassandraDaemon.java (line 78) Fatal exception in thread Thread[ROW-MUTATION-STAGE:9,5,main] java.lang.OutOfMemoryError: Java heap space at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:893) at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893) at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:192) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:118) at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:108) at org.apache.cassandra.db.RowMutationSerializer.defreezeTheMaps(RowMutation.java:359) at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:369) at org.apache.cassandra.db.RowMutationSerializer.deserialize(RowMutation.java:322) at org.apache.cassandra.db.RowMutationVerbHandler.doVerb(RowMutationVerbHandler.java:45) at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:40) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) and another INFO [GC inspection] 2010-04-19 21:13:09,034 GCInspector.java (line 110) GC for ConcurrentMarkSweep: 2016 ms, 1239096 reclaimed leaving 1094238944 used; max is 1211826176 ERROR [Thread-14] 2010-04-19 21:23:18,508 CassandraDaemon.java (line 78) Fatal exception in thread Thread[Thread-14,5,main] java.lang.OutOfMemoryError: Java heap space at sun.nio.ch.Util.releaseTemporaryDirectBuffer(Util.java:67) at sun.nio.ch.IOUtil.read(IOUtil.java:212) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:236) at sun.nio.ch.SocketAdaptor$SocketInputStream.read(SocketAdaptor.java:176) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:86) at java.io.InputStream.read(InputStream.java:85) at sun.nio.ch.ChannelInputStream.read(ChannelInputStream.java:64) at java.io.DataInputStream.readInt(DataInputStream.java:370) at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:70) ERROR [COMPACTION-POOL:1] 2010-04-19 21:23:18,514 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at
org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86) at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.OutOfMemoryError: Java heap space INFO [FLUSH-WRITER-POOL:1] 2010-04-19 21:23:25,600 Memtable.java (line 162) Completed flushing /m/cassandra/data/Keyspace1/Standard1-623-Data.db ERROR [Thread-13] 2010-04-19 21:23:18,514 CassandraDaemon.java (line 78) Fatal exception in thread Thread[Thread-13,5,main] java.lang.OutOfMemoryError: Java heap space ERROR [Thread-15] 2010-04-19 21:23:18,514 CassandraDaemon.java (line 78) Fatal exception in thread Thread[Thread-15,5,main] java.lang.OutOfMemoryError: Java heap space ERROR [CACHETABLE-TIMER-1] 2010-04-19 21:23:18,514 CassandraDaemon.java (line 78) Fatal exception in thread
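The GC inspection line above is worth decoding: even after a full ConcurrentMarkSweep pass the heap stays within a few percent of its ceiling, which is why the very next allocations fail. A quick check on the numbers copied from that log line:

    public class HeapHeadroom {
        public static void main(String[] args) {
            long used = 1094238944L, max = 1211826176L;  // "used" and "max" from the GCInspector line
            System.out.println(used * 100 / max + "% of the heap still occupied after CMS"); // 90%
            System.out.println((max - used) / (1 << 20) + " MB of headroom left");           // 112 MB
        }
    }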
Re: 0.6.1 insert 1B rows, crashed when using py_stress
Seems you should configure a larger jvm-heap. On Tue, Apr 20, 2010 at 9:32 AM, Schubert Zhang zson...@gmail.com wrote: Please also post your jvm-heap and GC options, i.e. the setting in cassandra.in.sh And what about your node hardware? On Tue, Apr 20, 2010 at 9:22 AM, Ken Sandney bluefl...@gmail.com wrote: Hi I am doing an insert test with 9 nodes ... (rest of the quoted message, including the stack traces, is identical to the previous post and omitted)
Re: Help with MapReduce
And when retrieving only one supercolumn? Can I further specify which subcolumns to retrieve? On Mon, Apr 19, 2010 at 9:29 PM, Jonathan Ellis jbel...@gmail.com wrote: the latter, if you are retrieving multiple supercolumns. On Mon, Apr 19, 2010 at 8:10 PM, Joost Ouwerkerk jo...@openplaces.org wrote: hmm, might be too much data. In the case of a supercolumn, how do I specify which sub-columns to retrieve? Or can I only retrieve entire supercolumns? On Mon, Apr 19, 2010 at 8:47 PM, Jonathan Ellis jbel...@gmail.com wrote: Possibly you are asking it to retrieve too many columns per row. Possibly there is something else causing poor performance, like swapping. ... (rest of the quoted thread, including the TimedOutException stack trace and the original message, is identical to the earlier posts and omitted)
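Outside Hadoop, the raw Thrift API does let you reach individual subcolumns: setting super_column on the ColumnParent scopes the slice to one supercolumn, and the predicate then names the subcolumns. A hedged fragment against the 0.6-era signature (keyspace was still an explicit argument then); the keyspace, column family, and column names are placeholders:

    import java.util.Arrays;
    import java.util.List;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ColumnOrSuperColumn;
    import org.apache.cassandra.thrift.ColumnParent;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.SlicePredicate;

    public class SubcolumnSlice {
        // client: an already-opened Thrift connection (see the pooling sketch later in this digest).
        static List<ColumnOrSuperColumn> readSubcolumns(Cassandra.Client client) throws Exception {
            ColumnParent parent = new ColumnParent("MySuperCF");
            parent.setSuper_column("sc1".getBytes());        // scope the slice to one supercolumn
            SlicePredicate predicate = new SlicePredicate();
            predicate.setColumn_names(Arrays.asList(         // fetch only these subcolumns
                    "subA".getBytes(), "subB".getBytes()));
            return client.get_slice("MyKeyspace", "rowKey1", parent, predicate,
                                    ConsistencyLevel.ONE);
        }
    }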
Re: 0.6.1 insert 1B rows, crashed when using py_stress
On Mon, Apr 19, 2010 at 9:06 PM, Schubert Zhang zson...@gmail.com wrote: 2. Reject requests when short of resources, instead of throwing OOME and exiting (crashing). Right, that is the crux of the problem. It will be addressed here: https://issues.apache.org/jira/browse/CASSANDRA-685 -Brandon
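The general pattern behind "reject instead of OOM" is a bounded work queue whose overflow surfaces as an error that can be translated into backpressure on the client. The sketch below only illustrates that pattern in plain java.util.concurrent terms; it is not the CASSANDRA-685 patch itself, and the pool and queue sizes are arbitrary:

    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.RejectedExecutionException;
    import java.util.concurrent.ThreadPoolExecutor;
    import java.util.concurrent.TimeUnit;

    public class BoundedStage {
        public static void main(String[] args) {
            // A stage with a bounded backlog: when the queue fills, execute() throws
            // RejectedExecutionException instead of buffering work until the heap dies.
            ThreadPoolExecutor mutationStage = new ThreadPoolExecutor(
                    8, 8, 60L, TimeUnit.SECONDS,
                    new ArrayBlockingQueue<Runnable>(4096),
                    new ThreadPoolExecutor.AbortPolicy());
            Runnable mutation = new Runnable() { public void run() { /* apply the write */ } };
            try {
                mutationStage.execute(mutation);
            } catch (RejectedExecutionException e) {
                // Translate into backpressure on the client rather than letting the node OOM.
            }
            mutationStage.shutdown();
        }
    }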
Re: 0.6.1 insert 1B rows, crashed when using py_stress
I am just running Cassandra on normal boxes, and granting 1GB of a total 2GB to Cassandra is reasonable, I think. Can this problem be resolved by tuning the thresholds described on this page http://wiki.apache.org/cassandra/MemtableThresholds , or just by waiting for the 0.7 release as Brandon mentioned? On Tue, Apr 20, 2010 at 10:15 AM, Jonathan Ellis jbel...@gmail.com wrote: Schubert, I don't know if you saw this in the other thread referencing your slides: It looks like the slowdown doesn't hit until after several GCs, although it's hard to tell since the scale is different on the GC graph and the insert throughput ones. Perhaps this is compaction kicking in, not GCs? Definitely the extra I/O + CPU load from compaction will cause a drop in throughput. On Mon, Apr 19, 2010 at 9:06 PM, Schubert Zhang zson...@gmail.com wrote: -Xmx1G is too small. In my cluster there is 8GB ram on each node, and I grant 6GB to cassandra. Please see my test @ http://www.slideshare.net/schubertzhang/presentations – Memory and GC are always the bottleneck and a big issue for java-based infrastructure software! References: – http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts – https://issues.apache.org/jira/browse/CASSANDRA-896 (LinkedBlockingQueue issue, fixed in jdk-6u19) In fact, whenever I use java-based infrastructure software such as Cassandra, Hadoop, or HBase, I am eventually pained by such memory/GC issues. Then we have to provision higher-end hardware with more RAM (such as 32GB~64GB) and more CPU cores (such as 8~16). And we still cannot control the Out-Of-Memory-Error. I am thinking, maybe it is not right to leave the job of memory control to the JVM. I have long experience in telecom and embedded software over the past ten years, which demands robust programs and small RAM. I want to discuss the following ideas with the community: 1. Manage the memory ourselves: allocate objects/resources (memory) at the initialization phase, and assign instances at runtime. 2. Reject requests when short of resources, instead of throwing OOME and exiting (crashing). 3. I know, it is not easy in a java program. Schubert On Tue, Apr 20, 2010 at 9:40 AM, Ken Sandney bluefl...@gmail.com wrote: here are my JVM options; by default I didn't modify them, from cassandra.in.sh # Arguments to pass to the JVM JVM_OPTS= \ -ea \ -Xms128M \ -Xmx1G \ -XX:TargetSurvivorRatio=90 \ -XX:+AggressiveOpts \ -XX:+UseParNewGC \ -XX:+UseConcMarkSweepGC \ -XX:+CMSParallelRemarkEnabled \ -XX:+HeapDumpOnOutOfMemoryError \ -XX:SurvivorRatio=128 \ -XX:MaxTenuringThreshold=0 \ -Dcom.sun.management.jmxremote.port=8080 \ -Dcom.sun.management.jmxremote.ssl=false \ -Dcom.sun.management.jmxremote.authenticate=false and my box is a normal pc with 2GB ram, Intel E3200 @ 2.40GHz. By the way, I am using the latest Sun JDK On Tue, Apr 20, 2010 at 9:33 AM, Schubert Zhang zson...@gmail.com wrote: Seems you should configure a larger jvm-heap. ... (rest of the quoted thread, including the stack traces, is identical to the earlier posts and omitted)
Re: busy thread on IncomingStreamReader ?
Ouch! I spoke too early! We still suffer the same problems after upgrading to 1.6.0_20. In JMX StreamingService, I see several weird incoming/outgoing transfers: In Host A, 192.168.2.87 StreamingService Status: Done with transfer to /192.168.2.88 StreamingService StreamSources: [/192.168.2.88] StreamingService StreamDestinations: [/192.168.2.88] StreamingService getIncomingFiles=192.168.2.88 [ UserState: /var/lib/cassandra/data/UserState/multiMine-tmp-11-Index.db 0/5718, UserState: /var/lib/cassandra/data/UserState/multiMine-tmp-11-Filter.db 0/325, UserState: /var/lib/cassandra/data/UserState/multiMine-tmp-11-Data.db 0/29831, UserState: /var/lib/cassandra/data/UserState/csArena-tmp-13-Index.db 0/47623, ... omit several 0 received pending files. UserState: /var/lib/cassandra/data/UserState/battleCity2-tmp-19-Data.db 0/355041, UserState: /var/lib/cassandra/data/UserState/mahjong-tmp-12-Data.db 27711/2173906, UserState: /var/lib/cassandra/data/UserState/darkChess-tmp-12-Data.db 27711/18821998, UserState: /var/lib/cassandra/data/UserState/battleCity2-tmp-6-Data.db 27711/743037, UserState: /var/lib/cassandra/data/UserState/big2-tmp-12-Index.db 27711/189214, UserState: /var/lib/cassandra/data/UserState/facebookPoker99-tmp-6-Data.db 27711/1892375, UserState: /var/lib/cassandra/data/UserState/facebookPoker99-tmp-6-Index.db 27711/143216, UserState: /var/lib/cassandra/data/UserState/csArena-tmp-6-Data.db 27711/201188, UserState: /var/lib/cassandra/data/UserState/darkChess-tmp-12-Index.db 27711/354923, UserState: /var/lib/cassandra/data/UserState/big2-tmp-12-Data.db 27711/1260768, UserState: /var/lib/cassandra/data/UserState/mahjong-tmp-12-Index.db 27711/332649, UserState: /var/lib/cassandra/data/UserState/battleCity2-tmp-6-Index.db 27711/39739 ] Lots of files stalled after receiving 27711 bytes; this strange number is the length of the first incoming file, see Host B. Host B, 192.168.2.88 StreamingService Status: Receiving stream StreamingService StreamSources: [/192.168.2.87] StreamingService StreamDestinations: [/192.168.2.87] StreamingService getOutgoingFiles=192.168.2.87 [ /var/lib/cassandra/data/UserState/stream/csArena-6-Index.db 27711/27711, /var/lib/cassandra/data/UserState/stream/csArena-6-Filter.db 0/1165, /var/lib/cassandra/data/UserState/stream/csArena-6-Data.db 0/201188, ... omit pending outgoing files ] It seems that the outgoing transfer does not terminate properly, causing the receiver to go into an infinite loop and busy the thread. From the thread dump, it looks like fc.transferFrom() in IncomingStreamReader never returns: while (bytesRead < pendingFile.getExpectedBytes()) { bytesRead += fc.transferFrom(socketChannel, bytesRead, FileStreamTask.CHUNK_SIZE); pendingFile.update(bytesRead); } On Tue, Apr 20, 2010 at 05:48, Rob Coli rc...@digg.com wrote: On 4/17/10 6:47 PM, Ingram Chen wrote: after upgrading jdk from 1.6.0_16 to 1.6.0_20, the problem was solved. FYI, this sounds like it might be: https://issues.apache.org/jira/browse/CASSANDRA-896 http://bugs.sun.com/view_bug.do;jsessionid=60c39aa55d3666c0c84dd70eb826?bug_id=6805775 Where garbage collection issues in JVM/JDKs before 7.b70 lead to GC storming which hoses performance. =Rob
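Whether transferFrom() is blocked or just making no progress, the loop quoted above has no escape hatch: FileChannel.transferFrom() is allowed to return 0 when no bytes are available from the source, so if the sender stalls mid-file the receiver can spin forever. A defensive variant might look like the fragment below; the names come from the quoted code, MAX_IDLE_ROUNDS is a hypothetical threshold, and this is a sketch of the idea, not the project's actual fix:

    long bytesRead = 0;
    int idleRounds = 0;
    while (bytesRead < pendingFile.getExpectedBytes()) {
        long n = fc.transferFrom(socketChannel, bytesRead, FileStreamTask.CHUNK_SIZE);
        if (n == 0) {
            // No progress: the sender may have stalled or closed; don't spin forever.
            if (++idleRounds > MAX_IDLE_ROUNDS)
                throw new IOException("stream stalled at " + bytesRead + " bytes");
        } else {
            idleRounds = 0;
            bytesRead += n;
            pendingFile.update(bytesRead);
        }
    }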
Re: 0.6.1 insert 1B rows, crashed when using py_stress
Ken, I linked you to the FAQ answering your problem in the first reply you got. Please don't hijack my replies to other people; that's rude. On Mon, Apr 19, 2010 at 9:32 PM, Ken Sandney bluefl...@gmail.com wrote: I am just running Cassandra on normal boxes, and granting 1GB of a total 2GB to Cassandra is reasonable, I think. ... (rest of the quoted thread is identical to the earlier posts and omitted)
Re: Clarification on Ring operations in Cassandra 0.5.1
You can have a look at org.apache.cassandra.service.StorageService public void initServer() throws IOException 1. If AutoBootstrap=false, it means the node is already bootstrapped (not a new node). Usually, the first node is set to false. (1) check the system table to find the saved token; if found, use it, otherwise, (2) check the config for InitialToken; if configured, use it, otherwise, (3) getRandomToken Please refer to org.apache.cassandra.service.StorageService public void initServer() throws IOException and org.apache.cassandra.db.SystemTable public static synchronized StorageMetadata initMetadata() throws IOException 2. If AutoBootstrap=true, it means the node is a new node. Usually, the other new nodes set AutoBootstrap=true. (1) If the seeds include this node itself, go to 1 above. otherwise, (2) If the node is already bootstrapped (check the system table), go to 1 above. otherwise, (3) Get load information of other nodes via Gossip, and wait a while. (4) If InitialToken is configured, use it. otherwise, (5) Find the node token with the heaviest load. In my use case, I always configure InitialToken for each node of a new cluster; then I get good load balance. But when adding a new node to a running cluster (with much data), I let cassandra find the token via load-checking. Schubert On Tue, Apr 20, 2010 at 7:48 AM, Anthony Molinaro antho...@alumni.caltech.edu wrote: On Mon, Apr 19, 2010 at 03:28:26PM -0500, Jonathan Ellis wrote: Can I then 'nodeprobe move token for range I want to take over', and achieve the same as step 2 above? You can't have two nodes with the same token in the ring at once. So, you can removetoken the old node first, then bootstrap the new one (just specify InitialToken in the config to avoid having it guess one), or you can make it a 3-step process (bootstrap, remove, move) to avoid transferring so much data around. So I'm still a little fuzzy, for your 3-step case, on why less data moves, but let me run through the two scenarios and see where we get. Please correct me if I'm wrong on some point. Let's say I have 3 nodes with the random partitioner and rack-unaware strategy, which means I have something like: Node Size Token KeyRange (self + next in ring): A 5G 33 1-66; B 6G 66 34-0; C 2G 0 67-33. Now let's say Node B is giving us some problems, so we want to replace it with another node D. We've outlined 2 processes. In the first process you recommend: 1. removetoken on node B 2. wait for data to move 3. add InitialToken of 66 and AutoBootstrap = true to node D storage-conf.xml, then start it 4. wait for data to move So when you do the removetoken, this will cause the following transfers at stage 2: Node A sends 34-66 to Node C; Node C sends 67-0 to Node A. At stage 4: Node A sends 34-66 to Node D; Node C sends 67-0 to Node D. In the second process I assume you pick a token really close to another token? 1. add InitialToken of 34 and AutoBootstrap = true to node D storage-conf.xml, then start it 2. wait for data to move 3. removetoken on node B 4. wait for data to move 5. movetoken on node D to 66 6. wait for data to move This results in the following moves at stage 2: Node A/B sends 33-34 to Node D (primary token range); Node B sends 34-66 to Node D (replica range). At stage 4: Node C sends 66-0 to Node D (replica range). At stage 6: no data movement, as D already had 33-0. So it seems like you move all the data twice in process 1 and only a small portion twice in process 2 (which is what you said, so hopefully I've outlined correctly what is happening). Does all that sound right? 
Once I've run bootstrap with the InitialToken value set in the config, is it then ignored in subsequent restarts, and if so can I just remove it after that first time? Thanks, -Anthony -- Anthony Molinaro antho...@alumni.caltech.edu
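For the InitialToken values discussed above: the 0-100 style tokens in Anthony's example are illustrative, but real RandomPartitioner tokens live in [0, 2^127), so evenly spaced tokens for an N-node cluster are simply i * 2^127 / N for node i. A small self-contained sketch for computing them:

    import java.math.BigInteger;

    public class TokenSpacing {
        public static void main(String[] args) {
            int n = 3;  // number of nodes; 3 matches the example above
            BigInteger range = BigInteger.valueOf(2).pow(127);  // RandomPartitioner token space
            for (int i = 0; i < n; i++) {
                BigInteger token = range.multiply(BigInteger.valueOf(i))
                                        .divide(BigInteger.valueOf(n));
                System.out.println("node " + i + ": InitialToken = " + token);
            }
        }
    }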
Re: 0.6.1 insert 1B rows, crashed when using py_stress
Sorry I just don't know how to resolve this :) On Tue, Apr 20, 2010 at 10:37 AM, Jonathan Ellis jbel...@gmail.com wrote: Ken, I linked you to the FAQ answering your problem in the first reply you got. Please don't hijack my replies to other people; that's rude. ... (rest of the quoted thread is identical to the earlier posts and omitted)
Re: busy thread on IncomingStreamReader ?
I don't see csArena-tmp-6-Index.db in the incoming files list. If it's not there, that means that it did break out of that while loop. Did you check both logs for exceptions? On Mon, Apr 19, 2010 at 9:36 PM, Ingram Chen ingramc...@gmail.com wrote: Ouch! I spoke too early! We still suffer the same problems after upgrading to 1.6.0_20. ... (rest of the quoted message is identical to the previous post and omitted)
Re: 0.6.1 insert 1B rows, crashed when using py_stress
Jonathan, Thanks. Yes, the scale of the GC graph is different from the throughput one. I will do more checking and tuning in our next test immediately. On Tue, Apr 20, 2010 at 10:39 AM, Ken Sandney bluefl...@gmail.com wrote: Sorry I just don't know how to resolve this :) ... (rest of the quoted thread is identical to the earlier posts and omitted)
Re: tcp CLOSE_WAIT bug
this happened after several hours of operation, and both nodes were started at the same time (clean start without any data), so it might not be related to bootstrap. In system.log I do not see any logs like xxx node dead or exceptions, and both nodes in the test are alive. They serve reads/writes well, too. The four connections between nodes below stay healthy over time: tcp 0 0 :::192.168.2.87:7000 :::192.168.2.88:58447 ESTABLISHED tcp 0 0 :::192.168.2.87:54986 :::192.168.2.88:7000 ESTABLISHED tcp 0 0 :::192.168.2.87:59138 :::192.168.2.88:7000 ESTABLISHED tcp 0 0 :::192.168.2.87:7000 :::192.168.2.88:39074 ESTABLISHED So the connections ending in CLOSE_WAIT should be newly created ones (for streaming?). This seems related to the streaming issues we suffered recently: http://n2.nabble.com/busy-thread-on-IncomingStreamReader-td4908640.html I would like to add some debug code around the opening and closing of sockets to find out what happened. Could you give me a hint about which classes I should look at? On Tue, Apr 20, 2010 at 04:47, Jonathan Ellis jbel...@gmail.com wrote: Is this after doing a bootstrap or other streaming operation? Or did a node go down? The internal sockets are supposed to remain open, otherwise. On Mon, Apr 19, 2010 at 10:56 AM, Ingram Chen ingramc...@gmail.com wrote: Thanks for your information. We do use connection pools with the thrift client, and ThriftAddress is on port 9160. The problematic connections we found are all on port 7000, which is the internal communications port between nodes. I guess this is related to StreamingService. On Mon, Apr 19, 2010 at 23:46, Brandon Williams dri...@gmail.com wrote: On Mon, Apr 19, 2010 at 10:27 AM, Ingram Chen ingramc...@gmail.com wrote: Hi all, We have observed several connections between nodes in CLOSE_WAIT after several hours of operation: This is symptomatic of not pooling your client connections correctly. Be sure you're using one connection per thread, not one connection per operation. -Brandon -- Ingram Chen online share order: http://dinbendon.net blog: http://www.javaworld.com.tw/roller/page/ingramchen
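Brandon's one-connection-per-thread advice can be implemented with a ThreadLocal around the Thrift client. A sketch against the 0.6-era API (default RPC port 9160, unframed transport); the host is an example and error handling is minimal:

    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.thrift.protocol.TBinaryProtocol;
    import org.apache.thrift.transport.TSocket;

    public class ClientPerThread {
        // Each thread opens one socket on first use and then reuses it for every operation.
        private static final ThreadLocal<Cassandra.Client> CLIENT =
                new ThreadLocal<Cassandra.Client>() {
                    @Override protected Cassandra.Client initialValue() {
                        try {
                            TSocket socket = new TSocket("192.168.2.87", 9160);
                            socket.open();
                            return new Cassandra.Client(new TBinaryProtocol(socket));
                        } catch (Exception e) {
                            throw new RuntimeException(e);
                        }
                    }
                };

        public static Cassandra.Client get() { return CLIENT.get(); }
    }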
Re: 0.6 insert performance .... Re: [RELEASE] 0.6.1
Since the scale of the GC graph in the slides is different from the throughput ones, I will do another test for this issue. Thanks for your advice, Masood and Jonathan. --- Here, I just post my cassandra.in.sh. JVM_OPTS= \ -ea \ -Xms128M \ -Xmx6G \ -XX:TargetSurvivorRatio=90 \ -XX:+AggressiveOpts \ -XX:+UseParNewGC \ -XX:+UseConcMarkSweepGC \ -XX:+CMSParallelRemarkEnabled \ -XX:SurvivorRatio=128 \ -XX:MaxTenuringThreshold=0 \ -Dcom.sun.management.jmxremote.port=8081 \ -Dcom.sun.management.jmxremote.ssl=false \ -Dcom.sun.management.jmxremote.authenticate=false On Tue, Apr 20, 2010 at 5:46 AM, Masood Mortazavi masoodmortaz...@gmail.com wrote: Minimizing GC pauses or minimizing time slots allocated to GC pauses -- either through configuration or re-implementations of garbage collection bottlenecks (i.e. object-generation bottlenecks) -- seems to be the immediate approach. (Other approaches appear to be more intrusive.) At code level, using the GC logs, one can investigate further. There may be places where some object recycling can make a larger difference. Trying this first will probably bear more immediate fruit. - m. On Mon, Apr 19, 2010 at 9:11 AM, Daniel Kluesing d...@bluekai.com wrote: We see this behavior as well with 0.6; heap usage graphs look almost identical. The GC is a noticeable bottleneck; we've tried the jdk 6u19 and JRockit VMs. It basically kills any kind of soft real time behavior. From: Masood Mortazavi [mailto:masoodmortaz...@gmail.com] Sent: Monday, April 19, 2010 4:15 AM To: user@cassandra.apache.org; d...@cassandra.apache.org Subject: 0.6 insert performance Re: [RELEASE] 0.6.1 ... (rest of the quoted message and the 0.6.1 release announcement appear earlier in the digest and are omitted)
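For visibility into GC on builds without CASSANDRA-813, the quoted JVM_OPTS can be extended with the standard HotSpot flags -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps, or the JDK's management beans can be polled from inside the process. A sketch of the latter, roughly what a GC inspector does though not the project's implementation; schedule it every few seconds and log the deltas:

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    public class GcWatcher implements Runnable {
        public void run() {
            // Cumulative per-collector counters; compare successive samples to spot GC storms.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.println(gc.getName()
                        + ": collections=" + gc.getCollectionCount()
                        + ", totalTimeMs=" + gc.getCollectionTime());
            }
        }
    }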
Re: why read operation use so much of memory?
On Mon, Apr 19, 2010 at 10:28 PM, dir dir sikerasa...@gmail.com wrote: Hi Jonathan, I see this page (http://wiki.apache.org/cassandra/SSTableMemtable) does not exist yet. I think he meant: http://wiki.apache.org/cassandra/MemtableSSTable -Brandon
Re: Help with MapReduce
Ok. This should be ok for now, although not optimal for some jobs. Next issue is node stability during the insert job. The stacktrace below occurred on several nodes while inserting 10 million rows. We're running on 4G machines, 1G of which is allocated to cassandra. What's the best config to prevent OOMs (even if it means sacrificing some performance)? ERROR [COMPACTION-POOL:1] 2010-04-20 01:39:15,853 DebuggableThreadPoolExecutor.java (line 94) Error in executor futuretask java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: Java heap space at java.util.concurrent.FutureTask$Sync.innerGet(FutureTask.java:222) at java.util.concurrent.FutureTask.get(FutureTask.java:83) at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.afterExecute(DebuggableThreadPoolExecutor.java:86) at org.apache.cassandra.db.CompactionManager$CompactionExecutor.afterExecute(CompactionManager.java:582) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:888) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908) at java.lang.Thread.run(Thread.java:619) Caused by: java.lang.OutOfMemoryError: Java heap space at java.util.Arrays.copyOf(Arrays.java:2786) at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:94) at java.io.DataOutputStream.write(DataOutputStream.java:90) at java.io.FilterOutputStream.write(FilterOutputStream.java:80) at org.apache.cassandra.db.ColumnSerializer.writeName(ColumnSerializer.java:39) at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:301) at org.apache.cassandra.db.SuperColumnSerializer.serialize(SuperColumn.java:284) at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87) at org.apache.cassandra.db.ColumnFamilySerializer.serializeWithIndexes(ColumnFamilySerializer.java:99) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:131) at org.apache.cassandra.io.CompactionIterator.getReduced(CompactionIterator.java:41) at org.apache.cassandra.utils.ReducingIterator.computeNext(ReducingIterator.java:73) at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:135) at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:130) at org.apache.commons.collections.iterators.FilterIterator.setNextObject(FilterIterator.java:183) at org.apache.commons.collections.iterators.FilterIterator.hasNext(FilterIterator.java:94) at org.apache.cassandra.db.CompactionManager.doCompaction(CompactionManager.java:284) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:102) at org.apache.cassandra.db.CompactionManager$1.call(CompactionManager.java:83) at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303) at java.util.concurrent.FutureTask.run(FutureTask.java:138) at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886) ... 2 more On Mon, Apr 19, 2010 at 10:34 PM, Jonathan Ellis jbel...@gmail.com wrote: Oh, from Hadoop. Yes, you are indeed limited to entire columns or supercolumns at a time there.
Re: Help with MapReduce
http://wiki.apache.org/cassandra/FAQ#slows_down_after_lotso_inserts On Tue, Apr 20, 2010 at 12:48 AM, Joost Ouwerkerk jo...@openplaces.org wrote: Ok. This should be ok for now, although not optimal for some jobs. Next issue is node stability during the insert job. ... (rest of the quoted message, including the stack trace, is identical to the previous post and omitted)