Re: Starting Cassandra Fauna

2010-04-14 Thread Nirmala Agadgar
Hi,

Can anyone please list the steps to install and run Cassandra on CentOS?
That would help me follow along, check where I went wrong, and get it running correctly.
Also, if I wanted to insert some data programmatically, where do I need to
place the code in Fauna? Can anyone help me with this?

On Mon, Apr 12, 2010 at 10:36 PM, Ryan King r...@twitter.com wrote:

 I'm guessing you missed the ant ivy-retrieve step.

 We're planning on releasing a new gem today that should fix this issue.

 -ryan

 On Mon, Apr 12, 2010 at 3:30 AM, Nirmala Agadgar nirmala...@gmail.com
 wrote:
  Hi,
 
  Yes, I used only master.
  I downloaded the tar file, placed it in the cassandra folder and ran
  cassandra_helper cassandra again.
  Now I am getting:
  Error: Exception thrown by the agent : java.net.MalformedURLException:
  Local host name
  When I set the hostname to localhost or 127.0.0.1,
  I get: Exception in thread "main" java.lang.NoClassDefFoundError:
  org/apache/log4j/Logger
  at
  org.apache.cassandra.thrift.CassandraDaemon.<clinit>(CassandraDaemon.java:55)
  How do I solve this?
  Can anyone tell me the steps to run Cassandra, or what configuration needs to be done?
 
  -
  Nirmala
 
 
  On Sat, Apr 10, 2010 at 10:48 PM, Jeff Hodges jhod...@twitter.com
 wrote:
 
  Did you try master? We fixed this around the 7th, but haven't made a
  release yet.
  --
  Jeff
 
  On Sat, Apr 10, 2010 at 10:10 AM, Nirmala Agadgar nirmala...@gmail.com
 
  wrote:
   Hi,
  
   I tried to dig into the problem and found:
   1) DIST_URL points to
   http://apache.osuosl.org/incubator/cassandra/0.6.0/apache-cassandra-0.6.0-beta2-bin.tar.gz
   and there is no resource at that location (in the Rakefile of the Cassandra gem):
   DIST_URL =
   http://apache.osuosl.org/incubator/cassandra/0.6.0/apache-cassandra-0.6.0-beta2-bin.tar.gz

   2) It does not execute after
   sh tar xzf #{DIST_FILE}

   Can anyone help with this problem?
   Where should the tar file be downloaded to?
  
  
   On Fri, Apr 9, 2010 at 3:28 AM, Jeff Hodges jhod...@twitter.com
 wrote:
  
   While I wasn't able to reproduce the error, we did have another pop
   up. I think I may have actually fixed your problem the other day.
 Pull
   the latest master from fauna/cassandra and you should be good to go.
   --
   Jeff
  
   On Thu, Apr 8, 2010 at 10:51 AM, Ryan King r...@twitter.com wrote:
Yeah, this is a known issue, we're working on it today.
   
-ryan
   
On Thu, Apr 8, 2010 at 10:31 AM, Jonathan Ellis jbel...@gmail.com
 
wrote:
Sounds like it's worth reporting on the github project then.
   
On Thu, Apr 8, 2010 at 11:53 AM, Paul Prescod pres...@gmail.com
wrote:
On Thu, Apr 8, 2010 at 9:49 AM, Jonathan Ellis 
 jbel...@gmail.com
wrote:
 cassandra_helper does a bunch of magic to set things up. It looks like
 the "extract a private copy of cassandra 0.6 beta2" part of the magic
 is failing. You'll probably need to manually attempt the un-tar to
 figure out why it is bailing.
   
Yes, I had the same problem. I didn't dig into it, but perhaps
 all
users have this problem now.
   
 Paul Prescod
   
   
   
  
  
 
 



Re: GC options

2010-04-14 Thread Benjamin Black
FYI, G1 has been in 1.6 since u14.

2010/4/13 Peter Schüller sc...@spotify.com:
 I'm working on getting our latency as consistent as possible, and the gc 
 likes to kick off 60+ms periods of unavailability for a node, which for my 
 application leads to a reasonable number of timed out requests. Outside of 
 the gc event, we get good responses.

 I'm happy with reduced throughput for shorter pauses, so I'm going to do the 
 standard jvm gc tuning guide[0] for short pauses, curious if anyone else has 
 gone down this path and gotten gc pauses consistent and low or if what's in 
 bin/cassandra.in.sh is basically the best I should expect. (Anyone tried 
 jrockit?)

 If your situation is such that you are willing to use the unreleased
 JDK 1.7 and G1GC, you can try that (G1 is still marked as experimental and
 may still be a stability concern; since we are talking about storing data,
 that probably means conservatism is called for). It offers some more direct
 control over the target GC pause time, although it does not provide
 guarantees. A potential starting point for VM options may be:

         -XX:+UnlockExperimentalVMOptions
         -XX:+UseG1GC
         -XX:MaxGCPauseMillis=10
         -XX:GCPauseIntervalMillis=15

 And maybe:

         -XX:G1ConfidencePercent=100

 And maybe (not sure of current status but there used to be a known bug
 when enabled):

         -XX:+G1ParallelRSetUpdatingEnabled
         -XX:+G1ParallelRSetScanningEnabled

 --
 / Peter Schuller aka scode



Re: GC options

2010-04-14 Thread Benjamin Black
Got it, thanks

2010/4/13 Peter Schüller sc...@spotify.com:
 FYI, G1 has been in 1.6 since u14.

 Yes, but (last time I checked) in a considerably older form. The JDK
 1.7 one is more mature.

 --
 / Peter Schuller aka scode



Re: History values

2010-04-14 Thread Sylvain Lebresne
 I am new to using Cassandra. From the documentation I have read, I understand
 that, as in other non-document databases, when the value of a key-value tuple
 is updated, the new value is stored with a different timestamp without
 entirely losing the old value.
 I wonder how I can restore the historical values that a particular field
 has had.

You can't. Upon update, the old value is lost.
From a technical standpoint, it is true that this old value is not
deleted (from disk)
right away, but it is deleted eventually by compaction (and you don't
really control
when the compactions occur).

--
Sylvain


Re: History values

2010-04-14 Thread Yésica Rey

Ok, thank you very much for your reply.
I have another question that may seem stupid... Does Cassandra have a graphical
console, like MySQL has for SQL databases?


Regards!


Re: History values

2010-04-14 Thread Bertil Chapuis
I'm also new to Cassandra, and on the same question I asked myself whether
using super columns with one key per version would be feasible. Are there
limitations to this use case (or better practices)?

Thank you and best regards,

Bertil Chapuis

On 14 April 2010 09:45, Sylvain Lebresne sylv...@yakaz.com wrote:

  I am new to using cassandra. In the documentation I have read,
 understand,
  that as in other non-documentary databases, to update the value of a
  key-value tuple, this new value is stored with a timestamp different but
  without entirely losing the old value.
  I wonder, as I can restore the historic values that have had a particular
  field.

 You can't. Upon update, the old value is lost.
 From a technical standpoint, it is true that this old value is not
 deleted (from disk)
 right away, but it is deleted eventually by compaction (and you don't
 really control
 when the compactions occur).

 --
 Sylvain



Re: History values

2010-04-14 Thread Zhiguo Zhang
I think it is still too young; you have to wait or write the graphical
console yourself. At least, I haven't found any so far.

On Wed, Apr 14, 2010 at 10:04 AM, Bertil Chapuis bchap...@gmail.com wrote:

 I'm also new to cassandra and about the same question I asked me if using
 super columns with one key per version was feasible. Is there limitations to
 this use case (or better practices)?

 Thank you and best regards,

 Bertil Chapuis

 On 14 April 2010 09:45, Sylvain Lebresne sylv...@yakaz.com wrote:

  I am new to using cassandra. In the documentation I have read,
 understand,
  that as in other non-documentary databases, to update the value of a
  key-value tuple, this new value is stored with a timestamp different but
  without entirely losing the old value.
  I wonder, as I can restore the historic values that have had a
 particular
  field.

 You can't. Upon update, the old value is lost.
 From a technical standpoint, it is true that this old value is not
 deleted (from disk)
 right away, but it is deleted eventually by compaction (and you don't
 really control
 when the compactions occur).

 --
 Sylvain





Re: History values

2010-04-14 Thread aXqd
On Wed, Apr 14, 2010 at 5:13 PM, Zhiguo Zhang mikewolfx...@gmail.com wrote:
 I think it is still to young, and have to wait or write your self the
 graphical console, at least, I don't find any until now.

Frankly speaking, I'm OK with being without a GUI... but I am really
disappointed by those so-called 'documents'.
I would really prefer to have more documentation in real 'English', written
in a more tutorial style.
I hope I can write some texts myself once I've managed to understand the
current ones.


 On Wed, Apr 14, 2010 at 10:04 AM, Bertil Chapuis bchap...@gmail.com wrote:

 I'm also new to cassandra and about the same question I asked me if using
 super columns with one key per version was feasible. Is there limitations to
 this use case (or better practices)?
 Thank you and best regards,
 Bertil Chapuis
 On 14 April 2010 09:45, Sylvain Lebresne sylv...@yakaz.com wrote:

  I am new to using cassandra. In the documentation I have read,
  understand,
  that as in other non-documentary databases, to update the value of a
  key-value tuple, this new value is stored with a timestamp different
  but
  without entirely losing the old value.
  I wonder, as I can restore the historic values that have had a
  particular
  field.

 You can't. Upon update, the old value is lost.
 From a technical standpoint, it is true that this old value is not
 deleted (from disk)
 right away, but it is deleted eventually by compaction (and you don't
 really control
 when the compactions occur).

 --
 Sylvain





server crash - how to investigate

2010-04-14 Thread Ran Tavory
I'm running a 0.6.0 cluster with four nodes and one of them just crashed.

The logs all seem normal and I haven't seen anything special in the jmx
counters before the crash.

I have one client writing and reading using 10 threads and using 3 different
column families: KvAds, KvImpressions and KvUsers

The client got a few UnavailableException, TimedOutException and
TTransportException errors, but was able to complete the read/write operations
by failing over to another available host. I can't tell if the exceptions were
from the crashed host or from other hosts in the ring.

Any hints on how to investigate this are greatly appreciated. So far I'm
lost...

Here's a snippet from the log just before it went down. It doesn't seem to
have anything special in it, everything is INFO level.

The only thing that seems a bit strange is the last message: Compacting [].
This message usually comes with things inside the [], such as Compacting
[org.apache.cassandra.io.SSTableReader(path='/outbrain/cassdata/data/system/LocationInfo-1-Data.db'),...]
but this time it was just empty.
However, this is not the only place in the log where I see an empty
Compacting []. There are other places, and they didn't end up in a crash, so
I don't know if it's related.

here's the log:
 INFO [ROW-MUTATION-STAGE:6] 2010-04-14 05:55:07,014 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271238432773.log',
position=68606651)
 INFO [ROW-MUTATION-STAGE:6] 2010-04-14 05:55:07,015 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@258729366
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:55:07,015 Memtable.java (line 148)
Writing Memtable(KvImpressions)@258729366
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:55:10,130 Memtable.java (line 162)
Completed flushing
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-24-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:55:10,154 CommitLog.java (line 407)
Discarding obsolete commit
log:CommitLogSegment(/outbrain/cassdata/commitlog/CommitLog-1271238049425.log)
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,415
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-16-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,440
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-8-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,454
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-10-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,526
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-5-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,585
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-11-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,602
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-11-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,614
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvAds-9-Data.db
 INFO [SSTABLE-CLEANUP-TIMER] 2010-04-14 05:55:28,682
SSTableDeletingReference.java (line 104) Deleted
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-21-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:55:52,254 CommitLogSegment.java
(line 50) Creating new commitlog segment
/outbrain/cassdata/commitlog/CommitLog-1271238952254.log
 INFO [ROW-MUTATION-STAGE:16] 2010-04-14 05:56:25,347 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271238952254.log',
position=47568158)
 INFO [ROW-MUTATION-STAGE:16] 2010-04-14 05:56:25,348 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@1955587316
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:56:25,348 Memtable.java (line 148)
Writing Memtable(KvImpressions)@1955587316
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:56:30,572 Memtable.java (line 162)
Completed flushing
/outbrain/cassdata/data/outbrain_kvdb/KvImpressions-25-Data.db
 INFO [COMMIT-LOG-WRITER] 2010-04-14 05:57:26,790 CommitLogSegment.java
(line 50) Creating new commitlog segment
/outbrain/cassdata/commitlog/CommitLog-1271239046790.log
 INFO [ROW-MUTATION-STAGE:7] 2010-04-14 05:57:59,513 ColumnFamilyStore.java
(line 357) KvImpressions has reached its threshold; switching in a fresh
Memtable at
CommitLogContext(file='/outbrain/cassdata/commitlog/CommitLog-1271239046790.log',
position=24265615)
 INFO [ROW-MUTATION-STAGE:7] 2010-04-14 05:57:59,513 ColumnFamilyStore.java
(line 609) Enqueuing flush of Memtable(KvImpressions)@1617250066
 INFO [FLUSH-WRITER-POOL:1] 2010-04-14 05:57:59,513 Memtable.java (line 148)
Writing Memtable(KvImpressions)@1617250066
 INFO [FLUSH-WRITER-POOL:1] 

Re: RE : Re: RE : Re: Two dimensional matrices

2010-04-14 Thread Philippe
 I'm confused: don't range queries such as the ones we've been
  discussing require using an order-preserving partitioner?

 Alright, so distribution depends on your choice of token.

Ah yes, I get it now: with a naive order-preserving partitioner, the key is
associated with the node whose token is numerically closest, and that is
where the master replica is located. Yes?

Now let's assume I am using super columns as {X} and columns as {timeFrame}.
In time each row will grow very large, because X can (very sparsely) go up to
2^28.
i) Does Cassandra load all columns every time it reads a row? Same question
for a super column.
ii) Similarly, does it cache all columns in memory?

Now some orders of magnitude: let's say a row is about 20KB and the cluster
is running smoothly on low-end servers, with millions of rows per node.
i) If I were to only issue gets on the key, what order of magnitude can I
expect to reach: 10/s, 100/s, 1000/s or 10,000/s?
ii) If I were to issue a slice on just the keys, does Cassandra optimize the
gets, or does it run every get on the server and then concatenate the results
to send to the client?
iii) Is slicing on the columns going to improve the time to get the data on
the server side, or does it just cut down on network traffic?

Thanks
Philippe


Re: History values

2010-04-14 Thread Jonathan Ellis
The closest is http://github.com/driftx/chiton

On Wed, Apr 14, 2010 at 2:57 AM, Yésica Rey yes...@gdtic.es wrote:
 Ok, thank you very much for your reply.
 I have another question that may seem stupid... Does Cassandra have a graphical
 console, like MySQL has for SQL databases?

 Regards!



Time-series data model

2010-04-14 Thread Jean-Pierre Bergamin
Hello everyone

We are currently evaluating a new DB system (replacing MySQL) to store
massive amounts of time-series data. The data are various metrics from
various network and IT devices and systems. Metrics could be, for example,
CPU usage of server "xy" in percent, memory usage of server "xy" in MB, ping
response time of server "foo" in milliseconds, network traffic of router
"bar" in MB/s, and so on. Different metrics can be collected for different
devices at different intervals.

The metrics are stored together with a timestamp. The queries we want to
perform are:
 * The last value of a specific metric of a device
 * The values of a specific metric of a device between two timestamps t1 and
t2

I stumbled across this blog post which describes a very similar setup with
Cassandra:
https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
This post gave me confidence that what we want is definitively doable with
Cassandra.

But since I'm just digging into columns and super-columns and their
families, I still have some problems understanding everything.

Our data model could look, in JSON-ish notation, like this:
{
    my_server_1: {
        cpu_usage: {
            {ts: 1271248215, value: 87 },
            {ts: 1271248220, value: 34 },
            {ts: 1271248225, value: 23 },
            {ts: 1271248230, value: 49 }
        }
        ping_response: {
            {ts: 1271248201, value: 0.345 },
            {ts: 1271248211, value: 0.423 },
            {ts: 1271248221, value: 0.311 },
            {ts: 1271248232, value: 0.582 }
        }
    }

    my_server_2: {
        cpu_usage: {
            {ts: 1271248215, value: 23 },
            ...
        }
        disk_usage: {
            {ts: 1271243451, value: 123445 },
            ...
        }
    }

    my_router_1: {
        bytes_in: {
            {ts: 1271243451, value: 2452346 },
            ...
        }
        bytes_out: {
            {ts: 1271243451, value: 13468 },
            ...
        }
        errors: {
            {ts: 1271243451, value: 24 },
            ...
        }
    }
}

What I don't get is how to create the two-level hierarchy [device][metric].

Am I right that the devices would be kept in a super column family? The
ordering of those is not important.

But the metrics per device are also a super column, where the columns would
be the metric values ({ts: 1271243451, value: 24 }), wouldn't they?

So I'd need a super column inside a super column... Hm.
My brain is definitely RDBMS-damaged and I don't see through columns and
super columns yet. :-)

How could this be modeled in Cassandra?


Thank you very much
James




Re: Time-series data model

2010-04-14 Thread Zhiguo Zhang
First of all, I am a newbie to NoSQL. I'll try to write down my opinions
for reference:

If I were you, I would use 2 column families:

1. CF, key is devices
2. CF, key is timeuuid

What do you think about that?

Mike


On Wed, Apr 14, 2010 at 3:02 PM, Jean-Pierre Bergamin ja...@ractive.chwrote:

 Hello everyone

 We are currently evaluating a new DB system (replacing MySQL) to store
 massive amounts of time-series data. The data are various metrics from
 various network and IT devices and systems. Metrics i.e. could be CPU usage
 of the server xy in percent, memory usage of server xy in MB, ping
 response time of server foo in milliseconds, network traffic of router
 bar in MB/s and so on. Different metrics can be collected for different
 devices in different intervals.

 The metrics are stored together with a timestamp. The queries we want to
 perform are:
  * The last value of a specific metric of a device
  * The values of a specific metric of a device between two timestamps t1
 and
 t2

 I stumbled across this blog post which describes a very similar setup with
 Cassandra:
 https://www.cloudkick.com/blog/2010/mar/02/4_months_with_cassandra/
 This post gave me confidence that what we want is definitively doable with
 Cassandra.

 But since I'm just digging into columns and super-columns and their
 families, I still have some problems understanding everything.

 Our data model could look in json'isch notation like this:
 {
 my_server_1: {
cpu_usage: {
{ts: 1271248215, value: 87 },
{ts: 1271248220, value: 34 },
{ts: 1271248225, value: 23 },
{ts: 1271248230, value: 49 }
}
ping_response: {
{ts: 1271248201, value: 0.345 },
{ts: 1271248211, value: 0.423 },
{ts: 1271248221, value: 0.311 },
{ts: 1271248232, value: 0.582 }
}
 }

 my_server_2: {
cpu_usage: {
{ts: 1271248215, value: 23 },
...
}
disk_usage: {
{ts: 1271243451, value: 123445 },
...
}
 }

 my_router_1: {
bytes_in: {
{ts: 1271243451, value: 2452346 },
...
}
bytes_out: {
{ts: 1271243451, value: 13468 },
...
}
errors: {
{ts: 1271243451, value: 24 },
...
}
 }
 }

 What I don't get is how to created the two level hierarchy
 [device][metric].

 Am I right that the devices would be kept in a super column family? The
 ordering of those is not important.

 But the metrics per device are also a super column, where the columns would
 be the metric values ({ts: 1271243451, value: 24 }), isn't it?

 So I'd need a super column in a super column... Hm.
 My brain is definitively RDBMS-damaged and I don't see through columns and
 super-columns yet. :-)

 How could this be modeled in Cassandra?


 Thank you very much
 James





Re: Time-series data model

2010-04-14 Thread Ted Zlatanov
On Wed, 14 Apr 2010 15:02:29 +0200 Jean-Pierre Bergamin ja...@ractive.ch 
wrote: 

JB The metrics are stored together with a timestamp. The queries we want to
JB perform are:
JB  * The last value of a specific metric of a device
JB  * The values of a specific metric of a device between two timestamps t1 and
JB t2

Make your key devicename-metricname-MMDD-HHMM (with whatever time
sharding makes sense to you; I use UTC by-hours and by-day in my
environment).  Then your supercolumn is the collection time as a
LongType and your columns inside the supercolumn can express the metric
in detail (collector agent, detailed breakdown, etc.).

If you want your clients to discover the available metrics, you may need
to keep an external index.  But from your spec that doesn't seem necessary.
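
In code, a minimal sketch of this layout against the 0.6 Thrift API in Java
(untested; the keyspace "Keyspace1", a super column family "Metrics", the
host/port and the inner "value" column name are placeholder assumptions, not
from this thread). The supercolumn name is the collection time packed into
8 bytes so a LongType comparator keeps samples ordered, and the t1..t2 query
becomes a slice over supercolumn names of one row:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class MetricStore {
    // LongType names: pack the collection time into 8 big-endian bytes.
    static byte[] longBytes(long v) { return ByteBuffer.allocate(8).putLong(v).array(); }

    public static void main(String[] args) throws Exception {
        TTransport tr = new TSocket("localhost", 9160);             // placeholder host/port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();

        // Row key sharded as devicename-metricname-<time bucket>, as suggested above.
        String rowKey = "my_server_1-cpu_usage-20100414-05";
        long collectedAt = System.currentTimeMillis();

        // Write one sample: supercolumn name = collection time, inner column = detail.
        ColumnPath path = new ColumnPath();
        path.setColumn_family("Metrics");                           // placeholder super CF
        path.setSuper_column(longBytes(collectedAt));
        path.setColumn("value".getBytes("UTF-8"));                  // placeholder inner column
        client.insert("Keyspace1", rowKey, path, "87".getBytes("UTF-8"),
                      collectedAt, ConsistencyLevel.QUORUM);

        // Query "values between t1 and t2": slice the supercolumn names of that row.
        SliceRange range = new SliceRange();
        range.setStart(longBytes(collectedAt - 3600 * 1000L));      // t1
        range.setFinish(longBytes(collectedAt));                    // t2
        range.setReversed(false);
        range.setCount(1000);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(range);
        ColumnParent parent = new ColumnParent();
        parent.setColumn_family("Metrics");
        List<ColumnOrSuperColumn> samples =
            client.get_slice("Keyspace1", rowKey, parent, predicate, ConsistencyLevel.QUORUM);
        System.out.println("samples in range: " + samples.size());
        tr.close();
    }
}

The "last value of a metric" query is a similar slice with reversed=true,
count=1 and empty start/finish.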

Ted



Re: Reading thousands of columns

2010-04-14 Thread Gautam Singaraju
Yes, I find that get_range_slices takes an incredibly long time to return
the results.
---
Gautam



On Tue, Apr 13, 2010 at 2:00 PM, James Golick jamesgol...@gmail.com wrote:
 Hi All,
 I'm seeing about 35-50ms to read 1000 columns from a CF using
 get_range_slices. The columns are TimeUUIDType with empty values.
 The row cache is enabled and I'm running the query 500 times in a row, so I
 can only assume the row is cached.
 Is that about what's expected or am I doing something wrong? (It's from java
 this time, so it's not ruby thrift being slow).
 - James


Re: Reading thousands of columns

2010-04-14 Thread Jonathan Ellis
35-50ms for how many rows of 1000 columns each?

get_range_slices does not use the row cache, for the same reason that
oracle doesn't cache tuples from sequential scans -- blowing away
1000s of rows worth of recently used rows queried by key, for a swath
of rows from the scan, is the wrong call more often than it is the
right one.

On Tue, Apr 13, 2010 at 1:00 PM, James Golick jamesgol...@gmail.com wrote:
 Hi All,
 I'm seeing about 35-50ms to read 1000 columns from a CF using
 get_range_slices. The columns are TimeUUIDType with empty values.
 The row cache is enabled and I'm running the query 500 times in a row, so I
 can only assume the row is cached.
 Is that about what's expected or am I doing something wrong? (It's from java
 this time, so it's not ruby thrift being slow).
 - James


Re: [RELEASE] 0.6.0

2010-04-14 Thread Ted Zlatanov
On Tue, 13 Apr 2010 15:54:39 -0500 Eric Evans eev...@rackspace.com wrote: 

EE I leaned into it. An updated package has been uploaded to the Cassandra
EE repo (see: http://wiki.apache.org/cassandra/DebianPackaging).

Thank you for providing the release to the repository.

Can it support a non-root user through /etc/default/cassandra?  I've
been patching the init script myself but was hoping this would be
standard.

Thanks
Ted



KeysCached and sstable

2010-04-14 Thread Paul Prescod
The inline docs say:

   ~ The optional KeysCached attribute specifies
   ~ the number of keys per sstable whose locations we keep in
   ~ memory in mostly LRU order.

There are a few confusing bits in that sentence.

 1. Why is it keys per sstable rather than keys per column family? If
I have 7 SSTable files and I set KeysCached to 1, will I have
7 keys cached? If so, why? What is the logical relationship here?

 2. What makes the algorithm mostly LRU rather than just LRU?

 3. Is it accurate to say that the goal of the key cache is to avoid
looking through a bunch of SSTables' Bloom filters? (How big do the
Bloom filters grow to... too big to be cached themselves?)

I'd like to document the details.

 Paul Prescod


Re: Reading thousands of columns

2010-04-14 Thread James Golick
Right - that makes sense. I'm only fetching one row. I'll give it a try with
get_slice().

Thanks,

-James

On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis jbel...@gmail.com wrote:

 35-50ms for how many rows of 1000 columns each?

 get_range_slices does not use the row cache, for the same reason that
 oracle doesn't cache tuples from sequential scans -- blowing away
 1000s of rows worth of recently used rows queried by key, for a swath
 of rows from the scan, is the wrong call more often than it is the
 right one.

 On Tue, Apr 13, 2010 at 1:00 PM, James Golick jamesgol...@gmail.com
 wrote:
  Hi All,
  I'm seeing about 35-50ms to read 1000 columns from a CF using
  get_range_slices. The columns are TimeUUIDType with empty values.
  The row cache is enabled and I'm running the query 500 times in a row, so
 I
  can only assume the row is cached.
  Is that about what's expected or am I doing something wrong? (It's from
 java
  this time, so it's not ruby thrift being slow).
  - James



Re: Lucandra or some way to query

2010-04-14 Thread Eric Evans
On Wed, 2010-04-14 at 06:45 -0300, Jesus Ibanez wrote:
 Option 1 - insert data in all different ways I need in order to be
 able to query?

Rolling your own indexes is fairly common with Cassandra.

 Option 2 - implement Lucandra? Can you link me to a blog or an article
 that guides me on how to implement Lucandra?

I would recommend you explore this route a little further. I've never
used Lucandra so I can't be of help, but the author is active. Have you
tried submitting an issue on the github project page?

 Option 3 - switch to an SQL database? (I hope not). 

If your requirements can be met with an SQL database, then sure, why
not?

-- 
Eric Evans
eev...@rackspace.com



Re: [RELEASE] 0.6.0

2010-04-14 Thread Eric Evans
On Wed, 2010-04-14 at 10:16 -0500, Ted Zlatanov wrote:
 Can it support a non-root user through /etc/default/cassandra?  I've
 been patching the init script myself but was hoping this would be
 standard. 

It's the first item on debian/TODO, but, you know, patches welcome and
all that.

-- 
Eric Evans
eev...@rackspace.com



Re: Reading thousands of columns

2010-04-14 Thread Mike Malone
On Wed, Apr 14, 2010 at 7:45 AM, Jonathan Ellis jbel...@gmail.com wrote:

 35-50ms for how many rows of 1000 columns each?

 get_range_slices does not use the row cache, for the same reason that
 oracle doesn't cache tuples from sequential scans -- blowing away
 1000s of rows worth of recently used rows queried by key, for a swath
 of rows from the scan, is the wrong call more often than it is the
 right one.


Couldn't you cache a list of keys that were returned for the key range, then
cache individual rows separately or not at all?

By blowing away rows queried by key I'm guessing you mean pushing them
out of the LRU cache, not explicitly blowing them away? Either way I'm not
entirely convinced. In my experience I've had pretty good success caching
items that were pulled out via more complicated join / range type queries.
If your system is doing lots of range queries, and not a lot of lookups by
key, you'd obviously see a performance win from caching the range queries.
Maybe range scan caching could be turned on separately?

Mike


Re: History values

2010-04-14 Thread Paul Prescod
If you want to use Cassandra, you should probably store each
historical value as a new column in the row.
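
In code, that could look something like this minimal sketch against the 0.6
Thrift API in Java (untested; the "History" column family, keyspace
"Keyspace1", key "item-42" and host/port are placeholder assumptions): the
column name is the version timestamp, so every update lands in a new column
instead of overwriting the old value.

import java.nio.ByteBuffer;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class VersionedWrite {
    public static void main(String[] args) throws Exception {
        TTransport tr = new TSocket("localhost", 9160);          // placeholder host/port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();

        // Each update becomes a NEW column named by its version timestamp.
        long version = System.currentTimeMillis();
        ColumnPath path = new ColumnPath();
        path.setColumn_family("History");                         // placeholder CF, LongType comparator assumed
        path.setColumn(ByteBuffer.allocate(8).putLong(version).array());
        client.insert("Keyspace1", "item-42", path,
                      "19.90".getBytes("UTF-8"), version, ConsistencyLevel.QUORUM);
        tr.close();
    }
}

Reading the full history back is then a get_slice over that row; a reversed
slice with count=1 returns only the latest value.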

On Wed, Apr 14, 2010 at 12:34 AM, Yésica Rey yes...@gdtic.es wrote:
 I am new to using Cassandra. From the documentation I have read, I understand
 that, as in other non-document databases, when the value of a key-value tuple
 is updated, the new value is stored with a different timestamp without
 entirely losing the old value.
 I wonder how I can restore the historical values that a particular field
 has had.
 Greetings and thanks



Re: Reading thousands of columns

2010-04-14 Thread Paul Prescod
On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone m...@simplegeo.com wrote:
 ...

 Couldn't you cache a list of keys that were returned for the key range, then
 cache individual rows separately or not at all?
 By blowing away rows queried by key I'm guessing you mean pushing them
 out of the LRU cache, not explicitly blowing them away? Either way I'm not
 entirely convinced. In my experience I've had pretty good success caching
 items that were pulled out via more complicated join / range type queries.
 If your system is doing lots of range quereis, and not a lot of lookups by
 key, you'd obviously see a performance win from caching the range queries.
 Maybe range scan caching could be turned on separately?

I agree with you that the caches should be separate, if you're going
to cache ranges. You could imagine a single query (perhaps entered
interactively) replacing the entire row cache, i.e. all of the data cached
for the system's interactive users. For example, a summary page of who
has been most active over the last month could replace the profile
information for the actual users who are using the system at that
moment.

 Paul Prescod


Re: Reading thousands of columns

2010-04-14 Thread James Golick
The values are empty. It's 3000 UUIDs.

On Wed, Apr 14, 2010 at 12:40 PM, Avinash Lakshman 
avinash.laksh...@gmail.com wrote:

 How large are the values? How much data on disk?

 On Wednesday, April 14, 2010, James Golick jamesgol...@gmail.com wrote:
  Just for the record, I am able to repeat this locally.
  I'm seeing around 150ms to read 1000 columns from a row that has 3000 in
 it. If I enable the rowcache, that goes down to about 90ms. According to my
 profile, 90% of the time is being spent waiting for cassandra to respond, so
 it's not thrift.
 
  On Wed, Apr 14, 2010 at 11:01 AM, Paul Prescod pres...@gmail.com
 wrote:
 
  On Wed, Apr 14, 2010 at 10:31 AM, Mike Malone m...@simplegeo.com
 wrote:
  ...
 
  Couldn't you cache a list of keys that were returned for the key range,
 then
  cache individual rows separately or not at all?
  By blowing away rows queried by key I'm guessing you mean pushing
 them
  out of the LRU cache, not explicitly blowing them away? Either way I'm
 not
  entirely convinced. In my experience I've had pretty good success
 caching
  items that were pulled out via more complicated join / range type
 queries.
  If your system is doing lots of range quereis, and not a lot of lookups
 by
  key, you'd obviously see a performance win from caching the range
 queries.
  Maybe range scan caching could be turned on separately?
 
  I agree with you that the caches should be separate, if you're going
  to cache ranges. You could imagine a single query (perhaps entered
  interactively) would replace the entire row caching all of the data
  for the systems' interactive users. For example, a summary page of who
  is most over the last month active could replace the profile
  information for the actual users who are using the system at that
  moment.
 
   Paul Prescod
 
 
 



Re: Lucandra or some way to query

2010-04-14 Thread Jake Luciani
Hi,

What doesn't work with lucandra exactly?  Feel free to msg me.

-Jake

On Wed, Apr 14, 2010 at 9:30 PM, Jesus Ibanez jesusiba...@gmail.com wrote:

 I will explore Lucandra a little more and if I can't get it to work today,
 I will go for Option 2.
 Using SQL will not be efficient in the future, if my website grows.

 Thanks for your answer, Eric!

 Jesús.


 2010/4/14 Eric Evans eev...@rackspace.com

 On Wed, 2010-04-14 at 06:45 -0300, Jesus Ibanez wrote:
  Option 1 - insert data in all different ways I need in order to be
  able to query?

 Rolling your own indexes is fairly common with Cassandra.

  Option 2 - implement Lucandra? Can you link me to a blog or an article
  that guides me on how to implement Lucandra?

 I would recommend you explore this route a little further. I've never
 used Lucandra so I can't be of help, but the author is active. Have you
 tried submitting an issue on the github project page?

  Option 3 - switch to an SQL database? (I hope not).

 If your requirements can be met with an SQL database, then sure, why
 not?

 --
 Eric Evans
 eev...@rackspace.com





Re: Lucandra or some way to query

2010-04-14 Thread HubertChang

If you worked with Lucandra in a dedicated search-purposed cluster, you
could balance the data very well with some effort.
I think Lucandra is really a great idea, but since it needs the
order-preserving partitioner, does that mean there may be some 'hot-spots'
during searching?
-- 
View this message in context: 
http://n2.nabble.com/Lucandra-or-some-way-to-query-tp4900727p4905149.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Ken Sandney
Large files can be split into small blocks, and the block size can be
tuned. It may increase the complexity of writing such a file system, but it
can then be general-purpose (not only for relatively small files).
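
As a rough illustration, a minimal sketch of such block-splitting against
the 0.6 Thrift API in Java (untested; the "FileChunks" column family,
keyspace "Keyspace1", row key, block size and host/port are placeholder
assumptions, not from this thread): one row per file, one column per block,
with the block number as the column name.

import java.nio.ByteBuffer;
import java.util.Arrays;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class ChunkedWriter {
    static final int BLOCK_SIZE = 64 * 1024;                       // tunable block size (placeholder value)

    public static void main(String[] args) throws Exception {
        TTransport tr = new TSocket("localhost", 9160);            // placeholder host/port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();

        byte[] file = new byte[200 * 1024];                        // stand-in for real file contents
        String fileKey = "file:/backups/example.bin";              // placeholder row key

        // One column per block, named by its block number.
        for (int block = 0; block * BLOCK_SIZE < file.length; block++) {
            int from = block * BLOCK_SIZE;
            int to = Math.min(from + BLOCK_SIZE, file.length);
            ColumnPath path = new ColumnPath();
            path.setColumn_family("FileChunks");                   // placeholder CF
            path.setColumn(ByteBuffer.allocate(4).putInt(block).array());
            client.insert("Keyspace1", fileKey, path,
                          Arrays.copyOfRange(file, from, to),
                          System.currentTimeMillis(), ConsistencyLevel.QUORUM);
        }
        tr.close();
    }
}

Reading a byte range of the file then maps to a slice over the block-number
columns; file metadata (size, path, modification time) could live in a
separate column family, along the lines discussed below.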

On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta tsalora...@gmail.comwrote:

 On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi bluefl...@gmail.com wrote:
  Hi,
  Cassandra has a good distributed model: decentralized, auto-partition,
  auto-recovery. I am evaluating about writing a file system over Cassandra
  (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
  Cassandra is good at such use case?

 It sort of depends on what you are looking for. From use case for
 which something like S3 is good, yes, except with one difference:
 Cassandra is more geared towards lots of small files, whereas S3 is
 more geared towards moderate number of files (possibly large).

 So I think it can definitely be a good use case, and I may use
 Cassandra for this myself in future. Having range queries allows
 implementing directory/path structures (list keys using path as
 prefix). And you can split storage such that metadata could live in
 OPP partition, raw data in RP.

 -+ Tatu +-



Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Miguel Verde
On Wed, Apr 14, 2010 at 9:15 PM, Ken Sandney bluefl...@gmail.com wrote:

 Large files can be split into small blocks, and the size of block can be
 tuned. It may increase the complexity of writing such a file system, but can
 be for general purpose (not only for relative small files)


 Right, this is the path that MongoDB has taken with GridFS:
http://www.mongodb.org/display/DOCS/GridFS+Specification

I don't have any use for such a filesystem, but if I were to design one I
would probably mostly follow Tatu's suggestions:


  On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta tsalora...@gmail.comwrote:

 So I think it can definitely be a good use case, and I may use
 Cassandra for this myself in future. Having range queries allows
 implementing directory/path structures (list keys using path as
 prefix). And you can split storage such that metadata could live in
 OPP partition, raw data in RP.


but using OPP for all data, using prefixed metadata, and UUID_chunk# for
keys in the chunk CF.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Avinash Lakshman
Exactly. You can split a file into blocks of any size and you can actually
distribute the metadata across a large set of machines. You wouldn't have
the issue of having small files in this approach. The issue may be the
eventual consistency - I am not sure that is a paradigm that would be
acceptable for a file system. But that is a discussion for another time/day.

Avinash

On Wed, Apr 14, 2010 at 7:15 PM, Ken Sandney bluefl...@gmail.com wrote:

 Large files can be split into small blocks, and the size of block can be
 tuned. It may increase the complexity of writing such a file system, but can
 be for general purpose (not only for relative small files)


 On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta tsalora...@gmail.comwrote:

 On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi bluefl...@gmail.com wrote:
  Hi,
  Cassandra has a good distributed model: decentralized, auto-partition,
  auto-recovery. I am evaluating about writing a file system over
 Cassandra
  (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
  Cassandra is good at such use case?

 It sort of depends on what you are looking for. From use case for
 which something like S3 is good, yes, except with one difference:
 Cassandra is more geared towards lots of small files, whereas S3 is
 more geared towards moderate number of files (possibly large).

 So I think it can definitely be a good use case, and I may use
 Cassandra for this myself in future. Having range queries allows
 implementing directory/path structures (list keys using path as
 prefix). And you can split storage such that metadata could live in
 OPP partition, raw data in RP.

 -+ Tatu +-





Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Avinash Lakshman
OPP is not required here. You would be better off using a Random partitioner
because you want to get a random distribution of the metadata.

Avinash

On Wed, Apr 14, 2010 at 7:25 PM, Avinash Lakshman 
avinash.laksh...@gmail.com wrote:

 Exactly. You can split a file into blocks of any size and you can actually
 distribute the metadata across a large set of machines. You wouldn't have
 the issue of having small files in this approach. The issue maybe the
 eventual consistency - not sure that is a paradigm that would be acceptable
 for a file system. But that is a discussion for another time/day.

 Avinash

 On Wed, Apr 14, 2010 at 7:15 PM, Ken Sandney bluefl...@gmail.com wrote:

 Large files can be split into small blocks, and the size of block can be
 tuned. It may increase the complexity of writing such a file system, but can
 be for general purpose (not only for relative small files)


 On Thu, Apr 15, 2010 at 10:08 AM, Tatu Saloranta tsalora...@gmail.comwrote:

 On Wed, Apr 14, 2010 at 6:42 PM, Zhuguo Shi bluefl...@gmail.com wrote:
  Hi,
  Cassandra has a good distributed model: decentralized, auto-partition,
  auto-recovery. I am evaluating about writing a file system over
 Cassandra
  (like CassFS: http://github.com/jdarcy/CassFS ), but I don't know if
  Cassandra is good at such use case?

 It sort of depends on what you are looking for. From use case for
 which something like S3 is good, yes, except with one difference:
 Cassandra is more geared towards lots of small files, whereas S3 is
 more geared towards moderate number of files (possibly large).

 So I think it can definitely be a good use case, and I may use
 Cassandra for this myself in future. Having range queries allows
 implementing directory/path structures (list keys using path as
 prefix). And you can split storage such that metadata could live in
 OPP partition, raw data in RP.

 -+ Tatu +-






Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Miguel Verde
On Wed, Apr 14, 2010 at 9:26 PM, Avinash Lakshman 
avinash.laksh...@gmail.com wrote:

 OPP is not required here. You would be better off using a Random
 partitioner because you want to get a random distribution of the metadata.


Not required, certainly.  However, it strikes me that 1 cluster is better
than 2, and most consumers of a filesystem would expect to be able to get an
ordered listing or tree of the metadata which is easy using the OPP row key
pattern listed previously.  You could still do this with the Random
partitioner using column names in rows to describe the structure but the
current compaction limitations could be an issue if a branch becomes too
large, and you'd still have a root row hotspot (at least in the schema which
comes to mind).


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread HubertChang

Note: there are GlusterFS, Ceph, Btrfs and Lustre. There is also DRBD.
-- 
View this message in context: 
http://n2.nabble.com/Is-that-possible-to-write-a-file-system-over-Cassandra-tp4905111p4905312.html
Sent from the cassandra-u...@incubator.apache.org mailing list archive at 
Nabble.com.


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Michael Greene
On Wed, Apr 14, 2010 at 11:01 PM, Ken Sandney bluefl...@gmail.com wrote:

  a fuse based FS maybe better I guess


This has been done, for better or worse, by jdarcy of http://pl.atyp.us/:
http://github.com/jdarcy/CassFS


Re: Is that possible to write a file system over Cassandra?

2010-04-14 Thread Ken Sandney
I tried CassFS, but it's not stable yet; it may be a good prototype to start from.

On Thu, Apr 15, 2010 at 12:15 PM, Michael Greene
michael.gre...@gmail.comwrote:

 On Wed, Apr 14, 2010 at 11:01 PM, Ken Sandney bluefl...@gmail.com wrote:

  a fuse based FS maybe better I guess


 This has been done, for better or worse, by jdarcy of http://pl.atyp.us/:
 http://github.com/jdarcy/CassFS



Re: Starting Cassandra Fauna

2010-04-14 Thread Nirmala Agadgar
Hi,

I want to insert data into Cassandra programmatically in a loop.
Also, I'm a newbie to the Linux world and GitHub. I started working on Linux
for the sole reason of implementing Cassandra, and have been digging into
Cassandra for the last week. How do I insert data into Cassandra and test it?
Can anyone help me out on this?

-
Nimala


Re: Starting Cassandra Fauna

2010-04-14 Thread richard yao
try this
https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP



On Thu, Apr 15, 2010 at 12:23 PM, Nirmala Agadgar nirmala...@gmail.comwrote:

 Hi,

 I want to insert data into Cassandra programmatically in a loop.
 Also  i'm a newbie to Linux world and Github. Started to work on Linux  for
 only reason to implement Cassandra.Digging Cassandra for last on week.How to
 insert data in cassandra and test it?
 Can anyone help me out on this?

 -
 Nimala



Re: Starting Cassandra Fauna

2010-04-14 Thread Paul Prescod
There is a tutorial here:

 * http://www.sodeso.nl/?p=80

This page includes data inserts:

 * http://www.sodeso.nl/?p=251

Like:

c.setColumn(new Column("email".getBytes("utf-8"),
"ronald@sodeso.nl".getBytes("utf-8"), timestamp));
columns.add(c);

The Sample code is attached to that blog post.
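
For the "insert programmatically in a loop and test it" part of the
question, a minimal end-to-end sketch against the raw 0.6 Thrift API in Java
(untested; it assumes a node listening on localhost:9160 with the default
Keyspace1/Standard1 from storage-conf.xml, and the "greeting" column name is
a placeholder):

import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class LoopInsert {
    public static void main(String[] args) throws Exception {
        TTransport tr = new TSocket("localhost", 9160);        // default Thrift port
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(tr));
        tr.open();

        ColumnPath path = new ColumnPath();
        path.setColumn_family("Standard1");                    // default CF in 0.6 storage-conf.xml
        path.setColumn("greeting".getBytes("UTF-8"));

        // Insert 100 rows in a loop.
        for (int i = 0; i < 100; i++) {
            client.insert("Keyspace1", "row-" + i, path,
                          ("hello " + i).getBytes("UTF-8"),
                          System.currentTimeMillis(), ConsistencyLevel.ONE);
        }

        // Read one back to verify the writes landed.
        ColumnOrSuperColumn cosc =
            client.get("Keyspace1", "row-7", path, ConsistencyLevel.ONE);
        System.out.println(new String(cosc.getColumn().getValue(), "UTF-8"));
        tr.close();
    }
}

Higher-level clients (the fauna gem for Ruby, the PHP wrappers) ultimately
wrap these same Thrift calls; only the client setup differs.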

On Wed, Apr 14, 2010 at 9:23 PM, Nirmala Agadgar nirmala...@gmail.com wrote:
 Hi,

 I want to insert data into Cassandra programmatically in a loop.
 Also  i'm a newbie to Linux world and Github. Started to work on Linux  for
 only reason to implement Cassandra.Digging Cassandra for last on week.How to
 insert data in cassandra and test it?
 Can anyone help me out on this?

 -
 Nimala



Re: Starting Cassandra Fauna

2010-04-14 Thread Nirmala Agadgar
Hi,

I'm using the Ruby client as of now. Can you give details for the Ruby client,
and also, if possible, for the Java client?
Thanks for the reply.

-
Nirmala

On Thu, Apr 15, 2010 at 10:02 AM, richard yao richard.yao2...@gmail.comwrote:

 try this
 https://wiki.fourkitchens.com/display/PF/Using+Cassandra+with+PHP




 On Thu, Apr 15, 2010 at 12:23 PM, Nirmala Agadgar nirmala...@gmail.comwrote:

 Hi,

 I want to insert data into Cassandra programmatically in a loop.
 Also  i'm a newbie to Linux world and Github. Started to work on Linux
 for only reason to implement Cassandra.Digging Cassandra for last on
 week.How to insert data in cassandra and test it?
 Can anyone help me out on this?

 -
 Nimala





TException: Error: TSocket: timed out reading 1024 bytes from 10.1.1.27:9160

2010-04-14 Thread richard yao
I am trying out Cassandra, and I use PHP to access Cassandra via the Thrift
API.
I got an error like this:
TException:  Error: TSocket: timed out reading 1024 bytes from
10.1.1.27:9160
What's wrong?
Thanks.


Re: Lucandra or some way to query

2010-04-14 Thread Jake Luciani
Lucandra spreads the data randomly by index + field combination so you do
get some distribution for free. Otherwise you can use nodetool
loadbalance to alter the token ring to alleviate hotspots.

On Thu, Apr 15, 2010 at 2:04 AM, HubertChang hui...@gmail.com wrote:


 If you worked with Lucandra in a dedicated searching-purposed cluster, you
 could balanced the data very well with some effort.
 I think Lucandra is really a great idea, but since it needs
 order-preserving-partitioner, does that mean there may be some 'hot-spot'
 during searching?
 --
 View this message in context:
 http://n2.nabble.com/Lucandra-or-some-way-to-query-tp4900727p4905149.html
 Sent from the cassandra-u...@incubator.apache.org mailing list archive at
 Nabble.com.