The example I gave was for N=1; if we need to save more values, I planned to just add more columns.
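For illustration, an N-column variant of the overwrite schema might look something like the sketch below for N=3. The table name, the numbered slot columns, and the rule for choosing which slot to overwrite are all assumptions made for this example; none of this was part of the test described in the quoted message.

```
-- Hypothetical N=3 variant of the overwrite schema.
-- Table and column names are illustrative only.
CREATE TABLE eventvalue_overwrite_n3 (
    system_name   text,
    event_name    text,
    event_time_0  timestamp,
    event_value_0 blob,
    event_time_1  timestamp,
    event_value_1 blob,
    event_time_2  timestamp,
    event_value_2 blob,
    PRIMARY KEY (system_name, event_name)
);

-- The writer has to decide which slot to overwrite, for example
-- slot = sequence_number % 3 (an assumed rule, not from the thread).
-- This write targets slot 1 and leaves slots 0 and 2 untouched.
UPDATE eventvalue_overwrite_n3
SET event_time_1 = '2016-06-09 00:51:00+0000',
    event_value_1 = 0xCAFE
WHERE system_name = 'sys-a' AND event_name = 'temperature';
```

Reading the last N values is then a single-partition read of one row, at the cost of the writer tracking which slot is oldest.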

On Thu, Jun 9, 2016 at 12:51 AM, kurt Greaves <k...@instaclustr.com> wrote:

> I would say it's probably due to a significantly larger number of
> partitions when using the overwrite method - but really you should be
> seeing similar performance unless one of the schemas ends up generating a
> lot more disk IO.
> If you're planning to read the last N values for an event at the same time
> the widerow schema would be better, otherwise reading N events using the
> overwrite schema will result in you hitting N partitions. You really need
> to take into account how you're going to read the data when you design a
> schema, not only how many writes you can push through.
>
> On 8 June 2016 at 19:02, John Thomas <jthom...@gmail.com> wrote:
>
>> We have a use case where we are storing event data for a given system and
>> only want to retain the last N values. Storing extra values for some time,
>> as long as it isn’t too long, is fine but never less than N. We can't use
>> TTLs to delete the data because we can't be sure how frequently events will
>> arrive and could end up losing everything. Is there any built in mechanism
>> to accomplish this or a known pattern that we can follow? The events will
>> be read and written at a pretty high frequency so the solution would have
>> to be performant and not fragile under stress.
>>
>> We’ve played with a schema that just has N distinct columns with one
>> value in each but have found overwrites seem to perform much poorer than
>> wide rows. The use case we tested only required we store the most recent
>> value:
>>
>> CREATE TABLE eventyvalue_overwrite(
>>     system_name text,
>>     event_name text,
>>     event_time timestamp,
>>     event_value blob,
>>     PRIMARY KEY (system_name,event_name))
>>
>> CREATE TABLE eventvalue_widerow (
>>     system_name text,
>>     event_name text,
>>     event_time timestamp,
>>     event_value blob,
>>     PRIMARY KEY ((system_name, event_name), event_time))
>>     WITH CLUSTERING ORDER BY (event_time DESC)
>>
>> We tested it against the DataStax AMI on EC2 with 6 nodes, replication 3,
>> write consistency 2, and default settings with a write only workload and
>> got 190K/s for wide row and 150K/s for overwrite. Thinking through the
>> write path it seems the performance should be pretty similar, with probably
>> smaller sstables for the overwrite schema, can anyone explain the big
>> difference?
>>
>> The wide row solution is more complex in that it requires a separate
>> clean up thread that will handle deleting the extra values. If that’s the
>> path we have to follow we’re thinking we’d add a bucket of some sort so
>> that we can delete an entire partition at a time after copying some values
>> forward, on the assumption that deleting the whole partition is much better
>> than deleting some slice of the partition. Is that true? Also, is there
>> any difference between setting a really short ttl and doing a delete?
>>
>> I know there are a lot of questions in there but we’ve been going back
>> and forth on this for a while and I’d really appreciate any help you could
>> give.
>>
>> Thanks,
>> John
>
> --
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com