Well, a DELETE will not free up disk space until gc_grace_seconds has passed and 
the next major compaction has run. So in essence, if you need to free up space 
right away, then creating daily/monthly tables and dropping them as they age out 
would be one way to go.  Just remember to clear your snapshots after dropping, though.
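
Something along these lines, for example -- just a rough sketch using the DataStax 
Java driver, with a made-up audit_YYYYMMDD naming scheme (the drop could equally be 
done from cqlsh, and the snapshot cleanup is a nodetool step):

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class DropOldAuditTable {
    public static void main(String[] args) {
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect("audit_ks");   // hypothetical keyspace name

        // Disk is only reclaimed once the auto-snapshot taken by the DROP is
        // cleared as well, e.g.:  nodetool clearsnapshot audit_ks
        session.execute("DROP TABLE IF EXISTS audit_20140101");  // hypothetical table name

        cluster.close();
    }
}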



On June 4, 2014 at 1:54:05 PM, Redmumba (redmu...@gmail.com) wrote:

That still involves quite a bit of infrastructure work--it also means that to 
read the data, I would have to make N queries, one per table, to pull the audit 
information (audit information is keyed by an identifier for the item, and then 
sorted by date).  I don't think this would yield any benefit (to me) over 
simply tombstoning the values, or creating a secondary index on date and simply 
doing a DELETE, right?
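
Just to illustrate the fan-out I mean (rough sketch with the Java driver; the table 
and column names are made up):

import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;
import java.util.ArrayList;
import java.util.List;

public class AuditFanout {
    // One query per daily table, merged client-side -- N tables means N round trips.
    public static List<Row> fetchAudit(Session session, String itemId, List<String> days) {
        List<Row> merged = new ArrayList<Row>();
        for (String day : days) {   // e.g. "20140101", "20140102", ...
            ResultSet rs = session.execute(
                "SELECT * FROM audit_" + day + " WHERE item_id = ?", itemId);
            merged.addAll(rs.all());
        }
        return merged;
    }
}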

Is there something internally preventing me from implementing this as a 
separate Strategy?


On Wed, Jun 4, 2014 at 10:47 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:
I'd suggest creating 1 table per day, and dropping the tables you don't need 
once you're done.


On Wed, Jun 4, 2014 at 10:44 AM, Redmumba <redmu...@gmail.com> wrote:
Sorry, yes, that is what I was looking to do--i.e., create a 
"TopologicalCompactionStrategy" or similar.


On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry <rbradbe...@gmail.com> wrote:
Maybe I’m misunderstanding something, but what makes you think that running a 
major compaction every day will cause the data from January 1st to exist in 
only one SSTable, without data from other days in that SSTable as well? Are 
you talking about making a new compaction strategy that creates SSTables by day?



On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote:

Let's say I run a major compaction every day, so that the "oldest" sstable 
contains only the data for January 1st.  Assuming all the nodes are in-sync and 
have had at least one repair run before the sstable is dropped (so that all 
information for that time period is "the same"), wouldn't it be safe to assume 
that the same data would be dropped on all nodes?  There might be a period while 
the compaction is running where different nodes have an inconsistent view 
of just that day's data (in that some would have it and others would not), but 
the cluster would still function and become eventually consistent, correct?

Also, if the entirety of the sstable is being dropped, wouldn't the tombstones 
be removed with it?  I wouldn't be concerned with individual rows and columns; 
this is a write-only table, more or less--the only deletes that occur in 
the current system are to remove the old data.


On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry <rbradbe...@gmail.com> wrote:
I’m not sure what you want to do is feasible.  At a high level I can see you 
running into issues with replication factor, etc.  The SSTables are not identical 
from node to node, so if you drop a full SSTable on one node there is no single 
corresponding SSTable on the adjacent nodes to drop.  You would need to choose 
data to compact out, and ensure it is removed on all replicas as well.  But if 
your problem is that you’re low on disk space, then you probably won’t be able 
to write out a new SSTable with the older information compacted out.  Also, there 
is more to an SSTable than just data; the SSTable could have tombstones and other 
relics that haven’t been cleaned up from nodes coming or going.




On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote:

Thanks, Russell--yes, a similar concept, just applied to sstables.  I'm 
assuming this would require changes to both major compaction and probably GC 
(to remove the old sstables), but since I'm not super-familiar with the C* 
internals, I wanted to make sure it was feasible with the current toolset 
before I actually dove in and started tinkering.

Andrew


On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry <rbradbe...@gmail.com> wrote:
hmm, I see. So something similar to Capped Collections in MongoDB.



On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote:

Not quite; if I'm at, say, 90% disk usage, I'd like to drop the oldest sstable 
rather than simply run out of space.

The problem with using TTLs is that I have to try to guess how much data is 
being put in--since this is auditing data, the volume can vary wildly depending 
on time of year, verbosity of auditing, etc.  I'd like to maximize the use of the 
disk space--not optimize the cleanup process.
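
For example (sketch only; the 60-day figure and the table/column names are just a 
guess to illustrate the problem):

import com.datastax.driver.core.Session;
import java.util.Date;

public class AuditWriter {
    // The TTL has to be picked up front, even though the write rate -- and so the
    // disk footprint of 60 days of audit data -- isn't knowable in advance.
    public static void writeAudit(Session session, String itemId, Date ts, String details) {
        session.execute(
            "INSERT INTO audit (item_id, ts, details) VALUES (?, ?, ?) USING TTL 5184000", // 60 days, guessed
            itemId, ts, details);
    }
}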

Andrew


On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry <rbradbe...@gmail.com> wrote:
You mean this:

https://issues.apache.org/jira/browse/CASSANDRA-5228

?



On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote:

Good morning!

I've asked (and seen other people ask) about the ability to drop old sstables, 
basically creating a FIFO-like clean-up process.  Since we're using Cassandra 
as an auditing system, this is particularly appealing to us because it means we 
can maximize the amount of auditing data we can keep while still allowing 
Cassandra to clear old data automatically.

My idea is this: perform compaction based on the range of dates available in 
the sstable (or just metadata about when it was created).  For example, a major 
compaction could create a combined sstable per day--so that, say, 60 days of 
data would end up as 60 sstables after a major compaction.
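
To make the idea concrete, here's a rough, self-contained sketch of the grouping 
logic I have in mind--this is not the real compaction strategy API, and the 
metadata fields are stand-ins, but it shows how sstables could be bucketed by day 
and each bucket compacted on its own:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

public class DayBucketSketch {
    // Stand-in for the per-sstable metadata a real strategy would read.
    static class SSTableInfo {
        final String name;
        final long maxTimestampMicros;   // newest cell timestamp in the sstable
        SSTableInfo(String name, long maxTimestampMicros) {
            this.name = name;
            this.maxTimestampMicros = maxTimestampMicros;
        }
    }

    // Bucket sstables by the day of their newest data; each bucket would become
    // one compaction task, so every day ends up in its own sstable that can
    // later be dropped wholesale.
    static Map<Long, List<SSTableInfo>> bucketByDay(Collection<SSTableInfo> sstables) {
        Map<Long, List<SSTableInfo>> buckets = new TreeMap<Long, List<SSTableInfo>>();
        for (SSTableInfo s : sstables) {
            long day = TimeUnit.MICROSECONDS.toDays(s.maxTimestampMicros);
            List<SSTableInfo> bucket = buckets.get(day);
            if (bucket == null) {
                bucket = new ArrayList<SSTableInfo>();
                buckets.put(day, bucket);
            }
            bucket.add(s);
        }
        return buckets;
    }
}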

My question then is: would this be possible by simply implementing a separate 
AbstractCompactionStrategy?  Does this sound feasible at all?  Based on the 
implementations of the SizeTiered and Leveled strategies, it looks like I would 
have the ability to control what and how things get compacted, but I wanted to 
verify before putting time into it.

Thank you so much for your time!

Andrew

--
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
