I'd suggest creating one table per day, and dropping the tables you don't
need once you're done.
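
For illustration, a rough sketch of that approach using the DataStax Java
driver (2.x) -- the contact point, "audit" keyspace, table naming scheme, and
60-day retention window are all placeholders, not anything C* mandates:

    import java.text.SimpleDateFormat;
    import java.util.Calendar;
    import java.util.Date;

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class DailyAuditTables {
        private static final SimpleDateFormat FMT = new SimpleDateFormat("yyyyMMdd");

        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect("audit");

            // Create today's table; IF NOT EXISTS makes this safe to re-run.
            String today = FMT.format(new Date());
            session.execute("CREATE TABLE IF NOT EXISTS audit_" + today
                    + " (id timeuuid PRIMARY KEY, event text)");

            // Drop the table that just aged out of the retention window.
            Calendar cutoff = Calendar.getInstance();
            cutoff.add(Calendar.DAY_OF_MONTH, -60);
            session.execute("DROP TABLE IF EXISTS audit_" + FMT.format(cutoff.getTime()));

            cluster.close();
        }
    }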


On Wed, Jun 4, 2014 at 10:44 AM, Redmumba <redmu...@gmail.com> wrote:

> Sorry, yes, that is what I was looking to do--i.e., create a
> "TopologicalCompactionStrategy" or similar.
>
>
> On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry <rbradbe...@gmail.com>
> wrote:
>
>> Maybe I’m misunderstanding something, but what makes you think that
>> running a major compaction every day will cause the data from January 1st
>> to exist in only one SSTable and not have data from other days in the
>> SSTable as well? Are you talking about making a new compaction strategy
>> that creates SSTables by day?
>>
>>
>>
>> On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote:
>>
>>  Let's say I run a major compaction every day, so that the "oldest"
>> sstable contains only the data for January 1st.  Assuming all the nodes are
>> in-sync and have had at least one repair run before the table is dropped
>> (so that all information for that time period is "the same"), wouldn't it
>> be safe to assume that the same data would be dropped on all nodes?  There
>> might be a period when the compaction is running where different nodes
>> might have an inconsistent view of just that day's data (in that some would
>> have it and others would not), but the cluster would still function and
>> become eventually consistent, correct?
>>
>> Also, if the entirety of the sstable is being dropped, wouldn't its
>> tombstones be removed along with it?  I wouldn't be concerned with
>> individual rows and columns; this is a write-mostly table--the only deletes
>> that occur in the current system are of old data.
>>
>>
>> On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry <rbradbe...@gmail.com>
>> wrote:
>>
>>> I’m not sure what you want to do is feasible.  At a high level, I can
>>> see you running into issues with RF, etc.  SSTables are not identical from
>>> node to node, so if you drop a full SSTable on one node there is no
>>> corresponding SSTable on the adjacent nodes to drop.  You would need to
>>> choose data to compact out and ensure it is removed on all replicas as
>>> well.  But if your problem is that you’re low on disk space, then you
>>> probably won’t be able to write out a new SSTable with the older
>>> information compacted out.  Also, there is more to an SSTable than just
>>> data; the SSTable could have tombstones and other relics that haven’t been
>>> cleaned up from nodes coming or going.
>>>
>>>
>>>
>>>
>>> On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote:
>>>
>>> Thanks, Russell--yes, a similar concept, just applied to sstables.
>>> I'm assuming this would require changes to both major compaction and
>>> probably GC (to remove the old tables), but since I'm not super-familiar
>>> with the C* internals, I wanted to make sure it was feasible with the
>>> current toolset before I actually dived in and started tinkering.
>>>
>>> Andrew
>>>
>>>
>>> On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry <rbradbe...@gmail.com
>>> > wrote:
>>>
>>>> Hmm, I see. So something similar to Capped Collections in MongoDB.
>>>>
>>>>
>>>>
>>>> On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote:
>>>>
>>>> Not quite; if I'm at, say, 90% disk usage, I'd like to drop the oldest
>>>> sstable rather than simply run out of space.
>>>>
>>>> The problem with using TTLs is that I have to try to guess how much
>>>> data is being put in--since this is auditing data, the usage can vary
>>>> wildly depending on time of year, verbosity of auditing, etc.  I'd like to
>>>> maximize the disk space used--not optimize the cleanup process.
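>>>>
>>>> (Purely as a sketch of the FIFO behavior I mean--the 90% threshold, the
>>>> directory walk, and deleting files straight off the filesystem are all
>>>> hypothetical stand-ins; a real strategy would do this through C* itself:)
>>>>
>>>>     import java.io.File;
>>>>     import java.util.Arrays;
>>>>     import java.util.Comparator;
>>>>
>>>>     public class FifoReclaimer {
>>>>         private static final double MAX_USAGE = 0.90;  // start dropping past 90% full
>>>>
>>>>         // Drop the oldest sstable (oldest by file modification time)
>>>>         // until disk usage falls back under the threshold.
>>>>         static void reclaim(File dataDir) {
>>>>             while (usage(dataDir) > MAX_USAGE) {
>>>>                 File[] tables = dataDir.listFiles((d, n) -> n.endsWith("-Data.db"));
>>>>                 if (tables == null || tables.length == 0) {
>>>>                     return;  // nothing left to reclaim
>>>>                 }
>>>>                 Arrays.sort(tables, Comparator.comparingLong(File::lastModified));
>>>>                 tables[0].delete();  // stand-in; companion files would need dropping too
>>>>             }
>>>>         }
>>>>
>>>>         static double usage(File dir) {
>>>>             return 1.0 - (double) dir.getUsableSpace() / dir.getTotalSpace();
>>>>         }
>>>>     }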
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry <rbradbe...@gmail.com
>>>> > wrote:
>>>>
>>>>>  You mean this:
>>>>>
>>>>>  https://issues.apache.org/jira/browse/CASSANDRA-5228
>>>>>
>>>>>  ?
>>>>>
>>>>>
>>>>>
>>>>> On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote:
>>>>>
>>>>>   Good morning!
>>>>>
>>>>> I've asked (and seen other people ask) about the ability to drop old
>>>>> sstables, basically creating a FIFO-like clean-up process.  Since we're
>>>>> using Cassandra as an auditing system, this is particularly appealing to us
>>>>> because it means we can maximize the amount of auditing data we can keep
>>>>> while still allowing Cassandra to clear old data automatically.
>>>>>
>>>>> My idea is this: perform compaction based on the range of dates
>>>>> available in the sstable (or just metadata about when it was created).  For
>>>>> example, a major compaction could create a combined sstable per day--so
>>>>> that, say, 60 days of data would, after a major compaction, be spread
>>>>> across 60 sstables.
>>>>>
>>>>> My question then is, will this be possible by simply implementing a
>>>>> separate AbstractCompactionStrategy?  Does this sound feasible at all?
>>>>> Based on the implementations of the Size and Leveled strategies, it looks
>>>>> like I would have the ability to control what and how things get compacted,
>>>>> but I wanted to verify before putting time into it.
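>>>>>
>>>>> To make the per-day bucketing concrete, here is roughly the grouping
>>>>> logic I am picturing, written against a stand-in sstable type rather than
>>>>> the real SSTableReader (the names and the microsecond-timestamp assumption
>>>>> are mine, purely illustrative):
>>>>>
>>>>>     import java.util.ArrayList;
>>>>>     import java.util.List;
>>>>>     import java.util.Map;
>>>>>     import java.util.TreeMap;
>>>>>     import java.util.concurrent.TimeUnit;
>>>>>
>>>>>     public class DayBuckets {
>>>>>         // Stand-in for an sstable: all the grouping needs is the newest
>>>>>         // cell timestamp it contains (C* cell timestamps are microseconds).
>>>>>         static class SSTable {
>>>>>             final String name;
>>>>>             final long maxTimestampMicros;
>>>>>             SSTable(String name, long maxTimestampMicros) {
>>>>>                 this.name = name;
>>>>>                 this.maxTimestampMicros = maxTimestampMicros;
>>>>>             }
>>>>>         }
>>>>>
>>>>>         // Group sstables by the UTC day of their max timestamp; a per-day
>>>>>         // major compaction would merge each bucket into a single sstable,
>>>>>         // and expiring a whole day then means dropping one bucket.
>>>>>         static Map<Long, List<SSTable>> byDay(List<SSTable> sstables) {
>>>>>             Map<Long, List<SSTable>> buckets = new TreeMap<>();
>>>>>             for (SSTable t : sstables) {
>>>>>                 long day = TimeUnit.MICROSECONDS.toDays(t.maxTimestampMicros);
>>>>>                 buckets.computeIfAbsent(day, d -> new ArrayList<>()).add(t);
>>>>>             }
>>>>>             return buckets;
>>>>>         }
>>>>>     }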
>>>>>
>>>>> Thank you so much for your time!
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade
