That still involves quite a bit of infrastructure work--it also means that
to read the data, I would have to make N queries, one per table, to retrieve
the audit information (audit data is sorted by a key identifying the item,
and then by date).  I don't think this would yield any benefit (to me) over
simply tombstoning the values or creating a secondary index on date and
doing a DELETE, right?
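
For context, a rough sketch of what the table-per-day layout looks like from
the read side (table and column names here are hypothetical, just to
illustrate the fan-out):

-- one table per day, created and dropped by some external process
CREATE TABLE audit_2014_06_03 (
    item_id  text,
    event_ts timestamp,
    details  text,
    PRIMARY KEY (item_id, event_ts)
);

-- reading one item's history now means one query per retained day
SELECT * FROM audit_2014_06_03 WHERE item_id = 'some-item';
SELECT * FROM audit_2014_06_04 WHERE item_id = 'some-item';
-- ...and so on, N queries for N days of retention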

Is there something internally preventing me from implementing this as a
separate Strategy?
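
If it is feasible, I'd expect the strategy to be wired in per table through
the usual compaction option, along these lines (the class name below is
hypothetical):

ALTER TABLE audit
  WITH compaction = { 'class': 'com.example.TopologicalCompactionStrategy' };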


On Wed, Jun 4, 2014 at 10:47 AM, Jonathan Haddad <j...@jonhaddad.com> wrote:

> I'd suggest creating 1 table per day, and dropping the tables you don't
> need once you're done.
>
>
> On Wed, Jun 4, 2014 at 10:44 AM, Redmumba <redmu...@gmail.com> wrote:
>
>> Sorry, yes, that is what I was looking to do--i.e., create a
>> "TopologicalCompactionStrategy" or similar.
>>
>>
>> On Wed, Jun 4, 2014 at 10:40 AM, Russell Bradberry <rbradbe...@gmail.com>
>> wrote:
>>
>>> Maybe I’m misunderstanding something, but what makes you think that
>>> running a major compaction every day will cause the data from January 1st
>>> to exist in only one SSTable and not have data from other days in the
>>> SSTable as well? Are you talking about making a new compaction strategy
>>> that creates SSTables by day?
>>>
>>>
>>>
>>> On June 4, 2014 at 1:36:10 PM, Redmumba (redmu...@gmail.com) wrote:
>>>
>>>  Let's say I run a major compaction every day, so that the "oldest"
>>> sstable contains only the data for January 1st.  Assuming all the nodes are
>>> in sync and have had at least one repair run before the sstable is dropped
>>> (so that all information for that time period is "the same"), wouldn't it
>>> be safe to assume that the same data would be dropped on all nodes?  There
>>> might be a window, while the compaction is running, in which different
>>> nodes have an inconsistent view of just that day's data (in that some would
>>> have it and others would not), but the cluster would still function and
>>> become eventually consistent, correct?
>>>
>>> Also, if the entire sstable is being dropped, wouldn't the tombstones be
>>> removed with it?  I wouldn't be concerned with individual rows and
>>> columns; this is a write-only table, more or less--the only deletes that
>>> occur in the current system are to remove the old data.
>>>
>>>
>>> On Wed, Jun 4, 2014 at 10:24 AM, Russell Bradberry <rbradbe...@gmail.com
>>> > wrote:
>>>
>>>>  I’m not sure what you want to do is feasible.  At a high level I can
>>>> see you running into issues with RF, etc.  The SSTables are not identical
>>>> from node to node, so if you drop a full SSTable on one node, there is no
>>>> corresponding SSTable on the adjacent nodes to drop.  You would need to
>>>> choose data to compact out, and ensure it is removed on all replicas as
>>>> well.  But if your problem is that you’re low on disk space, then you
>>>> probably won’t be able to write out a new SSTable with the older
>>>> information compacted out.  Also, there is more to an SSTable than just
>>>> data; the SSTable could have tombstones and other relics that haven’t been
>>>> cleaned up from nodes coming or going.
>>>>
>>>>
>>>>
>>>>
>>>> On June 4, 2014 at 1:10:58 PM, Redmumba (redmu...@gmail.com) wrote:
>>>>
>>>>   Thanks, Russell--yes, a similar concept, just applied to sstables.
>>>> I'm assuming this would require changes to both major compaction and
>>>> probably GC (to remove the old tables), but since I'm not super-familiar
>>>> with the C* internals, I wanted to make sure it was feasible with the
>>>> current toolset before I actually dived in and started tinkering.
>>>>
>>>> Andrew
>>>>
>>>>
>>>> On Wed, Jun 4, 2014 at 10:04 AM, Russell Bradberry <
>>>> rbradbe...@gmail.com> wrote:
>>>>
>>>>>  hmm, I see. So something similar to Capped Collections in MongoDB.
>>>>>
>>>>>
>>>>>
>>>>> On June 4, 2014 at 1:03:46 PM, Redmumba (redmu...@gmail.com) wrote:
>>>>>
>>>>>   Not quite; if I'm at say 90% disk usage, I'd like to drop the
>>>>> oldest sstable rather than simply run out of space.
>>>>>
>>>>> The problem with using TTLs is that I have to try to guess how much
>>>>> data is being put in--since this is auditing data, the usage can vary
>>>>> wildly depending on time of year, verbosity of auditing, etc.  I'd like
>>>>> to maximize the disk space--not optimize the cleanup process.
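>>>>>
>>>>> To put it concretely, with TTLs the retention has to be baked into every
>>>>> write up front--something like this (schema and numbers are purely
>>>>> illustrative):
>>>>>
>>>>> INSERT INTO audit (item_id, event_ts, details)
>>>>> VALUES ('some-item', '2014-06-04 10:00:00', '...')
>>>>> USING TTL 5184000;  -- 60 days, a guess that may waste or exhaust disk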
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>> On Wed, Jun 4, 2014 at 9:47 AM, Russell Bradberry <
>>>>> rbradbe...@gmail.com> wrote:
>>>>>
>>>>>>  You mean this:
>>>>>>
>>>>>>  https://issues.apache.org/jira/browse/CASSANDRA-5228
>>>>>>
>>>>>>  ?
>>>>>>
>>>>>>
>>>>>>
>>>>>> On June 4, 2014 at 12:42:33 PM, Redmumba (redmu...@gmail.com) wrote:
>>>>>>
>>>>>>   Good morning!
>>>>>>
>>>>>> I've asked (and seen other people ask) about the ability to drop old
>>>>>> sstables, basically creating a FIFO-like clean-up process.  Since we're
>>>>>> using Cassandra as an auditing system, this is particularly appealing to
>>>>>> us because it means we can maximize the amount of auditing data we can
>>>>>> keep
>>>>>> while still allowing Cassandra to clear old data automatically.
>>>>>>
>>>>>> My idea is this: perform compaction based on the range of dates
>>>>>> available in the sstable (or just metadata about when it was created).
>>>>>> For example, a major compaction could create a combined sstable per
>>>>>> day--so that, say, 60 days of data would end up in 60 sstables after a
>>>>>> major compaction.
>>>>>>
>>>>>> My question then is, would this be possible by simply implementing a
>>>>>> separate AbstractCompactionStrategy?  Does this sound feasible at all?
>>>>>> Based on the implementations of the Size and Leveled strategies, it looks
>>>>>> like I would have the ability to control what and how things get
>>>>>> compacted, but I wanted to verify before putting time into it.
>>>>>>
>>>>>> Thank you so much for your time!
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
> Jon Haddad
> http://www.rustyrazorblade.com
> skype: rustyrazorblade
>
