Re: Schema questions for data structures with recently-modified access patterns

Jack Krupansky Thu, 23 Jul 2015 20:49:29 -0700

Concurrent updates should not be problematic, and duplicate entries should
not be created. If they appear to be, describe the issue you are seeing so
we can tell whether it is a real problem.


But at least from all of the details you have disclosed so far, there does
not appear to be any indication that this type of time series would be
anything other than a good fit for Cassandra.

Besides, the new materialized view feature of Cassandra 3.0 would make it
an even easier fit.
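
A minimal sketch of how a materialized view might cover this, assuming the
base table can carry a partitioning column. The `bucket` column, the view
name `docs_by_last_modified`, and the literal 'all' are made up for
illustration and are not from the schemas posted in this thread. A view's
primary key can include only one column that is not part of the base table's
primary key, which is why `bucket` sits in the base key here:

```cql
-- Base table: one row per document, so a document cannot appear twice.
-- "bucket" is a hypothetical partitioning column (assumption).
CREATE TABLE document (
    bucket TEXT,
    docid UUID,
    doc TEXT,
    last_modified TIMEUUID,
    PRIMARY KEY ((bucket), docid)
);

-- View ordered by modification time, newest first. When last_modified
-- changes in the base row, Cassandra replaces the old view row itself.
CREATE MATERIALIZED VIEW docs_by_last_modified AS
    SELECT bucket, last_modified, docid
    FROM document
    WHERE bucket IS NOT NULL
      AND last_modified IS NOT NULL
      AND docid IS NOT NULL
    PRIMARY KEY ((bucket), last_modified, docid)
    WITH CLUSTERING ORDER BY (last_modified DESC, docid ASC);

-- Updating a document touches only the base table.
UPDATE document
   SET doc = 'new text', last_modified = now()
 WHERE bucket = 'all' AND docid = 123e4567-e89b-12d3-a456-426655440000;

-- The 20 most recently modified documents.
SELECT docid, last_modified
  FROM docs_by_last_modified
 WHERE bucket = 'all'
 LIMIT 20;
```

With a single bucket value everything lands in one partition, so a real
schema would probably need some way to spread the documents over several
buckets.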

-- Jack Krupansky

On Thu, Jul 23, 2015 at 6:30 PM, Robert Wille <rwi...@fold3.com> wrote:

>  I obviously worded my original email poorly. I guess that’s what happens
> when you post at the end of the day just before quitting.
>
>  I want to get a list of documents, ordered from most-recently modified
> to least-recently modified, with each document appearing exactly once.
>
>  Jack, your schema does exactly that, and is essentially the same as mine
> (with the exception that mine is missing the DESC clause, and that I have a
> partitioning column while you have only clustering columns).
>
>  The problem I have with my schema (or Jack’s) is that it is very easy
> for a document to get in the list multiple times. Concurrent updates to the
> document, for example. Also, a consistency issue could cause the document
> to appear in the list more than once.
>
>  I think that Alec Collier’s comment is probably accurate, that this kind
> of a pattern just isn’t a good fit for Cassandra.
>
>  On Jul 23, 2015, at 1:54 PM, Jack Krupansky <jack.krupan...@gmail.com>
> wrote:
>
>  Maybe you could explain in more detail what you mean by recently
> modified documents, since that is precisely what I thought I suggested with
> descending ordering.
>
>  -- Jack Krupansky
>
> On Thu, Jul 23, 2015 at 3:40 PM, Robert Wille <rwi...@fold3.com> wrote:
>
>> Neither Carlos’ suggestion nor yours provided a way to query
>> recently-modified documents.
>>
>>  His updated suggestion provides a way to get recently-modified
>> documents, but not ordered.
>>
>>  On Jul 22, 2015, at 4:19 PM, Jack Krupansky <jack.krupan...@gmail.com>
>> wrote:
>>
>>  "No way to query recently-modified documents."
>>
>>  I don't follow why you say that. I mean, that was the point of the data
>> model suggestion I proposed. Maybe you could clarify.
>>
>>  I also wanted to mention that the new materialized view feature of
>> Cassandra 3.0 might handle this use case, including taking care of the
>> delete, automatically.
>>
>>
>>  -- Jack Krupansky
>>
>> On Tue, Jul 21, 2015 at 12:37 PM, Robert Wille <rwi...@fold3.com> wrote:
>>
>>> The time series doesn’t provide the access pattern I’m looking for. No
>>> way to query recently-modified documents.
>>>
>>>  On Jul 21, 2015, at 9:13 AM, Carlos Alonso <i...@mrcalonso.com> wrote:
>>>
>>>  Hi Robert,
>>>
>>>  What about modelling it as a time series?
>>>
>>>  CREATE TABLE document (
>>>   docId UUID,
>>>   doc TEXT,
>>>   last_modified TIMESTAMP,
>>>   PRIMARY KEY(docId, last_modified)
>>> ) WITH CLUSTERING ORDER BY (last_modified DESC);
>>>
>>>  This way, the latest modification will always be the first record
>>> in the row, so accessing it should be as easy as:
>>>
>>>  SELECT * FROM document WHERE docId = <the docId> LIMIT 1;
>>>
>>>  And, if you experience disk space issues due to very long rows, you
>>> can always expire old ones using TTL or in a batch job. Tombstones will
>>> never be a problem in this case because, due to the specified clustering
>>> order, the latest modification will always be the first record in the row.
>>>
>>>  Hope it helps.
>>>
>>>  Carlos Alonso | Software Engineer | @calonso
>>> <https://twitter.com/calonso>
>>>
>>> On 21 July 2015 at 05:59, Robert Wille <rwi...@fold3.com> wrote:
>>>
>>>> Data structures that have a recently-modified access pattern seem to be
>>>> a poor fit for Cassandra. I’m wondering if any of you smart guys can
>>>> provide suggestions.
>>>>
>>>> For the sake of discussion, let’s assume I have the following tables:
>>>>
>>>> CREATE TABLE document (
>>>>         docId UUID,
>>>>         doc TEXT,
>>>>         last_modified TIMEUUID,
>>>>         PRIMARY KEY ((docid))
>>>> );
>>>>
>>>> CREATE TABLE doc_by_last_modified (
>>>>         date TEXT,
>>>>         last_modified TIMEUUID,
>>>>         docId UUID,
>>>>         PRIMARY KEY ((date), last_modified)
>>>> );
>>>>
>>>> When I update a document, I retrieve its last_modified time, delete the
>>>> current record from doc_by_last_modified, and add a new one. Unfortunately,
>>>> if you’d like each document to appear at most once in the
>>>> doc_by_last_modified table, then this doesn’t work so well.
>>>>
>>>> Documents can get into the doc_by_last_modified table multiple times if
>>>> there is concurrent access, or if there is a consistency issue.
>>>>
>>>> Any thoughts out there on how to efficiently provide recently-modified
>>>> access to a table? This problem exists for many types of data structures,
>>>> not just recently-modified. Any ordered data structure that can be
>>>> dynamically reordered suffers from the same problems. As I’ve been doing
>>>> schema design, this pattern keeps recurring. A nice way to address this
>>>> problem has lots of applications.
>>>>
>>>> Thanks in advance for your thoughts
>>>>
>>>> Robert
>>>>
>>>>
>>>
>>>
>>
>>
>
>
