Re: Schema questions for data structures with recently-modified access patterns

2015-07-24 Thread Robert Wille
When performing an update, the following needs to happen:

1. Read document.last_modified
2. Get the current timestamp
3. Update document with last_modified=current timestamp
4. Insert into doc_by_last_modified with last_modified=current timestamp
5. Delete from doc_by_last_modified with last_modified=the timestamp from step 1

If two parties do the above at roughly the same time, such that in step 1 they 
both read the same last_modified timestamp, then when they do step 5, they’ll 
both delete the same old record from doc_by_last_modified, and you’ll get two 
records for the same document in doc_by_last_modified.
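
The interleaving described above can be written out as a timeline. This is only a sketch: it assumes the document and doc_by_last_modified tables from the original post, and that all three timestamps fall on the same day. The bind markers (:docid, :day, :t0, :t_a, :t_b) are hypothetical names, with :t0 standing for the value both clients read in step 1.

```cql
-- Step 1: both clients read before either one writes, so both see :t0.
SELECT last_modified FROM document WHERE docid = :docid;   -- client A
SELECT last_modified FROM document WHERE docid = :docid;   -- client B

-- Steps 3-4, client A (new timestamp :t_a).
UPDATE document SET last_modified = :t_a WHERE docid = :docid;
INSERT INTO doc_by_last_modified (date, last_modified, docid)
VALUES (:day, :t_a, :docid);

-- Steps 3-4, client B (new timestamp :t_b).
UPDATE document SET last_modified = :t_b WHERE docid = :docid;
INSERT INTO doc_by_last_modified (date, last_modified, docid)
VALUES (:day, :t_b, :docid);

-- Step 5: each client deletes the row for :t0, which is the same row.
DELETE FROM doc_by_last_modified WHERE date = :day AND last_modified = :t0;  -- client A
DELETE FROM doc_by_last_modified WHERE date = :day AND last_modified = :t0;  -- client B

-- Nothing ever deletes the :t_a row, so the rows for :t_a and :t_b
-- both remain, and both point at the same docid.
```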

Would it work to put steps 3-5 into an atomic batch and use a lightweight 
transaction for step 3? (e.g. UPDATE document SET doc = :doc, last_modified = 
:cur_ts WHERE docid = :docid IF last_modified = :prev_ts) If a lightweight 
transaction is batched with other statements on other tables, will the other 
statements get cancelled if the lightweight transaction is cancelled?

Robert
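
On the batching question: as far as I know, Cassandra rejects a batch that combines a conditional (IF) statement with statements on other tables, because conditional batches are limited to a single partition of one table. If so, steps 3-5 cannot be combined that way. One possible alternative, sketched here under that assumption, is to run the conditional update by itself and have the client continue only when the result reports [applied] = true. The bind markers :new_day and :old_day are hypothetical; the others come from the statement above.

```cql
-- Step 3 as a standalone lightweight transaction. It is applied only if
-- last_modified still holds the value read in step 1 (:prev_ts).
UPDATE document
   SET doc = :doc, last_modified = :cur_ts
 WHERE docid = :docid
    IF last_modified = :prev_ts;

-- The client checks the [applied] column of the result.
-- false: another writer got there first, so re-read and retry from step 1.
-- true:  this client is the only one responsible for the :prev_ts row.

-- Steps 4-5, sent only when [applied] was true.
BEGIN BATCH
  INSERT INTO doc_by_last_modified (date, last_modified, docid)
  VALUES (:new_day, :cur_ts, :docid);
  DELETE FROM doc_by_last_modified
   WHERE date = :old_day AND last_modified = :prev_ts;
APPLY BATCH;
```

A client failure between the two requests can still leave a stale row behind, so this narrows the race rather than ruling out every inconsistency.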


Re: Schema questions for data structures with recently-modified access patterns

2015-07-23 Thread Jack Krupansky
Concurrent updates should not be problematic, and duplicate entries should
not be created. If they appear to be, explain the issue you are seeing so we
can tell whether it is real.

But at least from all of the details you have disclosed so far, there does
not appear to be any indication that this type of time series would be
anything other than a good fit for Cassandra.

Besides, the new materialized view feature of Cassandra 3.0 would make it
an even easier fit.

-- Jack Krupansky


Re: Schema questions for data structures with recently-modified access patterns

2015-07-23 Thread Robert Wille
Neither Carlos’ suggestion nor yours provided a way to query 
recently-modified documents.

His updated suggestion provides a way to get recently-modified documents, but 
not ordered.


Re: Schema questions for data structures with recently-modified access patterns

2015-07-23 Thread Jack Krupansky
Maybe you could explain in more detail what you mean by recently modified
documents, since that is precisely what I thought I suggested with
descending ordering.

-- Jack Krupansky


Re: Schema questions for data structures with recently-modified access patterns

2015-07-23 Thread Robert Wille
I obviously worded my original email poorly. I guess that’s what happens when 
you post at the end of the day just before quitting.

I want to get a list of documents, ordered from most-recently modified to 
least-recently modified, with each document appearing exactly once.

Jack, your schema does exactly that, and is essentially the same as mine (with 
the exception of my missing the DESC clause, and that I have a partitioning 
column and you only have clustering columns).

The problem I have with my schema (or Jack’s) is that it is very easy for a 
document to get in the list multiple times. Concurrent updates to the document, 
for example. Also, a consistency issue could cause the document to appear in 
the list more than once.

I think that Alec Collier’s comment is probably accurate, that this kind of a 
pattern just isn’t a good fit for Cassandra.
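
For reference, the read side of this pattern might look like the following. It is a sketch based on the doc_by_last_modified table from the original post, with the DESC clause mentioned above added; :day and :n are hypothetical bind markers.

```cql
CREATE TABLE doc_by_last_modified (
  date TEXT,
  last_modified TIMEUUID,
  docid UUID,
  PRIMARY KEY ((date), last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

-- Newest entries first within a single day's partition.
SELECT docid, last_modified
  FROM doc_by_last_modified
 WHERE date = :day
 LIMIT :n;
```

Listing across several days means querying one partition per day, newest day first. Nothing in the table itself stops a document from appearing under two different timestamps, which is the duplication problem described above.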


Re: Schema questions for data structures with recently-modified access patterns

2015-07-22 Thread Carlos Alonso
Ah, so your access pattern is to get all documents modified on a
particular date, right?

Then I think your approach is good. To avoid duplication, why not add
the docId as the first clustering column and remove the last_modified field
from the key?
That way, your primary key would be PRIMARY KEY(date, docId), which keeps all
docs modified on the same day together in the same partition, and two
updates to a document on the same date won't generate two rows, as the
primary key would be exactly the same.

Does it make sense?

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
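
A minimal sketch of the table this suggestion seems to describe. Keeping last_modified as a regular column is an assumption (the message only says to take it out of the key), and :day, :docid and :cur_ts are hypothetical bind markers.

```cql
CREATE TABLE doc_by_last_modified (
  date TEXT,
  docid UUID,
  last_modified TIMEUUID,
  PRIMARY KEY ((date), docid)
);

-- A second update to the same document on the same day overwrites the
-- existing row instead of adding another one.
INSERT INTO doc_by_last_modified (date, docid, last_modified)
VALUES (:day, :docid, :cur_ts);
```

The trade-off is that rows within a day's partition are sorted by docid rather than by modification time, and a document modified on two different days still appears once in each day's partition.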


Re: Schema questions for data structures with recently-modified access patterns

2015-07-22 Thread Jack Krupansky
> No way to query recently-modified documents.

I don't follow why you say that. I mean, that was the point of the data
model suggestion I proposed. Maybe you could clarify.

I also wanted to mention that the new materialized view feature of
Cassandra 3.0 might handle this use case, including taking care of the
delete, automatically.


-- Jack Krupansky
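
A rough sketch of what the materialized view approach could look like, using the CREATE MATERIALIZED VIEW syntax from Cassandra 3.0. As far as I know, a view's primary key may contain at most one column that is not part of the base table's primary key, so this sketch does not use the tables from the original post. Instead it assumes a hypothetical base table, document_bucketed, with an invented bucket column in its key; the view name is also made up.

```cql
-- Hypothetical base table. bucket is an assumed coarse grouping column,
-- for example a hash of docid modulo a small number.
CREATE TABLE document_bucketed (
  bucket INT,
  docid UUID,
  doc TEXT,
  last_modified TIMESTAMP,
  PRIMARY KEY ((bucket), docid)
);

-- The view adds exactly one non-key column (last_modified) to the key.
CREATE MATERIALIZED VIEW doc_by_last_modified_mv AS
  SELECT bucket, last_modified, docid
    FROM document_bucketed
   WHERE bucket IS NOT NULL
     AND last_modified IS NOT NULL
     AND docid IS NOT NULL
  PRIMARY KEY ((bucket), last_modified, docid)
  WITH CLUSTERING ORDER BY (last_modified DESC, docid ASC);
```

When last_modified changes on a base row, the view is expected to remove the old view row and add the new one, which is the automatic delete mentioned above. How well that holds up under concurrent writers would need to be tested.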


RE: Schema questions for data structures with recently-modified access patterns

2015-07-22 Thread Alec Collier
I believe what he really wants is to be able to search for the x most recently 
modified documents, i.e. without specifying the docID.

I don’t believe there is a ‘nice’ way of doing this in Cassandra by itself, 
given it really favours key-value storage. Even having the date as the 
partition key is usually not recommended because it means all writes on a given 
date will be hitting one node.

Perhaps Solr integration is the way to go for this access pattern?

Alec Collier



Re: Schema questions for data structures with recently-modified access patterns

2015-07-21 Thread Robert Wille
The time series doesn’t provide the access pattern I’m looking for. It gives no 
way to query the most recently modified documents across all docIds.

On Jul 21, 2015, at 9:13 AM, Carlos Alonso i...@mrcalonso.com wrote:

Hi Robert,

What about modelling it as a time series?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP,
  PRIMARY KEY (docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the 
row, so accessing it should be as easy as:

SELECT * FROM document WHERE docId = :docId LIMIT 1;

And if you experience disk space issues due to very long rows, you can 
always expire old ones using a TTL or a batch job. Tombstones will never be a 
problem in this case because, due to the specified clustering order, the latest 
modification will always be the first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor 
fit for Cassandra. I’m wondering if any of you smart guys can provide 
suggestions.

For the sake of discussion, let’s assume I have the following tables:

CREATE TABLE document (
docId UUID,
doc TEXT,
last_modified TIMEUUID,
PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
date TEXT,
last_modified TIMEUUID,
docId UUID,
PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the current 
record from doc_by_last_modified, and add a new one. Unfortunately, if you’d 
like each document to appear at most once in the doc_by_last_modified table, 
then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there 
is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access 
to a table? This problem exists for many types of data structures, not just 
recently-modified. Any ordered data structure that can be dynamically reordered 
suffers from the same problems. As I’ve been doing schema design, this pattern 
keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert





Re: Schema questions for data structures with recently-modified access patterns

2015-07-21 Thread Carlos Alonso
Hi Robert,

What about modelling it as a time series?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP,
  PRIMARY KEY (docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in
the row, so accessing it should be as easy as:

SELECT * FROM document WHERE docId = :docId LIMIT 1;

And if you experience disk space issues due to very long rows, you can
always expire old ones using a TTL or a batch job. Tombstones will never
be a problem in this case because, due to the specified clustering order,
the latest modification will always be the first record in the row.
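
As a rough sketch of the TTL option, a new version could be written with an expiry and the latest version read back as follows. The UUID, text and timestamp values are made up for illustration, and the 30-day TTL is an arbitrary choice rather than a recommendation.

```cql
-- Hypothetical values; 2592000 seconds = 30 days.
INSERT INTO document (docId, doc, last_modified)
VALUES (123e4567-e89b-12d3-a456-426655440000,
        'revised text',
        '2015-07-21 09:13:00+0000')
USING TTL 2592000;

-- With last_modified clustered DESC, LIMIT 1 returns the newest version.
SELECT doc, last_modified
FROM document
WHERE docId = 123e4567-e89b-12d3-a456-426655440000
LIMIT 1;
```

Note that with a TTL on every write, a document that is not modified again within the TTL would eventually have no rows left, so this only fits if that is acceptable.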

Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:

 Data structures that have a recently-modified access pattern seem to be a
 poor fit for Cassandra. I’m wondering if any of you smart guys can provide
 suggestions.

 For the sake of discussion, let’s assume I have the following tables:

 CREATE TABLE document (
 docId UUID,
 doc TEXT,
 last_modified TIMEUUID,
 PRIMARY KEY ((docid))
 )

 CREATE TABLE doc_by_last_modified (
 date TEXT,
 last_modified TIMEUUID,
 docId UUID,
 PRIMARY KEY ((date), last_modified)
 )

 When I update a document, I retrieve its last_modified time, delete the
 current record from doc_by_last_modified, and add a new one. Unfortunately,
 if you’d like each document to appear at most once in the
 doc_by_last_modified table, then this doesn’t work so well.

 Documents can get into the doc_by_last_modified table multiple times if
 there is concurrent access, or if there is a consistency issue.

 Any thoughts out there on how to efficiently provide recently-modified
 access to a table? This problem exists for many types of data structures,
 not just recently-modified. Any ordered data structure that can be
 dynamically reordered suffers from the same problems. As I’ve been doing
 schema design, this pattern keeps recurring. A nice way to address this
 problem has lots of applications.

 Thanks in advance for your thoughts

 Robert




Re: Schema questions for data structures with recently-modified access patterns

2015-07-21 Thread Jack Krupansky
Keep the original document base table, but then the query table should have
the PK as last_modified, docId, with last_modified descending, so that a
query can get the n most recently modified documents.

Yes, you still need to manually delete the old entry for the document in
the query table if duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a
month or a week ago.
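
The manual delete could be sketched as a logged batch against the doc_by_last_modified table from the original post. The bind markers (:old_day, :prev_ts, :new_day, :cur_ts, :docid) are placeholders for values the application supplies through a prepared statement; they are assumptions for illustration only.

```cql
-- Sketch only: move a document's entry in the query table.
-- Assumes the original schema: PRIMARY KEY ((date), last_modified).
BEGIN BATCH
  -- Remove the entry written by the previous update.
  DELETE FROM doc_by_last_modified
    WHERE date = :old_day AND last_modified = :prev_ts;
  -- Add the entry for the current update.
  INSERT INTO doc_by_last_modified (date, last_modified, docId)
    VALUES (:new_day, :cur_ts, :docid);
APPLY BATCH;
```

A logged batch ensures that if one statement is applied, the other eventually is too, but it does not isolate concurrent writers. Two updates that both read the same :prev_ts would each delete the same old entry and each insert a new one.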

-- Jack Krupansky

On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso i...@mrcalonso.com wrote:

 Hi Robert,

 What about modelling it as a time series?

 CREATE TABLE document (
   docId UUID,
   doc TEXT,
   last_modified TIMESTAMP,
   PRIMARY KEY (docId, last_modified)
 ) WITH CLUSTERING ORDER BY (last_modified DESC);

 This way, the latest modification will always be the first record in
 the row, so accessing it should be as easy as:

 SELECT * FROM document WHERE docId = :docId LIMIT 1;

 And if you experience disk space issues due to very long rows, you
 can always expire old ones using a TTL or a batch job. Tombstones will
 never be a problem in this case because, due to the specified clustering
 order, the latest modification will always be the first record in the row.

 Hope it helps.

 Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

 On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:

 Data structures that have a recently-modified access pattern seem to be a
 poor fit for Cassandra. I’m wondering if any of you smart guys can provide
 suggestions.

 For the sake of discussion, let’s assume I have the following tables:

 CREATE TABLE document (
 docId UUID,
 doc TEXT,
 last_modified TIMEUUID,
 PRIMARY KEY ((docid))
 )

 CREATE TABLE doc_by_last_modified (
 date TEXT,
 last_modified TIMEUUID,
 docId UUID,
 PRIMARY KEY ((date), last_modified)
 )

 When I update a document, I retrieve its last_modified time, delete the
 current record from doc_by_last_modified, and add a new one. Unfortunately,
 if you’d like each document to appear at most once in the
 doc_by_last_modified table, then this doesn’t work so well.

 Documents can get into the doc_by_last_modified table multiple times if
 there is concurrent access, or if there is a consistency issue.

 Any thoughts out there on how to efficiently provide recently-modified
 access to a table? This problem exists for many types of data structures,
 not just recently-modified. Any ordered data structure that can be
 dynamically reordered suffers from the same problems. As I’ve been doing
 schema design, this pattern keeps recurring. A nice way to address this
 problem has lots of applications.

 Thanks in advance for your thoughts

 Robert





Re: Schema questions for data structures with recently-modified access patterns

2015-07-21 Thread Robert Wille
If last_modified is a clustering column, it needs a partitioning column, which 
is what date is for (although I should have named it day, and I also forgot to 
add the order by desc clause). This is essentially what I came up with. Still 
not liking how easy it is to get duplicates.
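
For reference, a sketch of the query table with those two corrections applied (the partition column renamed to day, and a descending clustering order). Adding docId as a second clustering column follows Jack's suggested PK and is an assumption here, as is the day format.

```cql
CREATE TABLE doc_by_last_modified (
  day TEXT,                 -- e.g. '2015-07-21'; the format is an assumption
  last_modified TIMEUUID,
  docId UUID,
  PRIMARY KEY ((day), last_modified, docId)
) WITH CLUSTERING ORDER BY (last_modified DESC, docId ASC);

-- The 10 most recently modified documents for one day.
SELECT docId, last_modified
FROM doc_by_last_modified
WHERE day = '2015-07-21'
LIMIT 10;
```

A reader wanting the n most recent documents overall would presumably start with today's partition and continue to earlier days until n rows are collected. This layout does nothing about duplicates by itself.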

On Jul 21, 2015, at 9:31 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Keep the original document base table, but then the query table should have the 
PK as last_modified, docId, with last_modified descending, so that a query can 
get the n most recently modified documents.

Yes, you still need to manually delete the old entry for the document in the 
query table if duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a month or 
a week ago.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso i...@mrcalonso.com wrote:
Hi Robert,

What about modelling it as a time series?

CREATE TABLE document (
  docId UUID,
  doc TEXT,
  last_modified TIMESTAMP,
  PRIMARY KEY (docId, last_modified)
) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the 
row, so accessing it should be as easy as:

SELECT * FROM document WHERE docId = :docId LIMIT 1;

And if you experience disk space issues due to very long rows, you can 
always expire old ones using a TTL or a batch job. Tombstones will never be a 
problem in this case because, due to the specified clustering order, the latest 
modification will always be the first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:
Data structures that have a recently-modified access pattern seem to be a poor 
fit for Cassandra. I’m wondering if any of you smart guys can provide 
suggestions.

For the sake of discussion, let’s assume I have the following tables:

CREATE TABLE document (
docId UUID,
doc TEXT,
last_modified TIMEUUID,
PRIMARY KEY ((docid))
)

CREATE TABLE doc_by_last_modified (
date TEXT,
last_modified TIMEUUID,
docId UUID,
PRIMARY KEY ((date), last_modified)
)

When I update a document, I retrieve its last_modified time, delete the current 
record from doc_by_last_modified, and add a new one. Unfortunately, if you’d 
like each document to appear at most once in the doc_by_last_modified table, 
then this doesn’t work so well.

Documents can get into the doc_by_last_modified table multiple times if there 
is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access 
to a table? This problem exists for many types of data structures, not just 
recently-modified. Any ordered data structure that can be dynamically reordered 
suffers from the same problems. As I’ve been doing schema design, this pattern 
keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert






Re: Schema questions for data structures with recently-modified access patterns

2015-07-21 Thread Victor
I'm relatively new to data modeling in Cassandra, but perhaps instead of
date and last_modified in your primary key for doc_by_last_modified, just
use the docId. This way, you can update the last_modified and date
fields against the docId, which removes the duplicate issue and obviates
the need to delete the current row or add a new one; you'd simply be
updating (upserting?) by the docId.
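
Roughly what I have in mind, as a sketch. The table name doc_last_modified and the bind markers (:day, :cur_ts, :docid) are made up for illustration.

```cql
-- Hypothetical table; one row per document, keyed by docId.
CREATE TABLE doc_last_modified (
  docId UUID,
  date TEXT,
  last_modified TIMEUUID,
  PRIMARY KEY ((docId))
);

-- Writing the same docId again overwrites the previous values,
-- so a document can never appear twice.
UPDATE doc_last_modified
SET date = :day, last_modified = :cur_ts
WHERE docId = :docid;
```

The trade-off, as far as I can tell, is that rows are partitioned by docId and so are no longer stored in last_modified order. A query for the n most recently modified documents would have to read every partition.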

Regards,
Victor

On Mon, Jul 20, 2015 at 11:59 PM, Robert Wille rwi...@fold3.com wrote:

 Data structures that have a recently-modified access pattern seem to be a
 poor fit for Cassandra. I’m wondering if any of you smart guys can provide
 suggestions.

 For the sake of discussion, let’s assume I have the following tables:

 CREATE TABLE document (
 docId UUID,
 doc TEXT,
 last_modified TIMEUUID,
 PRIMARY KEY ((docid))
 )

 CREATE TABLE doc_by_last_modified (
 date TEXT,
 last_modified TIMEUUID,
 docId UUID,
 PRIMARY KEY ((date), last_modified)
 )

 When I update a document, I retrieve its last_modified time, delete the
 current record from doc_by_last_modified, and add a new one. Unfortunately,
 if you’d like each document to appear at most once in the
 doc_by_last_modified table, then this doesn’t work so well.

 Documents can get into the doc_by_last_modified table multiple times if
 there is concurrent access, or if there is a consistency issue.

 Any thoughts out there on how to efficiently provide recently-modified
 access to a table? This problem exists for many types of data structures,
 not just recently-modified. Any ordered data structure that can be
 dynamically reordered suffers from the same problems. As I’ve been doing
 schema design, this pattern keeps recurring. A nice way to address this
 problem has lots of applications.

 Thanks in advance for your thoughts

 Robert