Re: Schema questions for data structures with recently-modified access patterns
When performing an update, the following needs to happen:

1. Read document.last_modified
2. Get the current timestamp
3. Update document with last_modified = the current timestamp
4. Insert into doc_by_last_modified with last_modified = the current timestamp
5. Delete from doc_by_last_modified with last_modified = the timestamp from step 1

If two parties do the above at roughly the same time, such that in step 1 they both read the same last_modified timestamp, then when they do step 5 they’ll both delete the same old record from doc_by_last_modified, and you’ll get two records for the same document in doc_by_last_modified.

Would it work to put steps 3-5 into an atomic batch and use a lightweight transaction for step 3? For example:

    UPDATE document SET doc = :doc, last_modified = :cur_ts WHERE docid = :docid IF last_modified = :prev_ts

If a lightweight transaction is batched with other statements on other tables, will the other statements get cancelled if the lightweight transaction is cancelled?

Robert

On Jul 23, 2015, at 9:49 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Concurrent update should not be problematic. Duplicate entries should not be created. If it appears to be, explain your apparent issue so we can see whether it is a real issue. But at least from all of the details you have disclosed so far, there does not appear to be any indication that this type of time series would be anything other than a good fit for Cassandra. Besides, the new materialized view feature of Cassandra 3.0 would make it an even easier fit.

-- Jack Krupansky

On Thu, Jul 23, 2015 at 6:30 PM, Robert Wille rwi...@fold3.com wrote:

I obviously worded my original email poorly. I guess that’s what happens when you post at the end of the day just before quitting. I want to get a list of documents, ordered from most-recently modified to least-recently modified, with each document appearing exactly once.

Jack, your schema does exactly that, and is essentially the same as mine (with the exception that mine is missing the DESC clause, and that I have a partitioning column where you only have clustering columns). The problem I have with my schema (or Jack’s) is that it is very easy for a document to get into the list multiple times. Concurrent updates to the document can cause it, for example. A consistency issue could also cause the document to appear in the list more than once. I think that Alec Collier’s comment is probably accurate, that this kind of pattern just isn’t a good fit for Cassandra.

On Jul 23, 2015, at 1:54 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

Maybe you could explain in more detail what you mean by recently modified documents, since that is precisely what I thought I suggested with descending ordering.

-- Jack Krupansky

On Thu, Jul 23, 2015 at 3:40 PM, Robert Wille rwi...@fold3.com wrote:

Neither Carlos’ suggestion nor yours provided a way to query recently-modified documents. His updated suggestion provides a way to get recently-modified documents, but not ordered.

On Jul 22, 2015, at 4:19 PM, Jack Krupansky jack.krupan...@gmail.com wrote:

"No way to query recently-modified documents." I don't follow why you say that. I mean, that was the point of the data model suggestion I proposed. Maybe you could clarify. I also wanted to mention that the new materialized view feature of Cassandra 3.0 might handle this use case, including taking care of the delete, automatically.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 12:37 PM, Robert Wille rwi...@fold3.com wrote:

The time series doesn’t provide the access pattern I’m looking for. No way to query recently-modified documents.

On Jul 21, 2015, at 9:13 AM, Carlos Alonso i...@mrcalonso.com wrote:

Hi Robert, What about modelling it as a time series?

    CREATE TABLE document (
      docId UUID,
      doc TEXT,
      last_modified TIMESTAMP,
      PRIMARY KEY (docId, last_modified)
    ) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:

    SELECT * FROM document WHERE docId = :docId LIMIT 1;

And, if you experience disk space issues due to very long rows, then you can always expire old ones using TTL or a batch job. Tombstones will never be a problem in this case as, due to the specified clustering order, the latest modification will always be the first record in the row. Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
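The race in steps 1-5, and the effect of guarding step 3 with a condition, can be sketched with an in-memory model. This is only an illustration: plain Python dicts stand in for the two tables, the function names are hypothetical, and the model simply assumes that steps 3-5 are skipped together when the condition fails. Whether Cassandra allows a conditional update to be batched with statements on another table is exactly the open question above and is not answered by this sketch.

```python
# In-memory stand-ins for the two tables (assumption: not real Cassandra).
document = {}               # docid -> last_modified
doc_by_last_modified = {}   # last_modified -> docid


def read_last_modified(docid):
    """Step 1: read document.last_modified."""
    return document.get(docid)


def write_update(docid, prev_ts, cur_ts, conditional=False):
    """Steps 3-5. With conditional=True, step 3 acts as a compare-and-set."""
    if conditional and document.get(docid) != prev_ts:
        return False                          # condition failed: skip steps 3-5
    document[docid] = cur_ts                  # step 3
    doc_by_last_modified[cur_ts] = docid      # step 4
    doc_by_last_modified.pop(prev_ts, None)   # step 5
    return True


def entries_for(docid):
    """All index entries that currently point at docid."""
    return sorted(ts for ts, d in doc_by_last_modified.items() if d == docid)


def run_race(conditional):
    """Two writers both perform step 1 before either performs steps 3-5."""
    document.clear()
    doc_by_last_modified.clear()
    document["doc1"] = 100
    doc_by_last_modified[100] = "doc1"
    prev_a = read_last_modified("doc1")
    prev_b = read_last_modified("doc1")
    write_update("doc1", prev_a, 101, conditional)
    write_update("doc1", prev_b, 102, conditional)
    return entries_for("doc1")


naive_entries = run_race(conditional=False)    # both writers succeed
guarded_entries = run_race(conditional=True)   # second writer is rejected
print(naive_entries, guarded_entries)
```

In the unguarded run both writers delete the same old entry and each adds a new one, so the document ends up listed twice. In the guarded run the second writer's condition fails, and it would have to re-read and retry.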
Re: Schema questions for data structures with recently-modified access patterns
Concurrent update should not be problematic. Duplicate entries should not be created. If it appears to be, explain your apparent issue so we can see whether it is a real issue. But at least from all of the details you have disclosed so far, there does not appear to be any indication that this type of time series would be anything other than a good fit for Cassandra. Besides, the new materialized view feature of Cassandra 3.0 would make it an even easier fit.

-- Jack Krupansky
Re: Schema questions for data structures with recently-modified access patterns
Neither Carlos’ suggestion nor yours provided a way to query recently-modified documents. His updated suggestion provides a way to get recently-modified documents, but not ordered.
Re: Schema questions for data structures with recently-modified access patterns
Maybe you could explain in more detail what you mean by recently modified documents, since that is precisely what I thought I suggested with descending ordering.

-- Jack Krupansky
Re: Schema questions for data structures with recently-modified access patterns
I obviously worded my original email poorly. I guess that’s what happens when you post at the end of the day just before quitting. I want to get a list of documents, ordered from most-recently modified to least-recently modified, with each document appearing exactly once.

Jack, your schema does exactly that, and is essentially the same as mine (with the exception that mine is missing the DESC clause, and that I have a partitioning column where you only have clustering columns). The problem I have with my schema (or Jack’s) is that it is very easy for a document to get into the list multiple times. Concurrent updates to the document can cause it, for example. A consistency issue could also cause the document to appear in the list more than once. I think that Alec Collier’s comment is probably accurate, that this kind of pattern just isn’t a good fit for Cassandra.
Re: Schema questions for data structures with recently-modified access patterns
Ah, so your access pattern is to get all documents modified on a particular date, right? Then I think your approach is good. To avoid duplication, why not add the docId as the first clustering column and remove the last_modified field from the key? That way, your primary key would be PRIMARY KEY (date, docId), so all docs modified on the same day are together in the same partition, and two updates on the same date won't generate two rows, because the primary key would be exactly the same. Does it make sense?

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
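A small in-memory sketch of the PRIMARY KEY (date, docId) suggestion, assuming only that writing a row whose primary key already exists overwrites it. The dict and the helper name are hypothetical stand-ins, not Cassandra calls.

```python
# Model the table as a dict keyed by its primary key (date, docid).
doc_by_date = {}   # (date, docid) -> last_modified


def record_update(date, docid, last_modified):
    """Writing an existing key overwrites the row instead of adding one."""
    doc_by_date[(date, docid)] = last_modified


record_update("2015-07-21", "doc1", 1)
record_update("2015-07-21", "doc2", 2)
record_update("2015-07-21", "doc1", 3)   # same key: overwrite, no new row
record_update("2015-07-22", "doc1", 4)   # different date: second row for doc1

# Rows within one day come back in docid order, not modification order.
rows_for_day = sorted(k for k in doc_by_date if k[0] == "2015-07-21")
# Across days, the same document can still appear more than once.
rows_for_doc1 = sorted(k for k in doc_by_date if k[1] == "doc1")
print(rows_for_day, rows_for_doc1)
```

In this model, same-day duplicates disappear, but rows within a day are ordered by docId rather than by modification time, and a document modified on two different days still has two rows unless the older one is deleted.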
Re: Schema questions for data structures with recently-modified access patterns
"No way to query recently-modified documents." I don't follow why you say that. I mean, that was the point of the data model suggestion I proposed. Maybe you could clarify. I also wanted to mention that the new materialized view feature of Cassandra 3.0 might handle this use case, including taking care of the delete, automatically.

-- Jack Krupansky
RE: Schema questions for data structures with recently-modified access patterns
I believe what he really wants is to be able to search for the x most recently modified documents, i.e. without specifying the docID. I don’t believe there is a ‘nice’ way of doing this in Cassandra by itself, given it really favours key-value storage. Even having the date as the partition key is usually not recommended because it means all writes on a given date will be hitting one node. Perhaps Solr integration is the way to go for this access pattern?

Alec Collier
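The hot-spot concern about using the date as the partition key can be illustrated with a toy placement function. This is a sketch under loud assumptions: the three node names are hypothetical, MD5 stands in for the cluster's real partitioner, and replication is ignored entirely.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]   # hypothetical three-node cluster


def node_for(partition_key):
    """Toy placement: hash the partition key and pick a node."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# 1000 writes on one date all share the partition key, hence the node.
nodes_hit_by_date_key = {node_for("2015-07-23") for _ in range(1000)}

# 1000 writes keyed by document ID spread across the cluster.
nodes_hit_by_docid_key = {node_for(f"doc-{i}") for i in range(1000)}

print(len(nodes_hit_by_date_key), len(nodes_hit_by_docid_key))
```

Every write for a given date lands on the same node in this model, while writes keyed by document ID are spread out.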
Re: Schema questions for data structures with recently-modified access patterns
The time series doesn’t provide the access pattern I’m looking for. No way to query recently-modified documents.
Re: Schema questions for data structures with recently-modified access patterns
Hi Robert,

What about modelling it as a time series?

    CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMESTAMP,
        PRIMARY KEY (docId, last_modified)
    ) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:

    SELECT * FROM document WHERE docId = <the docId> LIMIT 1;

And if you experience disk space issues due to very long rows, you can always expire old ones using a TTL or in a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

On 21 July 2015 at 05:59, Robert Wille rwi...@fold3.com wrote:

Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.

For the sake of discussion, let’s assume I have the following tables:

    CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMEUUID,
        PRIMARY KEY ((docId))
    );

    CREATE TABLE doc_by_last_modified (
        date TEXT,
        last_modified TIMEUUID,
        docId UUID,
        PRIMARY KEY ((date), last_modified)
    );

When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert
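
A minimal Python sketch of the time-series idea above, using an in-memory dict as a stand-in for the table. The class and method names are hypothetical and no Cassandra driver is involved; it only illustrates that every update adds a version row to the document's partition and that the newest version is read back first.

```python
from collections import defaultdict


class DocumentVersions:
    """In-memory stand-in for the time-series `document` table (hypothetical)."""

    def __init__(self):
        # Partition key (doc_id) -> {last_modified: doc}. A second write with the
        # same (doc_id, last_modified) overwrites, as an upsert on a primary key would.
        self._partitions = defaultdict(dict)

    def write(self, doc_id, last_modified, doc):
        """Add a new version row to the document's partition."""
        self._partitions[doc_id][last_modified] = doc

    def latest(self, doc_id):
        """Stand-in for: SELECT ... WHERE docId = ? LIMIT 1 with DESC clustering."""
        rows = self._partitions.get(doc_id)
        if not rows:
            return None
        newest = max(rows)
        return newest, rows[newest]

    def recently_modified(self, n):
        """Newest-first doc ids across ALL documents.

        Assumption of this sketch: nothing in the time-series table answers this
        from a single partition, so every partition is scanned here.
        """
        newest_per_doc = [(max(rows), doc_id) for doc_id, rows in self._partitions.items()]
        return [doc_id for _, doc_id in sorted(newest_per_doc, reverse=True)[:n]]


table = DocumentVersions()
table.write("doc-a", 1, "first draft")
table.write("doc-b", 2, "other document")
table.write("doc-a", 3, "second draft")
```

Reading one document's latest version touches a single partition, while `recently_modified` has to look at every document, which is the access pattern the original question is about.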
Re: Schema questions for data structures with recently-modified access patterns
Keep the original document base table, but then the query table should have the PK as last_modified, docId, with last_modified descending, so that a query can get the n most recently modified documents.

Yes, you still need to manually delete the old entry for the document in the query table if duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a month or a week ago.

-- Jack Krupansky

On Tue, Jul 21, 2015 at 11:13 AM, Carlos Alonso i...@mrcalonso.com wrote:

Hi Robert,

What about modelling it as a time series?

    CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMESTAMP,
        PRIMARY KEY (docId, last_modified)
    ) WITH CLUSTERING ORDER BY (last_modified DESC);

This way, the latest modification will always be the first record in the row, so accessing it should be as easy as:

    SELECT * FROM document WHERE docId = <the docId> LIMIT 1;

And if you experience disk space issues due to very long rows, you can always expire old ones using a TTL or in a batch job. Tombstones will never be a problem in this case because, due to the specified clustering order, the latest modification will always be the first record in the row.

Hope it helps.

Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso
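
The bookkeeping Jack describes (a base table plus a query table ordered by last_modified, with the old entry deleted by hand on each update) can be sketched in Python as follows. This is an in-memory illustration under stated assumptions: the class, its attributes, and integer timestamps are all hypothetical, and a single writer is assumed.

```python
class RecentlyModifiedIndex:
    """In-memory stand-in for a base table plus a query table (hypothetical)."""

    def __init__(self):
        self.document = {}             # doc_id -> (doc, last_modified)
        self.by_last_modified = set()  # rows of (last_modified, doc_id)

    def update(self, doc_id, doc, now):
        previous = self.document.get(doc_id)
        self.document[doc_id] = (doc, now)
        if previous is not None:
            # Manually delete the old entry so the document appears only once.
            self.by_last_modified.discard((previous[1], doc_id))
        self.by_last_modified.add((now, doc_id))

    def most_recent(self, n):
        # Newest first, like a query table clustered by last_modified DESC.
        return [doc_id for _, doc_id in sorted(self.by_last_modified, reverse=True)[:n]]


index = RecentlyModifiedIndex()
index.update("doc-a", "first draft", 1)
index.update("doc-b", "other document", 2)
index.update("doc-a", "second draft", 3)
```

With one writer, each document ends up with exactly one entry in the query table; the sketch says nothing about what happens when two writers interleave.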
Re: Schema questions for data structures with recently-modified access patterns
If last_modified is a clustering column, it needs a partitioning column, which is what date is for (although I should have named it day, and I also forgot to add the CLUSTERING ORDER BY (last_modified DESC) clause). This is essentially what I came up with. I still don’t like how easy it is to get duplicates.

On Jul 21, 2015, at 9:31 AM, Jack Krupansky jack.krupan...@gmail.com wrote:

Keep the original document base table, but then the query table should have the PK as last_modified, docId, with last_modified descending, so that a query can get the n most recently modified documents.

Yes, you still need to manually delete the old entry for the document in the query table if duplicates are a problem for you.

Yeah, a TTL would be good if you don't care about documents modified a month or a week ago.

-- Jack Krupansky
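
To make the duplicate problem concrete, here is a small Python sketch of the interleaving that causes it. It is an in-memory model only: the dict and set stand in for the `document` and `doc_by_last_modified` tables, the helper names are hypothetical, and integer timestamps replace TIMEUUIDs.

```python
document = {"doc1": 100}            # doc_id -> last_modified
by_last_modified = {(100, "doc1")}  # rows of (last_modified, doc_id)


def read_last_modified(doc_id):
    """Read the timestamp that the writer will later delete from the index."""
    return document[doc_id]


def apply_update(doc_id, previous_ts, new_ts):
    """Write the new timestamp, add the new index row, delete the old one."""
    document[doc_id] = new_ts
    by_last_modified.add((new_ts, doc_id))
    by_last_modified.discard((previous_ts, doc_id))


# Both writers read before either one writes, so both see timestamp 100.
seen_by_a = read_last_modified("doc1")
seen_by_b = read_last_modified("doc1")

apply_update("doc1", seen_by_a, 101)  # removes (100, doc1), adds (101, doc1)
apply_update("doc1", seen_by_b, 102)  # tries to remove (100, doc1) again, adds (102, doc1)

entries_for_doc1 = sorted(ts for ts, doc_id in by_last_modified if doc_id == "doc1")
print(entries_for_doc1)  # [101, 102]
```

The second writer deletes a row that is already gone, so the row written by the first writer is never cleaned up and the document appears twice.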
Re: Schema questions for data structures with recently-modified access patterns
I'm relatively new to data modeling in Cassandra, but perhaps instead of date and last_modified in your primary key for doc_by_last_modified, just use the docId. This way, you can update the last_modified and date fields against the docId. That removes the duplicate issue and obviates the need to delete the current row or add a new one: you'd simply be updating (upserting?) by the docId.

Regards,
Victor

On Mon, Jul 20, 2015 at 11:59 PM, Robert Wille rwi...@fold3.com wrote:

Data structures that have a recently-modified access pattern seem to be a poor fit for Cassandra. I’m wondering if any of you smart guys can provide suggestions.

For the sake of discussion, let’s assume I have the following tables:

    CREATE TABLE document (
        docId UUID,
        doc TEXT,
        last_modified TIMEUUID,
        PRIMARY KEY ((docId))
    );

    CREATE TABLE doc_by_last_modified (
        date TEXT,
        last_modified TIMEUUID,
        docId UUID,
        PRIMARY KEY ((date), last_modified)
    );

When I update a document, I retrieve its last_modified time, delete the current record from doc_by_last_modified, and add a new one. Unfortunately, if you’d like each document to appear at most once in the doc_by_last_modified table, then this doesn’t work so well. Documents can get into the doc_by_last_modified table multiple times if there is concurrent access, or if there is a consistency issue.

Any thoughts out there on how to efficiently provide recently-modified access to a table? This problem exists for many types of data structures, not just recently-modified. Any ordered data structure that can be dynamically reordered suffers from the same problems. As I’ve been doing schema design, this pattern keeps recurring. A nice way to address this problem has lots of applications.

Thanks in advance for your thoughts

Robert
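
Victor's upsert-by-docId idea can be sketched in Python with a plain dict standing in for the table. The names below are hypothetical. The `most_recent` helper is not part of the suggestion; it is included only to show an assumption of this sketch, namely that a table keyed by docId alone carries no ordering by last_modified, so the sort happens on the reader's side.

```python
last_modified_by_doc = {}  # doc_id -> (date, last_modified)


def touch(doc_id, date, last_modified):
    """Upsert: writing the same doc_id again overwrites the previous row."""
    last_modified_by_doc[doc_id] = (date, last_modified)


def most_recent(n):
    """Hypothetical reader-side ordering: sort every row by last_modified."""
    ranked = sorted(
        last_modified_by_doc.items(),
        key=lambda item: item[1][1],
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:n]]


touch("doc-a", "2015-07-20", 1)
touch("doc-b", "2015-07-20", 2)
touch("doc-a", "2015-07-21", 3)  # same key, so no second row for doc-a
```

Each document has exactly one row no matter how writes interleave, at the cost of having to read and sort all rows to list the most recently modified ones.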