There's a Cassandra example project called twissandra [1], a Twitter clone built on Cassandra. The activity feed is time series data, which is exactly what twissandra models.
Activities are added to a feed for each appropriate user. Anonymous and everyone are treated as special users (anon = public, everyone = logged in users).

If you have a column family (ActivityFeedTimeLine) with a row key of the userID, you would add each activity to the row of every user that should see it (1 write per 'follower'). The columns here should have a compound key where the first component is the timestamp of the activity, the second is the activityID and the third is the column's name (title, etc). I pick activityID instead of contentID because of the super rare chance that two activities happen on the same piece of content on multiple nodes at the same time.

* [<timestamp>, <activityID>, <columnName>] = columnValue

This sorts the columns in time order, so reads are a sequential scan (start here, read until here; return) when looking for "what happened in the last X time" or "what happened between X and Y".

To get the last Y activities, you can ask Cassandra for the last Z columns of the user's row (a reversed slice, so the newest come first), where Z = Y * (num columns stored per activity). So if we store 3 fields and you want the last 10 activities, ask for 30 columns. This is another sequential read for Cassandra, which means it's screaming fast.

Deleting activities from a feed when some content's visibility is reduced/changed can be handled by adding another CF, ActivityFeedByContentID, so that looking up the data by content is still a sequential, deterministic read for Cassandra. This CF's row key is still the userID but the columns are composed differently:

* [<contentID>, <timestamp>, <activityID>] = null

Using this structure, we can look up data using the userID and slice the columns by contentID:*:* to get the activities a user has seen for a piece of content. This gives us the timestamps & activityIDs needed to delete the content-specific events from ActivityFeedTimeLine, using the same row key and a slice of timestamp:activityID:*.
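To make the two column layouts concrete, here is a rough in-memory sketch in Python. It is not a Cassandra client and does not use any driver API: sorted Python lists stand in for the sorted columns of a row, and the class name FeedStore, its method names, the field names and the integer timestamps are all made up for illustration.

```python
import bisect
from collections import defaultdict


class FeedStore:
    """In-memory stand-in for the two CFs; NOT a Cassandra client (hypothetical)."""

    def __init__(self):
        # ActivityFeedTimeLine: userID -> sorted [((timestamp, activityID, columnName), value)]
        self.timeline = defaultdict(list)
        # ActivityFeedByContentID: userID -> sorted [(contentID, timestamp, activityID)]
        self.by_content = defaultdict(list)

    def add_activity(self, user_ids, ts, activity_id, content_id, fields):
        """One write per 'follower': store every field as its own column."""
        for uid in user_ids:
            for name, value in fields.items():
                bisect.insort(self.timeline[uid], ((ts, activity_id, name), value))
            bisect.insort(self.by_content[uid], (content_id, ts, activity_id))

    @staticmethod
    def _group(columns):
        """Fold consecutive columns back into one dict per (timestamp, activityID)."""
        activities = {}
        for (ts, aid, name), value in columns:
            activities.setdefault((ts, aid), {})[name] = value
        return list(activities.items())

    def last_activities(self, uid, count, fields_per_activity):
        """Last Y activities = last Z columns, Z = Y * fields per activity (newest first)."""
        z = count * fields_per_activity
        if z <= 0:
            return []
        return self._group(reversed(self.timeline[uid][-z:]))

    def between(self, uid, start_ts, end_ts):
        """Contiguous slice of columns with start_ts <= timestamp <= end_ts.

        Assumes integer timestamps so that end_ts + 1 is the exclusive bound.
        """
        row = self.timeline[uid]
        lo = bisect.bisect_left(row, ((start_ts,),))
        hi = bisect.bisect_left(row, ((end_ts + 1,),))
        return self._group(row[lo:hi])

    def remove_content(self, uid, content_id):
        """Slice the index by contentID:*:*, then delete timestamp:activityID:* columns."""
        index = self.by_content[uid]
        lo = bisect.bisect_left(index, (content_id,))
        hi = lo
        while hi < len(index) and index[hi][0] == content_id:
            hi += 1
        row = self.timeline[uid]
        for _, ts, aid in index[lo:hi]:
            start = bisect.bisect_left(row, ((ts, aid),))
            end = start
            while end < len(row) and row[end][0][:2] == (ts, aid):
                end += 1
            del row[start:end]
        del index[lo:hi]
```

Every read here is a contiguous slice of one sorted row, which is the property the layout above is after. In a real cluster the comparator, slice calls and deletes would come from whatever client library is in use, so treat this only as a picture of the access pattern.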
An interesting cleanup bit in Cassandra is that this data can be set to expire if the activity feeds should be automatically culled.

That should cover seeing new activities and removing activities when content becomes invisible to someone. It does not cover backfilling a timeline when content becomes visible to someone. That would require another CF to store the activities produced for each content item. We could maintain the last X activities for a content item, where X is the most a person will see in their feed, assuming the worst case that someone has to fill their feed from a single content item.

One thing to consider here is that this would require knowing how many columns are stored for each content item, which means touching every column in the row to prune down to the last X columns. This could involve multiple files on a fragmented node, but the number would hopefully be small enough not to bother our setup. Those are implementation details, though.

Let me know if that sounds wonky or creates more questions than were answered.

Carl

[1] https://github.com/twissandra/twissandra/

On Thu, Jul 19, 2012 at 7:25 AM, Chris Tweney <[email protected]> wrote:

> That would help a bit, as any caching we add can improve things. This
> particular feed is a good candidate for static caching, which is very easy
> to add. Anonymous and logged-in users should access it through a different
> URL though, since they get different results. This could be as simple as an
> arbitrary appendage to the URI (?anon=true vs ?anon=false). Also, the
> front-page "Recent activity" widget should probably change so it doesn't
> poll the server every 15 seconds.
>
> This doesn't leave Cassandra off the hook though... still want to see
> those answers! :)
>
> -chris
>
> On Thu, 19 Jul 2012 11:29:06 +0200, Nicolaas Matthijs
> <[email protected]> wrote:
> > I'm afraid I won't be able to help with the Cassandra question.
> >
> > However, I did want to point out that there is no need for these
> > results and queries to be real-time. For this particular feed,
> > anonymous users only need to see things that are public, and logged in
> > users only need to see things that are public or visible to logged in
> > users. I think that means that the results can be published and
> > cached, and re-published from time to time.
> >
> > Not sure if this changes anything, but maybe it can be helpful.
> >
> > Kind regards,
> > Nicolaas
> >
> >
> > On 19 Jul 2012, at 01:02, Chris Tweney wrote:
> >
> >> I just submitted a pull request for KERN-3036 [1] that seemed way harder
> >> than it should be thanks to our Solr-plus-KV architecture. I'd like to
> >> challenge those who know a lot about Cassandra to see if that platform
> >> would make it easier to fix this issue.
> >>
> >> Capsule version: To support /var/search/activity/all.json, we need a
> >> query that returns all the activities that point to nodes to which I
> >> have read permission. The read permission inheres in the nodes
> >> themselves, not in the activities (which just inherit their parents'
> >> permissions).
> >>
> >> In a relational model it's an easy problem: Just a three-table join from
> >> Activity to Node to Permission.
> >>
> >> In Solr, the fix was less straightforward. Solr joins (documented at
> >> [2]) are not fully equivalent to SQL joins. In particular, they're more
> >> like inner queries than like true SQL joins. Also, joins between two
> >> different types of documents must be made on fields that the two
> >> document types do NOT share. In other words the foreign key has to have
> >> a different name from the key it points to. These constraints made it
> >> tricky, but I did finally find the right Solr syntax for this search:
> >>
> >> q=resourceType[* TO *]&fq={!join from=path to=activitysource}(readers:anonymous OR readers:everyone)
> >>
> >> Does Cassandra make this case any easier? If so, show me how!
> >>
> >> -chris
> >>
> >> [1] https://jira.sakaiproject.org/browse/KERN-3036
> >>
> >> [2] http://wiki.apache.org/solr/Join
> >> _______________________________________________
> >> oae-dev mailing list
> >> [email protected]
> >> http://collab.sakaiproject.org/mailman/listinfo/oae-dev
