Re: [oae-dev] Hard case for searches... can Cassandra make it easier?

Carl Hall Mon, 23 Jul 2012 01:01:28 -0700

There's a Cassandra example project called twissandra[1], a Twitter clone
built on Cassandra. The activity feed is time series data, which is exactly
what twissandra models.

Activities are added to a feed for each appropriate user. Anonymous and
everyone are treated as special users (anon = public, everyone = logged-in
users).

If you have a column family (ActivityFeedTimeLine) with a row key of the
userID, you would add each activity to the row of every user who should see
it (1 write per 'follower'). The columns here should have a compound key
where the first component is the timestamp of the activity, the second is
the activityID and the third is the column's name (title, etc). I pick
activityID instead of contentID because of the super rare chance that
activities happen on the same piece of content on multiple nodes at the same
time.
* [<timestamp>, <activityID>, <columnName>] = columnValue
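
A minimal sketch of that layout, using plain Python dicts as a stand-in for
the column family rather than a real Cassandra client. The function names,
user names and field names are made up for illustration.

```python
from collections import defaultdict

# In-memory stand-in for the ActivityFeedTimeLine column family:
# row key (userID) -> {(timestamp, activityID, columnName): columnValue}
activity_feed_timeline = defaultdict(dict)


def add_activity(followers, timestamp, activity_id, fields):
    """Fan out one activity: one write per follower row, one column per field."""
    for user_id in followers:
        row = activity_feed_timeline[user_id]
        for column_name, value in fields.items():
            row[(timestamp, activity_id, column_name)] = value


def sorted_columns(user_id):
    """Return a row's columns ordered by (timestamp, activityID, columnName)."""
    return sorted(activity_feed_timeline[user_id].items())


# "anon" is the special public user; "alice" is a regular follower.
add_activity(["alice", "anon"], 1000, "act-1",
             {"title": "Doc created", "actor": "bob"})
add_activity(["alice"], 1005, "act-2",
             {"title": "Doc shared", "actor": "bob"})
```

Sorting the tuple keys in Python mimics what a composite comparator would do
in Cassandra: columns for the same activity end up adjacent, and activities
are ordered by time.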

This sorts the columns by order of time so reads are a sequential scan
(start here, read until here; return) when looking for "what happened in
the last X time" or "what happened between X and Y".
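
To illustrate the "start here, read until here" idea, here is a small sketch
over a sorted list of columns. It assumes integer timestamps and uses
Python's bisect module in place of a Cassandra column slice; slice_between
is a hypothetical helper name.

```python
import bisect

# Columns of one user's row, sorted by (timestamp, activityID, columnName).
columns = sorted([
    ((1000, "act-1", "actor"), "bob"),
    ((1000, "act-1", "title"), "Doc created"),
    ((1005, "act-2", "actor"), "carol"),
    ((1005, "act-2", "title"), "Doc shared"),
    ((1010, "act-3", "actor"), "dave"),
    ((1010, "act-3", "title"), "Doc deleted"),
])
keys = [key for key, _ in columns]


def slice_between(start_ts, end_ts):
    """Return columns with start_ts <= timestamp <= end_ts as one contiguous run."""
    # A one-element tuple sorts before every full key with the same timestamp,
    # so it works as a slice boundary (assumes integer timestamps).
    lo = bisect.bisect_left(keys, (start_ts,))
    hi = bisect.bisect_left(keys, (end_ts + 1,))
    return columns[lo:hi]


recent = slice_between(1005, 1010)  # columns for act-2 and act-3
```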

To get the last Y activities, you can ask Cassandra for Z columns from a CF,
where Z = Y * (number of columns stored per activity). So if we store 3
fields and you want the last 10 activities, ask for 30 columns. This is
another sequential read for Cassandra, which means it's screaming fast.
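
The column arithmetic can be sketched like this. It assumes every activity
stores exactly the same number of columns (3 here), and it reads the tail of
a sorted Python list where Cassandra would do a reversed slice; make_row and
last_activities are hypothetical names.

```python
FIELDS_PER_ACTIVITY = 3  # assumption: every activity stores exactly 3 columns


def make_row(num_activities):
    """Build a sorted sample row with 3 columns per activity."""
    cols = []
    for i in range(num_activities):
        for name in ("actor", "title", "verb"):
            cols.append(((1000 + i, f"act-{i}", name), f"{name}-{i}"))
    return sorted(cols)


def last_activities(sorted_cols, y, fields_per_activity=FIELDS_PER_ACTIVITY):
    """Return the newest y activities, newest first, grouped by activity."""
    z = y * fields_per_activity  # Z = Y * (columns stored per activity)
    if z <= 0:
        return {}
    newest = sorted_cols[-z:]  # one contiguous read from the end of the row
    activities = {}
    for (timestamp, activity_id, column_name), value in reversed(newest):
        activities.setdefault((timestamp, activity_id), {})[column_name] = value
    return activities


row = make_row(4)
latest = last_activities(row, 2)  # reads 6 columns, yields 2 activities
```

If activities can store a varying number of columns, the Z = Y * N shortcut
no longer lines up with activity boundaries, so a fixed field count is worth
keeping.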

Deleting activities from a feed when some content's visibility is
reduced/changed can be handled by adding another CF,
ActivityFeedByContentID, so we can make sure looking up the content
differently is still a sequential, deterministic read for Cassandra. This
CF's row key is still the userID but the columns are composed differently:
* [<contentID>, <timestamp>, <activityID>] = null

Using this structure, we can look up data using the userID and slice the
columns by contentID:*:* to get the activities a user has seen for a piece
of content. This gives us the timestamps & activityIDs needed to delete the
content-specific events from ActivityFeedTimeLine using the same row key
and a slice of timestamp:activityID:*.
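
A rough sketch of that two-step delete for a single user's row, again with
dicts standing in for the two CFs. The sample data and the remove_content
name are made up, and removing the index entries along with the timeline
columns is my assumption about the cleanup.

```python
import bisect

# One user's row in ActivityFeedTimeLine:
# (timestamp, activityID, columnName) -> columnValue
timeline = {
    (1000, "act-1", "title"): "Doc A created",
    (1000, "act-1", "actor"): "bob",
    (1005, "act-2", "title"): "Doc B created",
    (1005, "act-2", "actor"): "bob",
    (1010, "act-3", "title"): "Doc A shared",
    (1010, "act-3", "actor"): "carol",
}
# The same user's row in ActivityFeedByContentID:
# (contentID, timestamp, activityID) -> null
by_content = {
    ("doc-A", 1000, "act-1"): None,
    ("doc-B", 1005, "act-2"): None,
    ("doc-A", 1010, "act-3"): None,
}


def remove_content(content_id):
    """Delete every activity about content_id from this user's feed."""
    # Step 1: slice contentID:*:* out of the index row (contiguous when sorted).
    index_keys = sorted(by_content)
    start = bisect.bisect_left(index_keys, (content_id,))
    hits = []
    for key in index_keys[start:]:
        if key[0] != content_id:
            break
        hits.append(key)
    # Step 2: delete timestamp:activityID:* from the timeline row.
    for _, timestamp, activity_id in hits:
        for col in [k for k in timeline if k[:2] == (timestamp, activity_id)]:
            del timeline[col]
        # Assumption: the index entry is removed as well.
        del by_content[(content_id, timestamp, activity_id)]
    return len(hits)


removed = remove_content("doc-A")
```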

An interesting cleanup bit in Cassandra is that this data can be written
with a TTL so it expires on its own if the activity feeds should be
automatically culled.

That should cover seeing new activities and removing activities when
content becomes invisible to someone. It does not cover backfilling a
timeline when content becomes visible to someone. That would require
another CF to store the activities produced for each content item. We could
maintain the last X activities for a content item where X is the most a
person will see in their feed assuming worst case that someone has to fill
their feed using one content item. A strong point to consider here is that
this would require knowing how many columns are stored for each content
item which means touching every column in the row to prune for the last X
columns. This could involve multiple files for a fragmented node but would
hopefully be a small enough number to not bother our setup. Implementation
details.
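
For the backfill idea, here is a sketch of a per-content row capped at the
last X activities. The cap value, the dict standing in for the CF, and the
function names are all assumptions for illustration; pruning on every write
is just one possible policy.

```python
import bisect
from collections import defaultdict

MAX_PER_CONTENT = 3  # X: assumed maximum number of activities a feed shows

# Stand-in for the extra CF: contentID -> sorted [(timestamp, activityID)]
activities_by_content = defaultdict(list)


def record_activity(content_id, timestamp, activity_id, limit=MAX_PER_CONTENT):
    """Store an activity for a content item, keeping only the newest `limit`."""
    row = activities_by_content[content_id]
    bisect.insort(row, (timestamp, activity_id))
    # Pruning needs the row length, which is the "touch every column" cost.
    # The slice is empty when the row holds `limit` entries or fewer.
    del row[:-limit]


def backfill(content_id):
    """Return the retained activities to copy into a newly permitted user's feed."""
    return list(activities_by_content.get(content_id, []))


for i in range(5):
    record_activity("doc-A", 1000 + i, f"act-{i}")
```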

Let me know if that sounds wonky or creates more questions than were
answered.
Carl

1 https://github.com/twissandra/twissandra/

On Thu, Jul 19, 2012 at 7:25 AM, Chris Tweney <[email protected]> wrote:

> That would help a bit, as any caching we add can improve things. This
> particular feed is a good candidate for static caching, which is very easy
> to add. Anonymous and logged-in users should access it through a different
> URL though, since they get different results. This could be as simple as an
> arbitrary appendage to the URI (?anon=true vs ?anon=false). Also, the
> front-page "Recent activity" widget should probably change so it doesn't
> poll the server every 15 seconds.
>
> This doesn't leave Cassandra off the hook though... still want to see
> those answers! :)
>
> -chris
>
> On Thu, 19 Jul 2012 11:29:06 +0200, Nicolaas Matthijs
> <[email protected]> wrote:
> > I'm afraid I won't be able to help with the Cassandra question.
> >
> > However, I did want to point out that there is no need for these
> > results and queries to be real-time. For this particular feed,
> > anonymous users only need to see things that are public, and logged in
> > users only need to see things that are public or visible to logged in
> > users. I think that means that the results can be published and
> > cached, and re-published from time to time.
> >
> > Not sure if this changes anything, but maybe it can be helpful.
> >
> > Kind regards,
> > Nicolaas
> >
> >
> > On 19 Jul 2012, at 01:02, Chris Tweney wrote:
> >
> >> I just submitted a pull request for KERN-3036 [1] that seemed way
> >> harder than it should be thanks to our Solr-plus-KV architecture. I'd
> >> like to challenge those who know a lot about Cassandra to see if that
> >> platform would make it easier to fix this issue.
> >>
> >> Capsule version: To support /var/search/activity/all.json, we need a
> >> query that returns all the activities that point to nodes to which I
> >> have read permission. The read permission inheres in the nodes
> >> themselves, not in the activities (which just inherit their parents'
> >> permissions).
> >>
> >> In a relational model it's an easy problem: Just a three-table join
> >> from Activity to Node to Permission.
> >>
> >> In Solr, the fix was less straightforward. Solr joins (documented at
> >> [2]) are not fully equivalent to SQL joins. In particular, they're
> >> more like inner queries than like true SQL joins. Also, joins between
> >> two different types of documents must be made on fields that the two
> >> document types do NOT share. In other words the foreign key has to
> >> have a different name from the key it points to. These constraints
> >> made it tricky, but I did finally find the right Solr syntax for this
> >> search:
> >>
> >> q=resourceType[* TO *]&fq={!join from=path
> >> to=activitysource}(readers:anonymous OR readers:everyone)
> >>
> >> Does Cassandra make this case any easier? If so, show me how!
> >>
> >> -chris
> >>
> >> [1] https://jira.sakaiproject.org/browse/KERN-3036
> >>
> >> [2] http://wiki.apache.org/solr/Join
> >> _______________________________________________
> >> oae-dev mailing list
> >> [email protected]
> >> http://collab.sakaiproject.org/mailman/listinfo/oae-dev

