Hi all,
When creating a social network type application the feed is king, the most common query that will be run against your service will be a request to get the feed. There are of course many ways to model the feed and a couple of components that are required. To do a very simple feed one might structure the data as follows: Users follow other users (n:USER)-[:FOLLOWS]->(m:USER) http://neo4j-console-20.herokuapp.com/?id=xp1hea Users can then post stuff (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()}) http://neo4j-console-20.herokuapp.com/?id=4p5aui The feed query then becomes: MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) WHERE n.id = 1 RETURN m, p ORDER BY p.time DESC http://neo4j-console-20.herokuapp.com/?id=msje6d All this looks well and good, but there will be some problems. The feed is generally returned in reverse date order. When browsing twitter for example we see a list of tweets with the most recent tweet at the top. To load the whole feed for a user would be ridiculous so what happens is the top 20-50 tweets are loaded and when you scroll down more are loaded as you need them. In order to achieve this with the data structure outlined above one would have to structure the queries as follows. I'm going to return 2 at a time as I don't have that many in the db) MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) WHERE n.id = 1 RETURN m, p ORDER BY p.time DESC LIMIT 2 http://neo4j-console-20.herokuapp.com/r/ggclr In order to get the next 2 after this we have to match where time is greater than the last post returned by the previous query MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the last one returned RETURN m, p ORDER BY p.time DESC LIMIT 2 http://neo4j-console-20.herokuapp.com/?id=ggclr So there we have a basic feed, it has a few issues however. The order by clause is never cheap. I don't know the internals and how well they are optimised but could imagine that in the worst case scenario the query would have to fetch every single post, order the lot of them and then return a subset. As the feed query will be the most important query we will probably need a more efficient way to get this information. Enter the linked list. Instead of each post only being associated with the user who made the post we will create a linked list holding the feed for every user. Initially structured like so: (n:USER)-[:FEED]->(n) http://neo4j-console-20.herokuapp.com/r/sa5kxm Whenever a user posts we find all of the users followers and add the post into each of their feeds. In other words we do the most work doing the insertion. The insertion is expensive, the feed request should be inexpensive. So when a post is added we add an item to each of the linked lists of the users that are following the poster. Confused? Heres the code. Inserting a new post gets a little more complicated MATCH (n:USER) WHERE n.id = 4 CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the post' }) WITH n, p MATCH followers-[:FOLLOWS]->n WITH followers, n, p MATCH followers-[oldFeed:FEED]->after CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p CREATE followers-[:FEED]->feedItem-[:FEED]->after DELETE oldFeed Here we have the linked list insertion method when inserting at the head of the list. First we match the head node MATCH followers-[oldFeed:FEED]->after Then we insert a new node CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after Then we delete the old link DELETE oldFeed The reason the structure does not look like this (n:USER)-[:FEED]->(p:POST)-[:FEED]->after Is that the post will appear in multiple users feeds so we need a pseudo feed item that is specific to each user otherwise when we query the feed the traversal will end up getting mixed through a heap of different users feeds. Here is a graph with a few posts inserted: http://neo4j-console-20.herokuapp.com/?id=otayz5 So now we come to the all important feed query, it ends up looking like this MATCH (n:USER) WHERE n.id = 1 MATCH n-[:FEED*]->(feedItem) WHERE feedItem <> n WITH feedItem MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster RETURN post, poster http://neo4j-console-20.herokuapp.com/r/rnne41 Now if we want to limit the number of items that are returned by the query, we limit the number of relationships we follow like so (if we want the top 3 items) MATCH (n:USER) WHERE n.id = 1 MATCH n-[:FEED*0..3]->(feedItem) WHERE feedItem <> n WITH feedItem MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster RETURN post, poster http://neo4j-console-20.herokuapp.com/r/gfu2la This system works a lot better, the feed items are already ordered, so we do not need the ORDER BY clause, the feed contains the users own posts so they are included as well. And because the query only follows one straight list of relations one would assume it was faster. Enter the demon in the corner. In order to have Neo4j (or any other storage mechanism) add entries to a linked list you need to take a lock on the node you are adding the next linked node to. If you do not and you get two inserts happening at the same time your linked list will branch with two nodes coming off the one node. I have done pretty extensive testing on this, the posts (as well as the findings) relating to this were asked on this mailing list if you are interested. Taking a lock is possible, the queries do get a lot more complicated, but I am worried about scalability. If a user has 200 friends I will have to take locks on their feeds. That means that if you have two friends that are posting items that should end up in your feed at the same time that a deadlock is going to occur. In certain networks (like twitter) some users have 100,000 followers. In a scenario such as this I would have to take a lock on 100,000 nodes. This means that while the user that has 100,000 friends is posting no one else that is friends with any of those 100,000 people will be able to post without getting a deadlock. This does not seem to be scalable. So should I go with the very simple approach where I have to do an order-by and posts are inserted like so? (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()}) This has a more expensive feed query but I don't have to worry about locks and linked lists and such. the queries are a lot simpler. What do you people think? -- You received this message because you are subscribed to the Google Groups "Neo4j" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
