I would love if you could convert your email just 1:1 into a graph gist for documentation and discussion.
See: gist.neo4j.org Then you don't have to link consoles. Cheers, Michael On Sun, Jun 29, 2014 at 12:01 AM, Aran Mulholland <[email protected]> wrote: > Hi all, > > > When creating a social network type application the feed is king, the most > common query that will be run against your service will be a request to get > the feed. There are of course many ways to model the feed and a couple of > components that are required. > > To do a very simple feed one might structure the data as follows: > > Users follow other users > > (n:USER)-[:FOLLOWS]->(m:USER) > > http://neo4j-console-20.herokuapp.com/?id=xp1hea > > > Users can then post stuff > > (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()}) > > http://neo4j-console-20.herokuapp.com/?id=4p5aui > > > The feed query then becomes: > > MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) > WHERE n.id = 1 > RETURN m, p > ORDER BY p.time DESC > > http://neo4j-console-20.herokuapp.com/?id=msje6d > > All this looks well and good, but there will be some problems. The feed is > generally returned in reverse date order. When browsing twitter for example > we see a list of tweets with the most recent tweet at the top. To load the > whole feed for a user would be ridiculous so what happens is the top 20-50 > tweets are loaded and when you scroll down more are loaded as you need > them. In order to achieve this with the data structure outlined above one > would have to structure the queries as follows. I'm going to return 2 at a > time as I don't have that many in the db) > > MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) > WHERE n.id = 1 > RETURN m, p > ORDER BY p.time DESC > LIMIT 2 > > http://neo4j-console-20.herokuapp.com/r/ggclr > > In order to get the next 2 after this we have to match where time is > greater than the last post returned by the previous query > > MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) > WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the last > one returned > RETURN m, p > ORDER BY p.time DESC > LIMIT 2 > > http://neo4j-console-20.herokuapp.com/?id=ggclr > > So there we have a basic feed, it has a few issues however. The order by > clause is never cheap. I don't know the internals and how well they are > optimised but could imagine that in the worst case scenario the query would > have to fetch every single post, order the lot of them and then return a > subset. As the feed query will be the most important query we will probably > need a more efficient way to get this information. Enter the linked list. > > Instead of each post only being associated with the user who made the post > we will create a linked list holding the feed for every user. Initially > structured like so: > > (n:USER)-[:FEED]->(n) > > http://neo4j-console-20.herokuapp.com/r/sa5kxm > > Whenever a user posts we find all of the users followers and add the post > into each of their feeds. In other words we do the most work doing the > insertion. The insertion is expensive, the feed request should be > inexpensive. So when a post is added we add an item to each of the linked > lists of the users that are following the poster. Confused? Heres the code. > > Inserting a new post gets a little more complicated > > > MATCH (n:USER) > WHERE n.id = 4 > CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the > post' }) > WITH n, p > MATCH followers-[:FOLLOWS]->n > WITH followers, n, p > MATCH followers-[oldFeed:FEED]->after > CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p > CREATE followers-[:FEED]->feedItem-[:FEED]->after > DELETE oldFeed > > > Here we have the linked list insertion method when inserting at the head > of the list. > > First we match the head node > > MATCH followers-[oldFeed:FEED]->after > > Then we insert a new node > > CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after > > Then we delete the old link > > DELETE oldFeed > > > The reason the structure does not look like this > > (n:USER)-[:FEED]->(p:POST)-[:FEED]->after > > Is that the post will appear in multiple users feeds so we need a pseudo > feed item that is specific to each user otherwise when we query the feed > the traversal will end up getting mixed through a heap of different users > feeds. > > > Here is a graph with a few posts inserted: > > http://neo4j-console-20.herokuapp.com/?id=otayz5 > > So now we come to the all important feed query, it ends up looking like > this > > > MATCH (n:USER) > WHERE n.id = 1 > MATCH n-[:FEED*]->(feedItem) > WHERE feedItem <> n > WITH feedItem > MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster > RETURN post, poster > > http://neo4j-console-20.herokuapp.com/r/rnne41 > > Now if we want to limit the number of items that are returned by the > query, we limit the number of relationships we follow like so (if we want > the top 3 items) > > MATCH (n:USER) > WHERE n.id = 1 > MATCH n-[:FEED*0..3]->(feedItem) > WHERE feedItem <> n > WITH feedItem > MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster > RETURN post, poster > > http://neo4j-console-20.herokuapp.com/r/gfu2la > > This system works a lot better, the feed items are already ordered, so we > do not need the ORDER BY clause, the feed contains the users own posts so > they are included as well. And because the query only follows one straight > list of relations one would assume it was faster. > > Enter the demon in the corner. In order to have Neo4j (or any other > storage mechanism) add entries to a linked list you need to take a lock on > the node you are adding the next linked node to. If you do not and you get > two inserts happening at the same time your linked list will branch with > two nodes coming off the one node. I have done pretty extensive testing on > this, the posts (as well as the findings) relating to this were asked on > this mailing list if you are interested. > > Taking a lock is possible, the queries do get a lot more complicated, but > I am worried about scalability. If a user has 200 friends I will have to > take locks on their feeds. That means that if you have two friends that are > posting items that should end up in your feed at the same time that a > deadlock is going to occur. In certain networks (like twitter) some users > have 100,000 followers. In a scenario such as this I would have to take a > lock on 100,000 nodes. This means that while the user that has 100,000 > friends is posting no one else that is friends with any of those 100,000 > people will be able to post without getting a deadlock. This does not seem > to be scalable. > > So should I go with the very simple approach where I have to do an > order-by and posts are inserted like so? > > (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()}) > > This has a more expensive feed query but I don't have to worry about locks > and linked lists and such. the queries are a lot simpler. > > What do you people think? > > -- > You received this message because you are subscribed to the Google Groups > "Neo4j" group. > To unsubscribe from this group and stop receiving emails from it, send an > email to [email protected]. > For more options, visit https://groups.google.com/d/optout. > -- You received this message because you are subscribed to the Google Groups "Neo4j" group. To unsubscribe from this group and stop receiving emails from it, send an email to [email protected]. For more options, visit https://groups.google.com/d/optout.
