[Neo4j] Re: Social network design

Aseem Kishore Sun, 29 Jun 2014 21:16:23 -0700

Fantastic post.

It's rare to see such thorough and fully thought-out posts in the public 
IMHO. A lot of people are doing very fine work with Neo4j and Cypher; the 
really robust stuff just doesn't tend to see the light of the day, I've 
found.


E.g. You implement pagination robustly, filtering (WHERE) by time. And you 
astutely discovered that you need to have unique nodes in everyone's feed, 
otherwise you'll get mixed feed paths. (You can add an "owner ID" type of 
property to the relationships, but that's less efficient than just using 
separate nodes. You could also index those relationships via those 
properties, but that's also a bigger headache.)

Nice work again. So good to have something this thorough and accurate and 
*correct* publicly documented.

I heartily echo Michael's suggestion to make this into a proper post 
somewhere, whether as a Graph Gist or your own blog post or something else. 
I'd definitely love to point people to it! (And Google Groups just doesn't 
do this justice.)

—

To your actual q, I sort of agree with your identification of the problem — 
concurrent writes to the same users' linked lists — but I'm really 
surprised by your reported symptom — that your linked lists start branching 
/ get two heads. I'm stunned. Have you actually seen that? Or are you just 
speculating?

In my own experience, Neo4j throws a deadlock error in the case of these 
concurrent writes. The linked lists don't branch, because every Cypher 
query is an atomic transaction.

It's still a real pain to me that Neo4j even throws a deadlock error 
though. I want Neo4j to be smart enough to simply serialize those writes. 
(IOW, I don't understand why a deadlock needs to happen. Can't you e.g. 
order the locks by node ID? Of course, I'm ignorant of the implementation 
details here.) In practice, the workaround is to simply retry those writes. 
But that's shifting the burden from Neo4j to the caller.

However, I haven't investigated/tested this deeply myself. We actually 
moved away from linked lists to simplify our own code and data model. While 
ORDER BY definitely doesn't scale well, we're going to attack that in 
practice via external caching for now.

—

A couple of minor corrections/improvements to your queries:

- You say the feed contains the user's own posts, but your write query here 
doesn't do that. To fix, just make the "followers" match variable-length to 
include the own user: MATCH (followers) -[:FOLLOWS*0..1]-> (n).

- I'm not sure how you set up your linked lists, but unless they're 
circular (linking back to the user), you may need to be robust to the list 
being empty initially. E.g. in the write query, change the MATCH 
(followers) -[oldFeed:FEED]-> (after) to an OPTIONAL MATCH instead. But 
it's trickier than that...

Hope this helps!

Really fantastic post again, and looking forward to hopefully seeing it 
somewhere better than here. =)

Aseem


On Saturday, June 28, 2014 6:02:00 PM UTC-4, Aran Mulholland wrote:
>
> Hi all,
>
>
> When creating a social network type application the feed is king, the most 
> common query that will be run against your service will be a request to get 
> the feed. There are of course many ways to model the feed and a couple of 
> components that are required.
>
> To do a very simple feed one might structure the data as follows:
>
> Users follow other users
>
> (n:USER)-[:FOLLOWS]->(m:USER)
>
> http://neo4j-console-20.herokuapp.com/?id=xp1hea
>
>
> Users can then post stuff
>
> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>
> http://neo4j-console-20.herokuapp.com/?id=4p5aui
>
>
> The feed query then becomes:
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) 
> WHERE n.id = 1 
> RETURN m, p 
> ORDER BY p.time DESC
>
> http://neo4j-console-20.herokuapp.com/?id=msje6d
>
> All this looks well and good, but there will be some problems. The feed is 
> generally returned in reverse date order. When browsing twitter for example 
> we see a list of tweets with the most recent tweet at the top. To load the 
> whole feed for a user would be ridiculous so what happens is the top 20-50 
> tweets are loaded and when you scroll down more are loaded as you need 
> them. In order to achieve this with the data structure outlined above one 
> would have to structure the queries as follows. I'm going to return 2 at a 
> time as I don't have that many in the db)
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) 
> WHERE n.id = 1 
> RETURN m, p 
> ORDER BY p.time DESC 
> LIMIT 2
>  
> http://neo4j-console-20.herokuapp.com/r/ggclr
>
> In order to get the next 2 after this we have to match where time is 
> greater than the last post returned by the previous query
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST) 
> WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the last 
> one returned
> RETURN m, p 
> ORDER BY p.time DESC 
> LIMIT 2
>
> http://neo4j-console-20.herokuapp.com/?id=ggclr
>
> So there we have a basic feed, it has a few issues however. The order by 
> clause is never cheap. I don't know the internals and how well they are 
> optimised but could imagine that in the worst case scenario the query would 
> have to fetch every single post, order the lot of them and then return a 
> subset. As the feed query will be the most important query we will probably 
> need a more efficient way to get this information. Enter the linked list.
>
> Instead of each post only being associated with the user who made the post 
> we will create a linked list holding the feed for every user. Initially 
> structured like so:
>
> (n:USER)-[:FEED]->(n)
>
> http://neo4j-console-20.herokuapp.com/r/sa5kxm
>
> Whenever a user posts we find all of the users followers and add the post 
> into each of their feeds. In other words we do the most work doing the 
> insertion. The insertion is expensive, the feed request should be 
> inexpensive. So when a post is added we add an item to each of the linked 
> lists of the users that are following the poster. Confused? Heres the code.
>
> Inserting a new post gets a little more complicated
>
>
> MATCH (n:USER) 
> WHERE n.id = 4 
> CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the 
> post' }) 
> WITH n, p 
> MATCH followers-[:FOLLOWS]->n 
> WITH followers, n, p 
> MATCH followers-[oldFeed:FEED]->after 
> CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p 
> CREATE followers-[:FEED]->feedItem-[:FEED]->after 
> DELETE oldFeed
>
>
> Here we have the linked list insertion method when inserting at the head 
> of the list. 
>
> First we match the head node
>
> MATCH followers-[oldFeed:FEED]->after 
>
> Then we insert a new node
>
> CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after 
>
> Then we delete the old link
>
> DELETE oldFeed
>
>
> The reason the structure does not look like this
>
> (n:USER)-[:FEED]->(p:POST)-[:FEED]->after
>
> Is that the post will appear in multiple users feeds so we need a pseudo 
> feed item that is specific to each user otherwise when we query the feed 
> the traversal will end up getting mixed through a  heap of different users 
> feeds.
>
>
> Here is a graph with a few posts inserted:
>
> http://neo4j-console-20.herokuapp.com/?id=otayz5
>
> So now we come to the all important feed query, it ends up looking like 
> this
>
>
> MATCH (n:USER) 
> WHERE n.id = 1 
> MATCH n-[:FEED*]->(feedItem) 
> WHERE feedItem <> n 
> WITH feedItem 
>  MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster 
> RETURN post, poster
>
> http://neo4j-console-20.herokuapp.com/r/rnne41
>
> Now if we want to limit the number of items that are returned by the 
> query, we limit the number of relationships we follow like so (if we want 
> the top 3 items)
>
> MATCH (n:USER) 
> WHERE n.id = 1 
> MATCH n-[:FEED*0..3]->(feedItem) 
> WHERE feedItem <> n 
> WITH feedItem 
> MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster 
> RETURN post, poster
>
> http://neo4j-console-20.herokuapp.com/r/gfu2la
>
> This system works a lot better, the feed items are already ordered, so we 
> do not need the ORDER BY clause, the feed contains the users own posts so 
> they are included as well. And because the query only follows one straight 
> list of relations one would assume it was faster.
>
> Enter the demon in the corner. In order to have Neo4j (or any other 
> storage mechanism) add entries to a linked list you need to take a lock on 
> the node you are adding the next linked node to. If you do not and you get 
> two inserts happening at the same time your linked list will branch with 
> two nodes coming off the one node. I have done pretty extensive testing on 
> this, the posts (as well as the findings) relating to this were asked on 
> this mailing list if you are interested.
>
> Taking a lock is possible, the queries do get a lot more complicated, but 
> I am worried about scalability. If a user has 200 friends I will have to 
> take locks on their feeds. That means that if you have two friends that are 
> posting items that should end up in your feed at the same time that a 
> deadlock is going to occur. In certain networks (like twitter) some users 
> have 100,000 followers. In a scenario such as this I would have to take a 
> lock on 100,000 nodes. This means that while the user that has 100,000 
> friends is posting no one else that is friends with any of those 100,000 
> people will be able to post without getting a deadlock. This does not seem 
> to be scalable.
>
> So should I go with the very simple approach where I have to do an 
> order-by and posts are inserted like so?
>
> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>
> This has a more expensive feed query but I don't have to worry about locks 
> and linked lists and such. the queries are a lot simpler.
>
> What do you people think?
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[Neo4j] Re: Social network design

Reply via email to