[Neo4j] Social network design

Aran Mulholland Sat, 28 Jun 2014 15:02:02 -0700

Hi all,


When creating a social network type application the feed is king, the most
common query that will be run against your service will be a request to get
the feed. There are of course many ways to model the feed and a couple of
components that are required.

To do a very simple feed one might structure the data as follows:

Users follow other users

(n:USER)-[:FOLLOWS]->(m:USER)

http://neo4j-console-20.herokuapp.com/?id=xp1hea


Users can then post stuff

(n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})

http://neo4j-console-20.herokuapp.com/?id=4p5aui


The feed query then becomes:

MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.id = 1
RETURN m, p
ORDER BY p.time DESC

http://neo4j-console-20.herokuapp.com/?id=msje6d

All this looks well and good, but there will be some problems. The feed is
generally returned in reverse date order. When browsing twitter for example
we see a list of tweets with the most recent tweet at the top. To load the
whole feed for a user would be ridiculous so what happens is the top 20-50
tweets are loaded and when you scroll down more are loaded as you need
them. In order to achieve this with the data structure outlined above one
would have to structure the queries as follows. I'm going to return 2 at a
time as I don't have that many in the db)

MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.id = 1
RETURN m, p
ORDER BY p.time DESC
LIMIT 2

http://neo4j-console-20.herokuapp.com/r/ggclr

In order to get the next 2 after this we have to match where time is
greater than the last post returned by the previous query

MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the last one
returned
RETURN m, p
ORDER BY p.time DESC
LIMIT 2

http://neo4j-console-20.herokuapp.com/?id=ggclr

So there we have a basic feed, it has a few issues however. The order by
clause is never cheap. I don't know the internals and how well they are
optimised but could imagine that in the worst case scenario the query would
have to fetch every single post, order the lot of them and then return a
subset. As the feed query will be the most important query we will probably
need a more efficient way to get this information. Enter the linked list.

Instead of each post only being associated with the user who made the post
we will create a linked list holding the feed for every user. Initially
structured like so:

(n:USER)-[:FEED]->(n)

http://neo4j-console-20.herokuapp.com/r/sa5kxm

Whenever a user posts we find all of the users followers and add the post
into each of their feeds. In other words we do the most work doing the
insertion. The insertion is expensive, the feed request should be
inexpensive. So when a post is added we add an item to each of the linked
lists of the users that are following the poster. Confused? Heres the code.

Inserting a new post gets a little more complicated


MATCH (n:USER)
WHERE n.id = 4
CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the
post' })
WITH n, p
MATCH followers-[:FOLLOWS]->n
WITH followers, n, p
MATCH followers-[oldFeed:FEED]->after
CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p
CREATE followers-[:FEED]->feedItem-[:FEED]->after
DELETE oldFeed


Here we have the linked list insertion method when inserting at the head of
the list.

First we match the head node

MATCH followers-[oldFeed:FEED]->after

Then we insert a new node

CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after

Then we delete the old link

DELETE oldFeed


The reason the structure does not look like this

(n:USER)-[:FEED]->(p:POST)-[:FEED]->after

Is that the post will appear in multiple users feeds so we need a pseudo
feed item that is specific to each user otherwise when we query the feed
the traversal will end up getting mixed through a  heap of different users
feeds.


Here is a graph with a few posts inserted:

http://neo4j-console-20.herokuapp.com/?id=otayz5

So now we come to the all important feed query, it ends up looking like this


MATCH (n:USER)
WHERE n.id = 1
MATCH n-[:FEED*]->(feedItem)
WHERE feedItem <> n
WITH feedItem
 MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
RETURN post, poster

http://neo4j-console-20.herokuapp.com/r/rnne41

Now if we want to limit the number of items that are returned by the query,
we limit the number of relationships we follow like so (if we want the top
3 items)

MATCH (n:USER)
WHERE n.id = 1
MATCH n-[:FEED*0..3]->(feedItem)
WHERE feedItem <> n
WITH feedItem
MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
RETURN post, poster

http://neo4j-console-20.herokuapp.com/r/gfu2la

This system works a lot better, the feed items are already ordered, so we
do not need the ORDER BY clause, the feed contains the users own posts so
they are included as well. And because the query only follows one straight
list of relations one would assume it was faster.

Enter the demon in the corner. In order to have Neo4j (or any other storage
mechanism) add entries to a linked list you need to take a lock on the node
you are adding the next linked node to. If you do not and you get two
inserts happening at the same time your linked list will branch with two
nodes coming off the one node. I have done pretty extensive testing on
this, the posts (as well as the findings) relating to this were asked on
this mailing list if you are interested.

Taking a lock is possible, the queries do get a lot more complicated, but I
am worried about scalability. If a user has 200 friends I will have to take
locks on their feeds. That means that if you have two friends that are
posting items that should end up in your feed at the same time that a
deadlock is going to occur. In certain networks (like twitter) some users
have 100,000 followers. In a scenario such as this I would have to take a
lock on 100,000 nodes. This means that while the user that has 100,000
friends is posting no one else that is friends with any of those 100,000
people will be able to post without getting a deadlock. This does not seem
to be scalable.

So should I go with the very simple approach where I have to do an order-by
and posts are inserted like so?

(n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})

This has a more expensive feed query but I don't have to worry about locks
and linked lists and such. the queries are a lot simpler.

What do you people think?

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

[Neo4j] Social network design

Reply via email to