Re: [Neo4j] Social network design

Michael Hunger Sat, 28 Jun 2014 15:51:06 -0700

I would love if you could convert your email just 1:1 into a graph gist for
documentation and discussion.


See: gist.neo4j.org
Then you don't have to link consoles.

Cheers,

Michael


On Sun, Jun 29, 2014 at 12:01 AM, Aran Mulholland <[email protected]>
wrote:

> Hi all,
>
>
> When creating a social network type application the feed is king, the most
> common query that will be run against your service will be a request to get
> the feed. There are of course many ways to model the feed and a couple of
> components that are required.
>
> To do a very simple feed one might structure the data as follows:
>
> Users follow other users
>
> (n:USER)-[:FOLLOWS]->(m:USER)
>
> http://neo4j-console-20.herokuapp.com/?id=xp1hea
>
>
> Users can then post stuff
>
> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>
> http://neo4j-console-20.herokuapp.com/?id=4p5aui
>
>
> The feed query then becomes:
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
> WHERE n.id = 1
> RETURN m, p
> ORDER BY p.time DESC
>
> http://neo4j-console-20.herokuapp.com/?id=msje6d
>
> All this looks well and good, but there will be some problems. The feed is
> generally returned in reverse date order. When browsing twitter for example
> we see a list of tweets with the most recent tweet at the top. To load the
> whole feed for a user would be ridiculous so what happens is the top 20-50
> tweets are loaded and when you scroll down more are loaded as you need
> them. In order to achieve this with the data structure outlined above one
> would have to structure the queries as follows. I'm going to return 2 at a
> time as I don't have that many in the db)
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
> WHERE n.id = 1
> RETURN m, p
> ORDER BY p.time DESC
> LIMIT 2
>
> http://neo4j-console-20.herokuapp.com/r/ggclr
>
> In order to get the next 2 after this we have to match where time is
> greater than the last post returned by the previous query
>
> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
> WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the last
> one returned
> RETURN m, p
> ORDER BY p.time DESC
> LIMIT 2
>
> http://neo4j-console-20.herokuapp.com/?id=ggclr
>
> So there we have a basic feed, it has a few issues however. The order by
> clause is never cheap. I don't know the internals and how well they are
> optimised but could imagine that in the worst case scenario the query would
> have to fetch every single post, order the lot of them and then return a
> subset. As the feed query will be the most important query we will probably
> need a more efficient way to get this information. Enter the linked list.
>
> Instead of each post only being associated with the user who made the post
> we will create a linked list holding the feed for every user. Initially
> structured like so:
>
> (n:USER)-[:FEED]->(n)
>
> http://neo4j-console-20.herokuapp.com/r/sa5kxm
>
> Whenever a user posts we find all of the users followers and add the post
> into each of their feeds. In other words we do the most work doing the
> insertion. The insertion is expensive, the feed request should be
> inexpensive. So when a post is added we add an item to each of the linked
> lists of the users that are following the poster. Confused? Heres the code.
>
> Inserting a new post gets a little more complicated
>
>
> MATCH (n:USER)
> WHERE n.id = 4
> CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the
> post' })
> WITH n, p
> MATCH followers-[:FOLLOWS]->n
> WITH followers, n, p
> MATCH followers-[oldFeed:FEED]->after
> CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p
> CREATE followers-[:FEED]->feedItem-[:FEED]->after
> DELETE oldFeed
>
>
> Here we have the linked list insertion method when inserting at the head
> of the list.
>
> First we match the head node
>
> MATCH followers-[oldFeed:FEED]->after
>
> Then we insert a new node
>
> CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after
>
> Then we delete the old link
>
> DELETE oldFeed
>
>
> The reason the structure does not look like this
>
> (n:USER)-[:FEED]->(p:POST)-[:FEED]->after
>
> Is that the post will appear in multiple users feeds so we need a pseudo
> feed item that is specific to each user otherwise when we query the feed
> the traversal will end up getting mixed through a  heap of different users
> feeds.
>
>
> Here is a graph with a few posts inserted:
>
> http://neo4j-console-20.herokuapp.com/?id=otayz5
>
> So now we come to the all important feed query, it ends up looking like
> this
>
>
> MATCH (n:USER)
> WHERE n.id = 1
> MATCH n-[:FEED*]->(feedItem)
> WHERE feedItem <> n
> WITH feedItem
>  MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
> RETURN post, poster
>
> http://neo4j-console-20.herokuapp.com/r/rnne41
>
> Now if we want to limit the number of items that are returned by the
> query, we limit the number of relationships we follow like so (if we want
> the top 3 items)
>
> MATCH (n:USER)
> WHERE n.id = 1
> MATCH n-[:FEED*0..3]->(feedItem)
> WHERE feedItem <> n
> WITH feedItem
> MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
> RETURN post, poster
>
> http://neo4j-console-20.herokuapp.com/r/gfu2la
>
> This system works a lot better, the feed items are already ordered, so we
> do not need the ORDER BY clause, the feed contains the users own posts so
> they are included as well. And because the query only follows one straight
> list of relations one would assume it was faster.
>
> Enter the demon in the corner. In order to have Neo4j (or any other
> storage mechanism) add entries to a linked list you need to take a lock on
> the node you are adding the next linked node to. If you do not and you get
> two inserts happening at the same time your linked list will branch with
> two nodes coming off the one node. I have done pretty extensive testing on
> this, the posts (as well as the findings) relating to this were asked on
> this mailing list if you are interested.
>
> Taking a lock is possible, the queries do get a lot more complicated, but
> I am worried about scalability. If a user has 200 friends I will have to
> take locks on their feeds. That means that if you have two friends that are
> posting items that should end up in your feed at the same time that a
> deadlock is going to occur. In certain networks (like twitter) some users
> have 100,000 followers. In a scenario such as this I would have to take a
> lock on 100,000 nodes. This means that while the user that has 100,000
> friends is posting no one else that is friends with any of those 100,000
> people will be able to post without getting a deadlock. This does not seem
> to be scalable.
>
> So should I go with the very simple approach where I have to do an
> order-by and posts are inserted like so?
>
> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>
> This has a more expensive feed query but I don't have to worry about locks
> and linked lists and such. the queries are a lot simpler.
>
> What do you people think?
>
> --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [Neo4j] Social network design

Reply via email to