Re: [Neo4j] Re: Social network design

Aran Mulholland Tue, 01 Jul 2014 16:19:37 -0700

Could you explain what you mean by "Cypher is not the best fit for handling
those long chains"


Are you saying that to solve this problem properly I am going to have to
write some Java?

Cypher gives me the flexibility to code my service layer in the platform of
choice (right now that's node.js) I want to communicate with the graph db
over http and do not have any idea about how to extend neo4j to allow
queries based on the extension you mention (i'm not a java guy) over http.

As for the second part of your statement, are you referring to your
extension or are you speaking in general terms. i.e. Are you saying that
concurrent writes should never result in the branching of a linked list?


On Tue, Jul 1, 2014 at 11:04 PM, Michael Hunger <
[email protected]> wrote:

> Aran,
>
> right now Cypher is not the best fit for handling those long chains, I set
> up a example repository for a use case similar to this, that implements it
> with an unmanaged extension.
>
> See here: https://github.com/jexp/neo4j-activity-stream
>
> Nodes that have relationships added to and removed from are locked
> automatically.
> Only if as part of a single operation on your list you grab multiple nodes
> to add/remove different relationships, you might end up with a
> #1 deadlock where you have to retry your operation
> #2 in certain conditions when your nodes don't overlap initially and the
> first tx finishes while the second is still working, then you might end up
> with changing something twice. (Would have to think about a scenario that
> represents this).
>
> HTH
>
> Cheers, Michael
>
>
> On Tue, Jul 1, 2014 at 2:59 PM, Aran Mulholland <[email protected]>
> wrote:
>
>> Here is the gist:
>>
>> http://gist.neo4j.org/?8d55c23a84d9c94dd4f2
>>
>> It did not end up being 1:1 but it's pretty close. I fixed some stuff on
>> the way so the queries and data was a bit tighter.
>>
>> I would really like some more thoughts on this. At the moment I do not
>> have a scalable solution for using Neo4j to build a service like twitter.
>> The cookbook examples (which my research has been based on) are a bit naive
>> and would not work well in practice. I would love to hear how others have
>> solved this problem.
>>
>> Aseem, I set up a repository to do my testing on the linked list
>> problems. (https://github.com/aranm/neo4j-linked-list) In the end I got
>> a linked list implementation that would not branch. But it's got that evil
>> lock...
>>
>> I had some help from the mailing list and also on
>> http://stackoverflow.com/questions/21692829/neo4j-how-do-you-do-parallel-insertion-into-a-linked-list-via-cypher-over-rest
>>
>> But I definitely saw branching. (Somewhere I have a screen dump, and it
>> is not to hard to put together a project that will replicate the problem,
>> the github repository does if you care to look through the history)
>>
>> From my understanding (and I could be wrong) I think that when you add a
>> relationship to a node it does not necessarily mean that you have to lock
>> that node so no other relationship can be added at the same time. If you
>> want to lock you need to do some funky muddling around.
>>
>> Can you further explain how you got around the problem in your scenario,
>> especially what you mean by 'we're going to attack that in practice via
>> external caching for now.'
>>
>> I am toying with the idea of placing the feed items in table storage. But
>> it seems a pity to me to have to use two data storage solutions.
>>
>>
>>
>>
>> On Mon, Jun 30, 2014 at 2:15 PM, Aseem Kishore <[email protected]>
>> wrote:
>>
>>> Fantastic post.
>>>
>>> It's rare to see such thorough and fully thought-out posts in the public
>>> IMHO. A lot of people are doing very fine work with Neo4j and Cypher; the
>>> really robust stuff just doesn't tend to see the light of the day, I've
>>> found.
>>>
>>> E.g. You implement pagination robustly, filtering (WHERE) by time. And
>>> you astutely discovered that you need to have unique nodes in everyone's
>>> feed, otherwise you'll get mixed feed paths. (You can add an "owner ID"
>>> type of property to the relationships, but that's less efficient than just
>>> using separate nodes. You could also index those relationships via those
>>> properties, but that's also a bigger headache.)
>>>
>>> Nice work again. So good to have something this thorough and accurate
>>> and *correct* publicly documented.
>>>
>>> I heartily echo Michael's suggestion to make this into a proper post
>>> somewhere, whether as a Graph Gist or your own blog post or something else.
>>> I'd definitely love to point people to it! (And Google Groups just doesn't
>>> do this justice.)
>>>
>>> —
>>>
>>> To your actual q, I sort of agree with your identification of the
>>> problem — concurrent writes to the same users' linked lists — but I'm
>>> really surprised by your reported symptom — that your linked lists start
>>> branching / get two heads. I'm stunned. Have you actually seen that? Or are
>>> you just speculating?
>>>
>>> In my own experience, Neo4j throws a deadlock error in the case of these
>>> concurrent writes. The linked lists don't branch, because every Cypher
>>> query is an atomic transaction.
>>>
>>>  It's still a real pain to me that Neo4j even throws a deadlock error
>>> though. I want Neo4j to be smart enough to simply serialize those writes.
>>> (IOW, I don't understand why a deadlock needs to happen. Can't you e.g.
>>> order the locks by node ID? Of course, I'm ignorant of the implementation
>>> details here.) In practice, the workaround is to simply retry those writes.
>>> But that's shifting the burden from Neo4j to the caller.
>>>
>>> However, I haven't investigated/tested this deeply myself. We actually
>>> moved away from linked lists to simplify our own code and data model. While
>>> ORDER BY definitely doesn't scale well, we're going to attack that in
>>> practice via external caching for now.
>>>
>>> —
>>>
>>> A couple of minor corrections/improvements to your queries:
>>>
>>> - You say the feed contains the user's own posts, but your write query
>>> here doesn't do that. To fix, just make the "followers" match
>>> variable-length to include the own user: MATCH (followers)
>>> -[:FOLLOWS*0..1]-> (n).
>>>
>>> - I'm not sure how you set up your linked lists, but unless they're
>>> circular (linking back to the user), you may need to be robust to the list
>>> being empty initially. E.g. in the write query, change the MATCH
>>> (followers) -[oldFeed:FEED]-> (after) to an OPTIONAL MATCH instead. But
>>> it's trickier than that...
>>>
>>> Hope this helps!
>>>
>>> Really fantastic post again, and looking forward to hopefully seeing it
>>> somewhere better than here. =)
>>>
>>> Aseem
>>>
>>>
>>> On Saturday, June 28, 2014 6:02:00 PM UTC-4, Aran Mulholland wrote:
>>>>
>>>> Hi all,
>>>>
>>>>
>>>> When creating a social network type application the feed is king, the
>>>> most common query that will be run against your service will be a request
>>>> to get the feed. There are of course many ways to model the feed and a
>>>> couple of components that are required.
>>>>
>>>> To do a very simple feed one might structure the data as follows:
>>>>
>>>> Users follow other users
>>>>
>>>> (n:USER)-[:FOLLOWS]->(m:USER)
>>>>
>>>> http://neo4j-console-20.herokuapp.com/?id=xp1hea
>>>>
>>>>
>>>> Users can then post stuff
>>>>
>>>> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>>>>
>>>> http://neo4j-console-20.herokuapp.com/?id=4p5aui
>>>>
>>>>
>>>> The feed query then becomes:
>>>>
>>>> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
>>>> WHERE n.id = 1
>>>> RETURN m, p
>>>> ORDER BY p.time DESC
>>>>
>>>>  http://neo4j-console-20.herokuapp.com/?id=msje6d
>>>>
>>>> All this looks well and good, but there will be some problems. The feed
>>>> is generally returned in reverse date order. When browsing twitter for
>>>> example we see a list of tweets with the most recent tweet at the top. To
>>>> load the whole feed for a user would be ridiculous so what happens is the
>>>> top 20-50 tweets are loaded and when you scroll down more are loaded as you
>>>> need them. In order to achieve this with the data structure outlined above
>>>> one would have to structure the queries as follows. I'm going to return 2
>>>> at a time as I don't have that many in the db)
>>>>
>>>> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
>>>> WHERE n.id = 1
>>>> RETURN m, p
>>>> ORDER BY p.time DESC
>>>> LIMIT 2
>>>>
>>>> http://neo4j-console-20.herokuapp.com/r/ggclr
>>>>
>>>> In order to get the next 2 after this we have to match where time is
>>>> greater than the last post returned by the previous query
>>>>
>>>> MATCH (n:USER)-[:FOLLOWS]->(m:USER)-[:POSTS]-(p:POST)
>>>> WHERE n.id = 1 AND p.time < 1392464619659 //this time stamp is the
>>>> last one returned
>>>> RETURN m, p
>>>> ORDER BY p.time DESC
>>>> LIMIT 2
>>>>
>>>> http://neo4j-console-20.herokuapp.com/?id=ggclr
>>>>
>>>> So there we have a basic feed, it has a few issues however. The order
>>>> by clause is never cheap. I don't know the internals and how well they are
>>>> optimised but could imagine that in the worst case scenario the query would
>>>> have to fetch every single post, order the lot of them and then return a
>>>> subset. As the feed query will be the most important query we will probably
>>>> need a more efficient way to get this information. Enter the linked list.
>>>>
>>>> Instead of each post only being associated with the user who made the
>>>> post we will create a linked list holding the feed for every user.
>>>> Initially structured like so:
>>>>
>>>> (n:USER)-[:FEED]->(n)
>>>>
>>>> http://neo4j-console-20.herokuapp.com/r/sa5kxm
>>>>
>>>> Whenever a user posts we find all of the users followers and add the
>>>> post into each of their feeds. In other words we do the most work doing the
>>>> insertion. The insertion is expensive, the feed request should be
>>>> inexpensive. So when a post is added we add an item to each of the linked
>>>> lists of the users that are following the poster. Confused? Heres the code.
>>>>
>>>> Inserting a new post gets a little more complicated
>>>>
>>>>
>>>> MATCH (n:USER)
>>>> WHERE n.id = 4
>>>> CREATE (n)-[:POSTS]->(p:POST { time: TIMESTAMP(), content: 'Here is the
>>>> post' })
>>>> WITH n, p
>>>> MATCH followers-[:FOLLOWS]->n
>>>> WITH followers, n, p
>>>> MATCH followers-[oldFeed:FEED]->after
>>>> CREATE (feedItem:FEEDITEM)-[:FEEDPOST]->p
>>>> CREATE followers-[:FEED]->feedItem-[:FEED]->after
>>>> DELETE oldFeed
>>>>
>>>>
>>>> Here we have the linked list insertion method when inserting at the
>>>> head of the list.
>>>>
>>>> First we match the head node
>>>>
>>>> MATCH followers-[oldFeed:FEED]->after
>>>>
>>>> Then we insert a new node
>>>>
>>>> CREATE followers-[:FEED]->(newNode:FEEDITEM)-[:FEED]->after
>>>>
>>>> Then we delete the old link
>>>>
>>>> DELETE oldFeed
>>>>
>>>>
>>>> The reason the structure does not look like this
>>>>
>>>> (n:USER)-[:FEED]->(p:POST)-[:FEED]->after
>>>>
>>>> Is that the post will appear in multiple users feeds so we need a
>>>> pseudo feed item that is specific to each user otherwise when we query the
>>>> feed the traversal will end up getting mixed through a  heap of different
>>>> users feeds.
>>>>
>>>>
>>>> Here is a graph with a few posts inserted:
>>>>
>>>> http://neo4j-console-20.herokuapp.com/?id=otayz5
>>>>
>>>> So now we come to the all important feed query, it ends up looking like
>>>> this
>>>>
>>>>
>>>> MATCH (n:USER)
>>>> WHERE n.id = 1
>>>> MATCH n-[:FEED*]->(feedItem)
>>>> WHERE feedItem <> n
>>>> WITH feedItem
>>>>  MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
>>>> RETURN post, poster
>>>>
>>>> http://neo4j-console-20.herokuapp.com/r/rnne41
>>>>
>>>> Now if we want to limit the number of items that are returned by the
>>>> query, we limit the number of relationships we follow like so (if we want
>>>> the top 3 items)
>>>>
>>>> MATCH (n:USER)
>>>> WHERE n.id = 1
>>>> MATCH n-[:FEED*0..3]->(feedItem)
>>>> WHERE feedItem <> n
>>>> WITH feedItem
>>>> MATCH feedItem-[:FEEDPOST]->post<-[:POSTS]-poster
>>>> RETURN post, poster
>>>>
>>>> http://neo4j-console-20.herokuapp.com/r/gfu2la
>>>>
>>>> This system works a lot better, the feed items are already ordered, so
>>>> we do not need the ORDER BY clause, the feed contains the users own posts
>>>> so they are included as well. And because the query only follows one
>>>> straight list of relations one would assume it was faster.
>>>>
>>>> Enter the demon in the corner. In order to have Neo4j (or any other
>>>> storage mechanism) add entries to a linked list you need to take a lock on
>>>> the node you are adding the next linked node to. If you do not and you get
>>>> two inserts happening at the same time your linked list will branch with
>>>> two nodes coming off the one node. I have done pretty extensive testing on
>>>> this, the posts (as well as the findings) relating to this were asked on
>>>> this mailing list if you are interested.
>>>>
>>>> Taking a lock is possible, the queries do get a lot more complicated,
>>>> but I am worried about scalability. If a user has 200 friends I will have
>>>> to take locks on their feeds. That means that if you have two friends that
>>>> are posting items that should end up in your feed at the same time that a
>>>> deadlock is going to occur. In certain networks (like twitter) some users
>>>> have 100,000 followers. In a scenario such as this I would have to take a
>>>> lock on 100,000 nodes. This means that while the user that has 100,000
>>>> friends is posting no one else that is friends with any of those 100,000
>>>> people will be able to post without getting a deadlock. This does not seem
>>>> to be scalable.
>>>>
>>>> So should I go with the very simple approach where I have to do an
>>>> order-by and posts are inserted like so?
>>>>
>>>> (n:USER)-[:POST]->(p:POST {content: content, time: TIMESTAMP()})
>>>>
>>>> This has a more expensive feed query but I don't have to worry about
>>>> locks and linked lists and such. the queries are a lot simpler.
>>>>
>>>> What do you people think?
>>>>
>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
>>  --
>> You received this message because you are subscribed to the Google Groups
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>  --
> You received this message because you are subscribed to the Google Groups
> "Neo4j" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [Neo4j] Re: Social network design

Reply via email to