Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

Michael Hunger Wed, 15 Oct 2014 05:05:10 -0700

Hi,

From the profiling, it seems that Cypher selects the wrong pattern matcher
if we separate the node lookup and the path match.
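
To make the cost difference concrete, here is a toy Python sketch. It is not how Cypher's matchers are implemented; the function names and the synthetic graph are made up for illustration. It compares two ways of finding paths of length 0..2 between two known nodes: expanding everything from one end and filtering on the other end afterwards (the shape of the Filter-over-TraversalMatcher profile quoted further down, where 1010795 rows are reduced to 9), versus anchoring both end nodes first and intersecting their neighbourhoods. It assumes a simple graph without parallel relationships or self-loops.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build undirected neighbour sets, since -[*0..2]- ignores direction."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def expand_then_filter(adj, start, end):
    """Enumerate all paths of length 0..2 from start, then filter on the end node."""
    candidates = [(start,)]
    candidates += [(start, x) for x in adj[start]]
    # y != start: a path may not walk back over the relationship it just used.
    candidates += [(start, x, y) for x in adj[start] for y in adj[x] if y != start]
    found = [p for p in candidates if p[-1] == end]
    return found, len(candidates)


def anchor_both_ends(adj, start, end):
    """Look only at the neighbourhoods of the two end nodes and intersect them."""
    found = []
    if start == end:
        found.append((start,))
    else:
        if end in adj[start]:
            found.append((start, end))
        found += [(start, mid, end) for mid in adj[start] & adj[end]]
    inspected = len(adj[start]) + len(adj[end])
    return found, inspected


# Hypothetical graph: Topic1 has 50 neighbours, each with 20 further neighbours,
# and only two of the 50 are also connected to Topic2.
edges = [("Topic1", f"x{i}") for i in range(50)]
edges += [(f"x{i}", f"y{i}_{j}") for i in range(50) for j in range(20)]
edges += [("x0", "Topic2"), ("x1", "Topic2")]
adj = build_adjacency(edges)

expand_paths, expand_work = expand_then_filter(adj, "Topic1", "Topic2")
anchored_paths, anchored_work = anchor_both_ends(adj, "Topic1", "Topic2")
print(sorted(expand_paths) == sorted(anchored_paths), expand_work, anchored_work)
```

Both functions return the same paths; the second number printed is the count of candidate paths the one-sided expansion had to build, the third the count of neighbours the two-sided version looked at.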


profile
 MATCH  p = (n:Topic)-[*0..2]-(m:Topic)
 where n.name = 'Topic1' and m.name = 'Topic2'
 return p, reduce(totProximity = 0, n IN relationships(p)| totProximity +
n.proximity) AS pathProximity
 order by pathProximity DESC  LIMIT 6;


+------------------+------+--------+-------------+-------------------------------------------------------------------+
|         Operator | Rows | DbHits | Identifiers |                                                             Other |
+------------------+------+--------+-------------+-------------------------------------------------------------------+
|     ColumnFilter |    0 |      0 |             |                                     keep columns p, pathProximity |
|              Top |    0 |      0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
|          Extract |    0 |      0 |             |                                                     pathProximity |
|      ExtractPath |    0 |      0 |           p |                                                                   |
|           Filter |    0 |      0 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
| TraversalMatcher |    0 |      1 |             |                                                 m,   UNNAMED20, m |
+------------------+------+--------+-------------+-------------------------------------------------------------------+
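
As a side note on the reduce(...) expression in the query above: it simply sums the proximity of every relationship on a path, and the query then keeps the six largest sums. A small Python sketch of that arithmetic, using the relationship proximities of the nine Topic1 to Topic2 paths quoted further down in this thread (the variable and function names are made up for illustration):

```python
# Relationship proximities of the nine Topic1 -> Topic2 paths quoted in this thread,
# one inner list per path.
paths = [
    [68, 32], [28, 23], [32, 82], [72, 32], [23, 124],
    [55, 27], [23, 124], [47, 138], [47, 138],
]


def path_proximity(weights):
    """Mirror reduce(totProximity = 0, r IN relationships(p) | totProximity + r.proximity)."""
    total = 0
    for w in weights:
        total += w
    return total


# Equivalent of: ORDER BY pathProximity DESC LIMIT 6
top6 = sorted((path_proximity(p) for p in paths), reverse=True)[:6]
print(top6)  # [185, 185, 147, 147, 114, 104]
```

A zero-length path (the lower bound of [*0..2]) has no relationships, so its sum stays at the initial value 0.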

On Wed, Oct 15, 2014 at 11:00 AM, gg4u <[email protected]> wrote:

> Hi Michael,
>
>> your aggregation was only on the same paths, so you get 9 different paths
>> but you didn't show the counts per path.
>>
>
> This is not clear to me yet; I will post the results for each query you
> suggested trying.
>
> Rodger, to summarize this test:
> 4M nodes labeled 'Topic'
> 100M rels (weighted)
> Index on Topic(name), a string property present on every node
> 'Topic' dominates the whole dataset, and this will be a subgraph of a larger
> network (if I can get this working in production, the next step will be a
> graph of 85M nodes and ~2B rels with the same type of structure, keeping
> properties as node properties rather than decoupling them into other nodes).
> So this is a primary, real-case test to see whether it is feasible to use the
> Neo4j data structure vs. NoSQL.
> And I'd love the answer to be yes :D
>
> Michael, here is another test with other topics (I think not cached):
>
> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic100'
> and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);
>
> results:
> ==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> ==> | p                                                                                                                                                                                                        | count(*) |
> ==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[10618620]{proximity:90},Node[3528892]{id:411782,name:"Topic101"},:P_Topic_Link[1025954]{proximity:68},Node[1386672]{id:21245,name:"Topic2"}]       | 1        |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[2424845]{proximity:91},Node[3719110]{id:52502,name:"Topic102"},:P_Topic_Link[1025923]{proximity:85},Node[1386672]{id:21245,name:"Topic2"}]         | 1        |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[100682940]{proximity:19},Node[3461206]{id:39782569,name:"Topic103"},:P_Topic_Link[100682931]{proximity:107},Node[1386672]{id:21245,name:"Topic2"}] | 1        |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[21653222]{proximity:82},Node[706102]{id:1551073,name:"Topic104"},:P_Topic_Link[21653218]{proximity:87},Node[1386672]{id:21245,name:"Topic2"}]      | 1        |
>
> (.... results ...)
>
> ==> +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
> ==> 67 rows
> ==> 3900775 ms
>
>
> On Tuesday, 14 October 2014 22:54:43 UTC+2, Michael Hunger wrote:
>>
>> How many rows does this return?
>>
>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1'
>> and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);
>>
>> your aggregation was only on the same paths, so you get 9 different paths
>> but you didn't show the counts per path.
>>
>
>>
>
>> and obtain 9 rows in 182799 ms
>>
>> On Tue, Oct 14, 2014 at 10:59 AM, gg4u <[email protected]> wrote:
>>
>>> Yes:
>>>
>>> neo4j-sh (?)$ profile  MATCH (n:Topic), (m:Topic) where n.name =
>>> 'Topic1' and m.name = 'Topic2'  MATCH  p = (n)-[*0..2]-(m) return p,
>>> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity)
>>> AS pathProximity  order by pathProximity DESC  LIMIT 6;
>>> ==>
>>> [...results...]
>>> ==> 6 rows
>>> ==>
>>> ==> ColumnFilter
>>> ==>   |
>>> ==>   +Top
>>> ==>     |
>>> ==>     +Extract
>>> ==>       |
>>> ==>       +ExtractPath
>>> ==>         |
>>> ==>         +PatternMatcher
>>> ==>           |
>>> ==>           +SchemaIndex(0)
>>> ==>             |
>>> ==>             +SchemaIndex(1)
>>> ==>
>>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>>> ==> |       Operator | Rows | DbHits |       Identifiers |                                           Other |
>>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>>> ==> |   ColumnFilter |    6 |      0 |                   |                   keep columns p, pathProximity |
>>> ==> |            Top |    6 |      0 |                   | {  AUTOINT3}; Cached(pathProximity of type Any) |
>>> ==> |        Extract |    9 |     36 |                   |                                   pathProximity |
>>> ==> |    ExtractPath |    9 |      0 |                 p |                                                 |
>>> ==> | PatternMatcher |    9 |      0 | n, m,   UNNAMED94 |                                                 |
>>> ==> | SchemaIndex(0) |    1 |      2 |              m, m |                   {  AUTOSTRING1}; :Topic(name) |
>>> ==> | SchemaIndex(1) |    1 |      2 |              n, n |                   {  AUTOSTRING0}; :Topic(name) |
>>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>>> ==>
>>> neo4j-sh (?)$
>>>
>>>
>>>
>>> On Tuesday, 14 October 2014 10:00:29 UTC+2, Michael Hunger wrote:
>>>>
>>>> Can you try this:
>>>>
>>>> profile
>>>> MATCH (n:Topic), (m:Topic)
>>>>  where n.name = 'Topic1' and m.name = 'Topic2'
>>>> MATCH  p = (n)-[*0..2]-(m)
>>>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity
>>>> + n.proximity) AS pathProximity
>>>> order by pathProximity DESC
>>>> LIMIT 6
>>>>
>>>>
>>>>
>>>> On Tue, Oct 14, 2014 at 9:06 AM, gg4u <[email protected]> wrote:
>>>>
>>>>> Hi Rodger,
>>>>>
>>>>> Thank you for your insights!
>>>>> Please see my comments below:
>>>>>
>>>>> On Monday, 13 October 2014 18:37:50 UTC+2, Rodger wrote:
>>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I've done a lot of RDBMS performance tuning.
>>>>>> Just a few quick thoughts.
>>>>>>
>>>>>>
>>>>>> Be sure to run the queries in the shell, if you are not already doing
>>>>>> so.
>>>>>>
>>>>>>
>>>>> Yes, they are run in the shell:
>>>>> http://localhost:7474/webadmin/#/console/
>>>>>
>>>>>
>>>>>> How many rows are returned? Just sorting, then returning many rows,
>>>>>> takes a long time to scroll them to output.
>>>>>>
>>>>>>
>>>>>>
>>>>> 9 rows
>>>>> In the answer above, I wrote 9 paths
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> If you are getting duplicates, it may be the equivalent of a
>>>>>> cartesian product,
>>>>>> one of the worst things that can happen in RDBMS, and also one
>>>>>> of the least known. See my presentation on them here:
>>>>>> http://rodgersnotes.wordpress.com/2010/09/15/stamping-out-cartesian-products/
>>>>>>
>>>>>
>>>>> So I had a look at your PDF,
>>>>> http://rodgersnotes.files.wordpress.com/2010/09/cartprodwordpress.pdf
>>>>> page 11,
>>>>>
>>>>> and I think the idea you are suggesting is to avoid duplicates (you
>>>>> called them 'cartesian products') by enforcing conditions.
>>>>> However, since this is a graph db and not a relational one, it is not clear
>>>>> to me where this applies, because in the graph db I don't have 'joined'
>>>>> queries between tables,
>>>>> so the conditions I have are, at least in my case, properties (with an index
>>>>> on properties) and non-directional rels.
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Try:
>>>>>>
>>>>>> return p, count (*)
>>>>>> order by count(*)
>>>>>>
>>>>>
>>>>> I ran:
>>>>>
>>>>> profile MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name
>>>>> = 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*)
>>>>> order by count(*);
>>>>>
>>>>> and I got the following (note there are also duplicates in the paths: is it
>>>>> because I have both (a)-[]->(b) and (a)<-[]-(b)?):
>>>>>
>>>>> ==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> ==> | p                                                                                                                                                                                                    | count(*) |
>>>>> ==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[71185298]{proximity:68},Node[1401899]{id:21375850,name:"Topic3"},:P_Topic_Link[71185313]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]  | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[88675719]{proximity:28},Node[2594397]{id:31760062,name:"Topic4"},:P_Topic_Link[88675745]{proximity:23},Node[1386672]{id:21245,name:"Topic2"}]  | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[30736000]{proximity:32},Node[2515502]{id:3106745,name:"Topic5"},:P_Topic_Link[30735974]{proximity:82},Node[1386672]{id:21245,name:"Topic2"}]   | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[68206383]{proximity:72},Node[1202629]{id:19635605,name:"Topic6"},:P_Topic_Link[68206440]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]  | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[98898126]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}] | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[58107755]{proximity:55},Node[506613]{id:13841207,name:"Topic8"},:P_Topic_Link[58107766]{proximity:27},Node[1386672]{id:21245,name:"Topic2"}]   | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[1025873]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}]  | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]      | 1        |
>>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]      | 1        |
>>>>> ==> +-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>>>>> ==> 9 rows
>>>>> ==>
>>>>> ==> ColumnFilter(0)
>>>>> ==>   |
>>>>> ==>   +Sort
>>>>> ==>     |
>>>>> ==>     +EagerAggregation
>>>>> ==>       |
>>>>> ==>       +ColumnFilter(1)
>>>>> ==>         |
>>>>> ==>         +ExtractPath
>>>>> ==>           |
>>>>> ==>           +Filter
>>>>> ==>             |
>>>>> ==>             +TraversalMatcher
>>>>> ==>
>>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>>> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                                            Other |
>>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>>> ==> |  ColumnFilter(0) |       9 |       0 |             |                                                         keep columns p, count(*) |
>>>>> ==> |             Sort |       9 |       0 |             | Cached(  INTERNAL_AGGREGATE931614f3-4def-4fc4-a80b-c6fca3839817 of type Integer) |
>>>>> ==> | EagerAggregation |       9 |       0 |             |                                                                                p |
>>>>> ==> |  ColumnFilter(1) |       9 |       0 |             |                                                             keep columns p, n, m |
>>>>> ==> |      ExtractPath |       9 |       0 |           p |                                                                                  |
>>>>> ==> |           Filter |       9 | 3032385 |             |                (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
>>>>> ==> | TraversalMatcher | 1010795 | 1024307 |             |                                                                m,   UNNAMED36, m |
>>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>>> ==>
>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Without me looking at the raw data, and the query result, you
>>>>>> seem to have many operations going on. So, you have a lot of rows in
>>>>>> the profile output.
>>>>>>
>>>>>
>>>>> Only 9
>>>>>
>>>>>
>>>>>>  As a general rule, the more rows there are in the
>>>>>> profile, the slower the response time is.
>>>>>> ie. the more complex the query, the slower it is.
>>>>>>
>>>>>>
>>>>>> If I were looking at this, I would try to isolate which part of
>>>>>> the query is the slow part.  The Return clause, or the Match clause?
>>>>>>
>>>>>>
>>>>>> You've already tried the response times with the data.
>>>>>> Try to simply:
>>>>>> return count(*) .
>>>>>>
>>>>>
>>>>> I ran:
>>>>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name =
>>>>> 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order
>>>>> by count(*);
>>>>>
>>>>> and obtained 9 rows in 182799 ms.
>>>>>
>>>>> I ran:
>>>>> MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name =
>>>>> 'Topic2' with n, m return count(*);
>>>>>
>>>>> and it took 856 ms.
>>>>>
>>>>>
>>>>> profile MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name
>>>>> = 'Topic2' with n, m return count(*);
>>>>>
>>>>> results in:
>>>>>
>>>>>
>>>>> ==> ColumnFilter
>>>>> ==>   |
>>>>> ==>   +EagerAggregation
>>>>> ==>     |
>>>>> ==>     +SchemaIndex(0)
>>>>> ==>       |
>>>>> ==>       +SchemaIndex(1)
>>>>> ==>
>>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>>> ==> |         Operator | Rows | DbHits | Identifiers |                         Other |
>>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>>> ==> |     ColumnFilter |    1 |      0 |             |         keep columns count(*) |
>>>>> ==> | EagerAggregation |    1 |      0 |             |                               |
>>>>> ==> |   SchemaIndex(0) |    1 |      2 |        m, m | {  AUTOSTRING1}; :Topic(name) |
>>>>> ==> |   SchemaIndex(1) |    1 |      2 |        n, n | {  AUTOSTRING0}; :Topic(name) |
>>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>>>
>>>>>
>>>>>> How many seconds response time is that, versus the original query?
>>>>>> What is the resulting profile?
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> So it looks like the huge amount of time actually goes into traversing the
>>>>> graph, while matching a node by its full name string takes a reasonable
>>>>> ~900 ms.
>>>>>
>>>>> *Any ideas for improving the performance of the traversal?*
>>>>>
>>>>> *It is a real problem, since I hit the same issue even when fetching the
>>>>> first neighbors of a node, which currently makes this unfeasible for
>>>>> production.*
>>>>> *Does anyone have a real case with a graph of similar size and structure,
>>>>> running a similar query?*
>>>>>
>>>>> As an example, this query obtains the first neighbors of node Topic44:
>>>>>
>>>>> MATCH (n:Topic) , (m), p = (n)-[*0..1]-(m)
>>>>> where n.name = 'Topic44'
>>>>> with p, n, m
>>>>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity
>>>>> + n.proximity) AS pathProximity order by pathProximity DESC LIMIT 6
>>>>>
>>>>> returns
>>>>> 6 rows in ~65000 ms vs. 6 rows in less than a second with a NoSQL store.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Thank you guys for helping! I hope to find a solution soon.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> See also the tuning presentations I've done:
>>>>>> http://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/
>>>>>> http://rodgersnotes.wordpress.com/2014/06/08/tuning-the-untunable-when-indexes-and-optimizer-dont-help-2/
>>>>>> They are quick reads.
>>>>>>
>>>>> Thank you, I have seen them;
>>>>> they are mostly about SQL tuning.
>>>>> I've just used the Neo4j structure to store a graph with the same label on
>>>>> 4M topics (I MUST keep it with one label), put an index on the Topic(name)
>>>>> property, and used Cypher to query the db.
>>>>> This is my data structure.
>>>>>
>>>>>> I've put a number of principles in there that you
>>>>>> might apply.
>>>>>> ie. Could you create the NEO4J equivalent of a temp table?
>>>>>>
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>>
>>>>>> On Thursday, October 9, 2014 2:41:47 AM UTC-5, gg4u wrote:
>>>>>>>
>>>>>>> Hi Michael, thank you.
>>>>>>> Sure, I'll post my profile result below!
>>>>>>>
>>>>>>>
>>>>>>>>  --
>>>>> You received this message because you are subscribed to the Google
>>>>> Groups "Neo4j" group.
>>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>>> an email to [email protected].
>>>>> For more options, visit https://groups.google.com/d/optout.

