Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

gg4u Wed, 15 Oct 2014 02:01:02 -0700

Hi Michael, 

> your aggregation was only on the same paths, so you get 9 different paths 
> but you didn't show the counts per path. 
>


That is not clear to me yet; I will post the results for each query you 
suggested trying.

Rodger, to summarize this test:
4M nodes labeled 'Topic'
100M rels (weighted)
Index on Topic(name), a string property set on every node
'Topic' dominates the whole dataset, and this will be a subgraph of a larger 
network (if I can get this working in time for production, the next step will be a 
graph of 85M nodes and ~2B rels with the same type of structure, keeping 
properties as node properties rather than decoupling them into other nodes). So this 
is a first real-case test, to see whether using the Neo4j data structure is 
feasible vs. NoSQL.
And I'd love the answer to be yes :D
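
As a rough back-of-envelope check of what these sizes imply for an undirected variable-length match, here is a small Python sketch. It is only an estimate: it assumes relationships are spread evenly across nodes, which is unlikely to hold for a real topic graph, and the function names are made up for illustration.

```python
# Back-of-envelope estimate, assuming (hypothetically) a uniform degree distribution.
NODES = 4_000_000      # nodes labeled 'Topic'
RELS = 100_000_000     # weighted relationships


def average_degree(nodes: int, rels: int) -> float:
    """Each relationship touches two nodes, so it counts twice when
    relationships are followed in both directions."""
    return 2 * rels / nodes


def paths_up_to_two_hops(degree: float) -> float:
    """Paths of length 0, 1 and 2 from one start node; the second hop may not
    reuse the relationship taken on the first hop."""
    return 1 + degree + degree * (degree - 1)


d = average_degree(NODES, RELS)
print(d)                        # 50.0
print(paths_up_to_two_hops(d))  # 2501.0
```

The profile quoted further down reports 1010795 rows for TraversalMatcher on the Topic1/Topic2 query, far above this uniform estimate, which would be consistent with some topics having many more relationships than the average.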

Michael, here is another test with other topics (not cached, I think):

MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic100' 
and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);

results:
==> +----------------------------------------------------------------------+
==> | p | count(*) |
==> +----------------------------------------------------------------------+
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[10618620]{proximity:90},Node[3528892]{id:411782,name:"Topic101"},:P_Topic_Link[1025954]{proximity:68},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[2424845]{proximity:91},Node[3719110]{id:52502,name:"Topic102"},:P_Topic_Link[1025923]{proximity:85},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[100682940]{proximity:19},Node[3461206]{id:39782569,name:"Topic103"},:P_Topic_Link[100682931]{proximity:107},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[21653222]{proximity:82},Node[706102]{id:1551073,name:"Topic104"},:P_Topic_Link[21653218]{proximity:87},Node[1386672]{id:21245,name:"Topic2"}] | 1 |

(.... results ...)

==> +----------------------------------------------------------------------+
==> 67 rows
==> 3900775 ms
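
One possible reading of the 'duplicate' rows in the earlier Topic1/Topic2 output quoted below, offered as a sketch rather than a confirmed explanation: the two Topic7 rows visit the same nodes but differ in the second relationship id (98898126 vs 1025873), so they are distinct paths, and grouping by p gives count(*) = 1 for each. The Python below reproduces that grouping outside Neo4j, with paths modeled as tuples of ids copied from that output; the helper name node_key is made up for illustration.

```python
from collections import Counter

# Two rows copied from the Topic1 -> Topic2 output, as (node, rel, node, rel, node) ids.
# Both go Topic1 -> Topic7 -> Topic2, but the second relationship id differs.
paths = [
    (103105, 98898173, 3329750, 98898126, 1386672),
    (103105, 98898173, 3329750, 1025873, 1386672),
]


def node_key(path):
    """Hypothetical helper: keep only the node ids (even positions),
    dropping the relationship ids."""
    return path[::2]


# Grouping by the full path, as `return p, count(*)` does: every group has size 1.
by_path = Counter(paths)
# Grouping by the visited nodes only: the two rows collapse into one group.
by_nodes = Counter(node_key(p) for p in paths)

print(sorted(by_path.values()))  # [1, 1]
print(dict(by_nodes))            # {(103105, 3329750, 1386672): 2}
```

If that reading is right, the repeated rows come from parallel relationships between the same pair of nodes, not from the query matching the same relationship in both directions.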



On Tuesday, October 14, 2014 at 22:54:43 UTC+2, Michael Hunger wrote:
>
> How many rows does this return?
>
> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1' 
> and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);
>
> your aggregation was only on the same paths, so you get 9 different paths 
> but you didn't show the counts per path. 
>
 
>

> and obtain 9 rows in 182799 ms
>
> On Tue, Oct 14, 2014 at 10:59 AM, gg4u <[email protected]> wrote:
>
>> Yes:
>>
>> neo4j-sh (?)$ profile  MATCH (n:Topic), (m:Topic) where n.name = 
>> 'Topic1' and m.name = 'Topic2'  MATCH  p = (n)-[*0..2]-(m) return p, 
>> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
>> AS pathProximity  order by pathProximity DESC  LIMIT 6;
>> ==> 
>> [...results...]
>> ==> 6 rows
>> ==> 
>> ==> ColumnFilter
>> ==>   |
>> ==>   +Top
>> ==>     |
>> ==>     +Extract
>> ==>       |
>> ==>       +ExtractPath
>> ==>         |
>> ==>         +PatternMatcher
>> ==>           |
>> ==>           +SchemaIndex(0)
>> ==>             |
>> ==>             +SchemaIndex(1)
>> ==> 
>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>> ==> |       Operator | Rows | DbHits |       Identifiers |                                           Other |
>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>> ==> |   ColumnFilter |    6 |      0 |                   |                   keep columns p, pathProximity |
>> ==> |            Top |    6 |      0 |                   | {  AUTOINT3}; Cached(pathProximity of type Any) |
>> ==> |        Extract |    9 |     36 |                   |                                   pathProximity |
>> ==> |    ExtractPath |    9 |      0 |                 p |                                                 |
>> ==> | PatternMatcher |    9 |      0 | n, m,   UNNAMED94 |                                                 |
>> ==> | SchemaIndex(0) |    1 |      2 |              m, m |                   {  AUTOSTRING1}; :Topic(name) |
>> ==> | SchemaIndex(1) |    1 |      2 |              n, n |                   {  AUTOSTRING0}; :Topic(name) |
>> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
>> ==> 
>> neo4j-sh (?)$ 
>>
>>
>>
>> On Tuesday, October 14, 2014 at 10:00:29 UTC+2, Michael Hunger wrote:
>>>
>>> Can you try this:
>>>
>>> profile 
>>> MATCH (n:Topic), (m:Topic)
>>>  where n.name = 'Topic1' and m.name = 'Topic2' 
>>> MATCH  p = (n)-[*0..2]-(m)
>>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
>>> n.proximity) AS pathProximity 
>>> order by pathProximity DESC 
>>> LIMIT 6
>>>
>>>
>>>
>>> On Tue, Oct 14, 2014 at 9:06 AM, gg4u <[email protected]> wrote:
>>>
>>>> Hi Rodger,
>>>>
>>>> thank you for your insights!
>>>> please see comments below:
>>>>
>>>> On Monday, October 13, 2014 at 18:37:50 UTC+2, Rodger wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I've done a lot of RDBMS performance tuning.
>>>>> Just a few quick thoughts.
>>>>>
>>>>>
>>>>> Be sure to run the queries in the shell, if you are not already doing 
>>>>> so.
>>>>>
>>>>>
>>>> Yes, they are run in the shell:
>>>> http://localhost:7474/webadmin/#/console/
>>>>  
>>>>
>>>>> How many rows are returned? Just sorting, then returning many rows, 
>>>>> takes a long time to scroll them to output. 
>>>>>
>>>>>
>>>>>
>>>> 9 rows
>>>> In the answer above, I wrote 9 paths
>>>>
>>>>  
>>>>
>>>>>
>>>>> If you are getting duplicates, it may be the equivalent of a cartesian 
>>>>> product, 
>>>>> one of the worst things that can happen in RDBMS, and also one
>>>>> of the least known. See my presentation on them here:
>>>>> http://rodgersnotes.wordpress.com/2010/09/15/stamping-out-cartesian-products/
>>>>>
>>>>
>>>> So I had a look at your PDF,
>>>> http://rodgersnotes.files.wordpress.com/2010/09/cartprodwordpress.pdf
>>>> page 11,
>>>>
>>>> and I think the idea you are suggesting is to avoid duplicates (you 
>>>> called them 'cartesian products') by enforcing conditions.
>>>> However, since this is a graph db and not a relational one, it is not clear 
>>>> to me where this applies, because in the graph db I don't have queries 
>>>> 'joined' between tables.
>>>> The conditions I have, at least in my case, are properties (with an index 
>>>> on properties) and non-directional rels.
>>>>  
>>>>
>>>>>
>>>>>
>>>>> Try:
>>>>>
>>>>> return p, count (*) 
>>>>> order by count(*)
>>>>>
>>>>
>>>> I ran:
>>>>
>>>> profile MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name 
>>>> = 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order 
>>>> by count(*);
>>>>
>>>> and I got the following (note that there are also duplicates in the paths: 
>>>> is it because I have both (a)-[]->(b) and (a)<-[]-(b)?)
>>>>
>>>> ==> +----------------------------------------------------------------------+
>>>> ==> | p | count(*) |
>>>> ==> +----------------------------------------------------------------------+
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[71185298]{proximity:68},Node[1401899]{id:21375850,name:"Topic3"},:P_Topic_Link[71185313]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[88675719]{proximity:28},Node[2594397]{id:31760062,name:"Topic4"},:P_Topic_Link[88675745]{proximity:23},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[30736000]{proximity:32},Node[2515502]{id:3106745,name:"Topic5"},:P_Topic_Link[30735974]{proximity:82},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[68206383]{proximity:72},Node[1202629]{id:19635605,name:"Topic6"},:P_Topic_Link[68206440]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[98898126]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[58107755]{proximity:55},Node[506613]{id:13841207,name:"Topic8"},:P_Topic_Link[58107766]{proximity:27},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[1025873]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>>>> ==> +----------------------------------------------------------------------+
>>>> ==> 9 rows
>>>> ==> 
>>>> ==> ColumnFilter(0)
>>>> ==>   |
>>>> ==>   +Sort
>>>> ==>     |
>>>> ==>     +EagerAggregation
>>>> ==>       |
>>>> ==>       +ColumnFilter(1)
>>>> ==>         |
>>>> ==>         +ExtractPath
>>>> ==>           |
>>>> ==>           +Filter
>>>> ==>             |
>>>> ==>             +TraversalMatcher
>>>> ==> 
>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                                            Other |
>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>> ==> |  ColumnFilter(0) |       9 |       0 |             |                                                         keep columns p, count(*) |
>>>> ==> |             Sort |       9 |       0 |             | Cached(  INTERNAL_AGGREGATE931614f3-4def-4fc4-a80b-c6fca3839817 of type Integer) |
>>>> ==> | EagerAggregation |       9 |       0 |             |                                                                                p |
>>>> ==> |  ColumnFilter(1) |       9 |       0 |             |                                                             keep columns p, n, m |
>>>> ==> |      ExtractPath |       9 |       0 |           p |                                                                                  |
>>>> ==> |           Filter |       9 | 3032385 |             |                (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
>>>> ==> | TraversalMatcher | 1010795 | 1024307 |             |                                                                m,   UNNAMED36, m |
>>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>>> ==> 
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Without me looking at the raw data, and the query result, you
>>>>> seem to have many operations going on. So, you have a lot of rows in 
>>>>> the profile output. 
>>>>>
>>>>
>>>> Only 9
>>>>  
>>>>
>>>>>  As a general rule, the more rows there are in the 
>>>>> profile, the slower the response time is. 
>>>>> ie. the more complex the query, the slower it is. 
>>>>>
>>>>>
>>>>> If I were looking at this, I would try to isolate which part of 
>>>>> the query is the slow part.  The Return clause, or the Match clause?
>>>>>
>>>>>
>>>>> You've already tried the response times with the data.
>>>>> Try to simply: 
>>>>> return count(*) .
>>>>>
>>>>
>>>> I ran:
>>>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 
>>>> 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order 
>>>> by count(*);
>>>>
>>>> and obtained 9 rows in 182799 ms.
>>>>
>>>> I ran:
>>>> MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 
>>>> 'Topic2' with n, m return count(*);
>>>>
>>>> and obtained the result in 856 ms.
>>>>
>>>>
>>>> profile MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name 
>>>> = 'Topic2' with n, m return count(*);
>>>>
>>>> results in:
>>>>
>>>>
>>>> ==> ColumnFilter
>>>> ==>   |
>>>> ==>   +EagerAggregation
>>>> ==>     |
>>>> ==>     +SchemaIndex(0)
>>>> ==>       |
>>>> ==>       +SchemaIndex(1)
>>>> ==> 
>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>> ==> |         Operator | Rows | DbHits | Identifiers |                         Other |
>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>> ==> |     ColumnFilter |    1 |      0 |             |         keep columns count(*) |
>>>> ==> | EagerAggregation |    1 |      0 |             |                               |
>>>> ==> |   SchemaIndex(0) |    1 |      2 |        m, m | {  AUTOSTRING1}; :Topic(name) |
>>>> ==> |   SchemaIndex(1) |    1 |      2 |        n, n | {  AUTOSTRING0}; :Topic(name) |
>>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>>  
>>>>
>>>>> How many seconds response time is that, versus the original query? 
>>>>> What is the resulting profile?
>>>>>
>>>>>
>>>>>
>>>>
>>>> So it looks like the huge time is actually spent traversing the graph,
>>>> while matching a node by its full name string takes a reasonable time 
>>>> (~900 ms).
>>>>
>>>> *Any ideas for improving traversal performance?*
>>>>
>>>> *It is a real problem, because I hit the same issue even when fetching 
>>>> the first neighbors of a node, which currently makes this unfeasible for 
>>>> production.*
>>>> *Has anyone with a real-world graph of similar size and structure tried 
>>>> to perform a similar query?*
>>>>
>>>> As an example, this query obtains the first neighbors of node Topic44:
>>>>
>>>> MATCH (n:Topic) , (m), p = (n)-[*0..1]-(m)
>>>> where n.name = 'Topic44' 
>>>> with p, n, m
>>>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity 
>>>> + n.proximity) AS pathProximity order by pathProximity DESC LIMIT 6
>>>>
>>>> It returns
>>>> 6 rows in ~65000 ms, vs. 6 rows in less than a second with a NoSQL store.
>>>>
>>>> Any ideas?
>>>>
>>>> Thank you guys for helping! I hope to find a solution soon.
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>> See also the tuning presentations I've done: 
>>>>> http://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/
>>>>> http://rodgersnotes.wordpress.com/2014/06/08/tuning-the-untunable-when-indexes-and-optimizer-dont-help-2/
>>>>> They are quick reads. 
>>>>>
>>>> Thank you, I have seen them;
>>>> they are mostly about SQL tuning.
>>>> I've just used the Neo4j structure to store a graph with the same label on 
>>>> 4M topics (I MUST keep it with one label), with an index on the Topic(name) 
>>>> property, and used Cypher to query the db.
>>>> This is my data structure. 
>>>>
>>>>> I've put a number of principles in there that you might 
>>>>> apply. 
>>>>> i.e. Could you create the NEO4J equivalent of a temp table?
>>>>>
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>>
>>>>> On Thursday, October 9, 2014 2:41:47 AM UTC-5, gg4u wrote:
>>>>>>
>>>>>> Hi Michael, thank you.
>>>>>> Sure, I'll post my profile result below!
>>>>>>
>>>>>>
>>>>>>>  -- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "Neo4j" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>
>

