Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

gg4u Tue, 14 Oct 2014 01:59:30 -0700

Yes:

neo4j-sh (?)$ profile  MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' 
and m.name = 'Topic2'  MATCH  p = (n)-[*0..2]-(m) return p, 
reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
AS pathProximity  order by pathProximity DESC  LIMIT 6;
==> 
[...results...]
==> 6 rows
==> 
==> ColumnFilter
==>   |
==>   +Top
==>     |
==>     +Extract
==>       |
==>       +ExtractPath
==>         |
==>         +PatternMatcher
==>           |
==>           +SchemaIndex(0)
==>             |
==>             +SchemaIndex(1)
==> 
==> 
+----------------+------+--------+-------------------+-------------------------------------------------+
==> |       Operator | Rows | DbHits |       Identifiers |                 
                          Other |
==> 
+----------------+------+--------+-------------------+-------------------------------------------------+
==> |   ColumnFilter |    6 |      0 |                   |                 
  keep columns p, pathProximity |
==> |            Top |    6 |      0 |                   | {  AUTOINT3}; 
Cached(pathProximity of type Any) |
==> |        Extract |    9 |     36 |                   |                 
                  pathProximity |
==> |    ExtractPath |    9 |      0 |                 p |                 
                                |
==> | PatternMatcher |    9 |      0 | n, m,   UNNAMED94 |                 
                                |
==> | SchemaIndex(0) |    1 |      2 |              m, m |                 
  {  AUTOSTRING1}; :Topic(name) |
==> | SchemaIndex(1) |    1 |      2 |              n, n |                 
  {  AUTOSTRING0}; :Topic(name) |
==> 
+----------------+------+--------+-------------------+-------------------------------------------------+
==> 
neo4j-sh (?)$




Il giorno martedì 14 ottobre 2014 10:00:29 UTC+2, Michael Hunger ha scritto:
>
> Can you try this:
>
> profile 
> MATCH (n:Topic), (m:Topic)
>  where n.name = 'Topic1' and m.name = 'Topic2' 
> MATCH  p = (n)-[*0..2]-(m)
> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity 
> order by pathProximity DESC 
> LIMIT 6
>
>
>
> On Tue, Oct 14, 2014 at 9:06 AM, gg4u <[email protected] <javascript:>> 
> wrote:
>
>> Hi Rodjer,
>>
>> thank you for your insights!
>> please see comments below:
>>
>> Il giorno lunedì 13 ottobre 2014 18:37:50 UTC+2, Rodger ha scritto:
>>>
>>> Hello,
>>>
>>> I've done a lot of RDBMS performance tuning.
>>> Just a few quick thoughts.
>>>
>>>
>>> Be sure to run the queries in the shell, if you are not already doing so.
>>>
>>>
>> Yes, they are run in the shell:
>> http://localhost:7474/webadmin/#/console/
>>  
>>
>>> How many rows are returned? Just sorting, then returning many rows, 
>>> takes a long time to scroll them to output. 
>>>
>>>
>>>
>> 9 rows
>> In the answer above, I wrote 9 paths
>>
>>  
>>
>>>
>>> If you are getting duplicates, it may be the equivalent of a cartesian 
>>> product, 
>>> one of the worst things that can happen in RDBMS, and also one
>>> of the least known. See my presentation on them here:
>>> http://rodgersnotes.wordpress.com/2010/09/15/stamping-out-
>>> cartesian-products/ 
>>> <http://www.google.com/url?q=http%3A%2F%2Frodgersnotes.wordpress.com%2F2010%2F09%2F15%2Fstamping-out-cartesian-products%2F&sa=D&sntz=1&usg=AFQjCNHJDOJ0IOsI6XRsg_9yuTscI4mqtQ>
>>>
>>
>> So I had a look at your pdf,
>> http://rodgersnotes.files.wordpress.com/2010/09/cartprodwordpress.pdf
>> page 11
>>
>> and I think the idea you want to suggest, is to avoid duplicates (you 
>> called them 'cartesian products') by enforcing conditions.
>> Though, since it is a graph db and not relational, not clear to me where 
>> this applies because in the graph db I don't have 'jointed' queries between 
>> tables,
>> so the conditions I have are, at least in my case, properties (index on 
>> properties), and no-directional rels.
>>  
>>
>>>
>>>
>>> Try:
>>>
>>> return p, count (*) 
>>> order by count(*)
>>>
>>
>> I run:
>>
>> profile MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 
>> 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by 
>> count(*);
>>
>> and I've got: (see there are also duplicates in paths: is it because I 
>> have both (a)-[]->(b) and (a)<-[]-(b) ?)
>>
>> ==> 
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>> ==> | p                                                                   
>>                                                                             
>>                                                                             
>>                 | count(*) |
>> ==> 
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[71185298]{proximity:68},Node[1401899]{id:21375850,name:"Topic3"},:P_Topic_Link[71185313]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                   | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[88675719]{proximity:28},Node[2594397]{id:31760062,name:"Topic4"},:P_Topic_Link[88675745]{proximity:23},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>           | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[30736000]{proximity:32},Node[2515502]{id:
>> 3106745,name:"Topic5"},:P_Topic_Link[30735974]{proximity:82},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>> | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[68206383]{proximity:72},Node[1202629]{id:19635605,name:"Topic6"},:P_Topic_Link[68206440]{proximity:32},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>              | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[98898126]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                        | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[58107755]{proximity:55},Node[506613]{id:13841207,name:"Topic8"},:P_Topic_Link[58107766]{proximity:27},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                             | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[98898173]{proximity:23},Node[3329750]{id:38567205,name:"Topic7"},:P_Topic_Link[1025873]{proximity:124},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                         | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                  | 1        |
>> ==> | 
>> [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic9"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}]
>>  
>>                  | 1        |
>> ==> 
>> +--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
>> ==> 9 rows
>> ==> 
>> ==> ColumnFilter(0)
>> ==>   |
>> ==>   +Sort
>> ==>     |
>> ==>     +EagerAggregation
>> ==>       |
>> ==>       +ColumnFilter(1)
>> ==>         |
>> ==>         +ExtractPath
>> ==>           |
>> ==>           +Filter
>> ==>             |
>> ==>             +TraversalMatcher
>> ==> 
>> ==> 
>> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>> ==> |         Operator |    Rows |  DbHits | Identifiers |               
>>                                                              Other |
>> ==> 
>> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>> ==> |  ColumnFilter(0) |       9 |       0 |             |               
>>                                           keep columns p, count(*) |
>> ==> |             Sort |       9 |       0 |             | Cached( 
>>  INTERNAL_AGGREGATE931614f3-4def-4fc4-a80b-c6fca3839817 of type Integer) |
>> ==> | EagerAggregation |       9 |       0 |             |               
>>                                                                  p |
>> ==> |  ColumnFilter(1) |       9 |       0 |             |               
>>                                               keep columns p, n, m |
>> ==> |      ExtractPath |       9 |       0 |           p |               
>>                                                                    |
>> ==> |           Filter |       9 | 3032385 |             |               
>>  (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
>> ==> | TraversalMatcher | 1010795 | 1024307 |             |               
>>                                                  m,   UNNAMED36, m |
>> ==> 
>> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>> ==> 
>>
>>>
>>>
>>>
>>> Without me looking at the raw data, and the query result, you
>>> seem to have many operations going on. So, you have a lot of rows in 
>>> the profile output. 
>>>
>>
>> Only 9
>>  
>>
>>>  As a general rule, the more rows there are in the 
>>> profile, the slower the response time is. 
>>> ie. the more complex the query, the slower it is. 
>>>
>>>
>>> If I were looking at this, I would try to isolate which part of 
>>> the query is the slow part.  The Return clause, or the Match clause?
>>>
>>>
>>> You've already tried the response times with the data.
>>> Try to simply: 
>>> return count(*) .
>>>
>>
>> I run:
>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1' 
>> and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);
>>
>> and obtain 9 rows in 182799 ms
>>
>> I run:
>> MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 'Topic2' 
>> with n, m return count(*);
>>
>> and obtain 856ms
>>
>>
>> profile MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name = 
>> 'Topic2' with n, m return count(*);
>>
>> results in:
>>
>>
>> ==> ColumnFilter
>> ==>   |
>> ==>   +EagerAggregation
>> ==>     |
>> ==>     +SchemaIndex(0)
>> ==>       |
>> ==>       +SchemaIndex(1)
>> ==> 
>> ==> 
>> +------------------+------+--------+-------------+-------------------------------+
>> ==> |         Operator | Rows | DbHits | Identifiers |                   
>>       Other |
>> ==> 
>> +------------------+------+--------+-------------+-------------------------------+
>> ==> |     ColumnFilter |    1 |      0 |             |         keep 
>> columns count(*) |
>> ==> | EagerAggregation |    1 |      0 |             |                   
>>             |
>> ==> |   SchemaIndex(0) |    1 |      2 |        m, m | {  AUTOSTRING1}; 
>> :Topic(name) |
>> ==> |   SchemaIndex(1) |    1 |      2 |        n, n | {  AUTOSTRING0}; 
>> :Topic(name) |
>> ==> 
>> +------------------+------+--------+-------------+-------------------------------+
>>  
>>
>>> How many seconds response time is that, versus the original query? 
>>> What is the resulting profile?
>>>
>>>
>>>
>>
>> So, it looks like it actually take huge time in traversing the graph,
>> while reasonable time '~900ms' to match a fullstring node.
>>
>> *Any idea for improving performance of traversal??*
>>
>> *It is a real problem, since also for getting results of first neighbors 
>> of a node, I met the same problem which makes currently unfeasible for 
>> production :*
>> *Anyone with real case of similar size graph and structure trying to 
>> perform a similar query?*
>>
>> as example, this query to obtain first neighbors of node Topic44:
>>
>> MATCH (n:Topic) , (m), p = (n)-[*0..1]-(m)
>> where n.name = 'Topic44' 
>> with p, n, m
>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
>> n.proximity) AS pathProximity order by pathProximity DESC LIMIT 6
>>
>> returns
>> 6 rows in ~65000 ms VS 6 rows in less than a second with a NoSQL.
>>
>> Any idea?
>>
>> thank you guys for helping!! Hope to find a solution soon..
>>
>>
>>
>>
>>>
>>>
>>> See also the tuning presentations I've done: 
>>> http://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/ 
>>> <http://www.google.com/url?q=http%3A%2F%2Frodgersnotes.wordpress.com%2F2010%2F09%2F14%2Foracle-performance-tuning%2F&sa=D&sntz=1&usg=AFQjCNE0XK_XcNk5YBj806h6a1OJHr0glA>
>>> http://rodgersnotes.wordpress.com/2014/06/08/tuning-the-
>>> untunable-when-indexes-and-optimizer-dont-help-2/ 
>>> <http://www.google.com/url?q=http%3A%2F%2Frodgersnotes.wordpress.com%2F2014%2F06%2F08%2Ftuning-the-untunable-when-indexes-and-optimizer-dont-help-2%2F&sa=D&sntz=1&usg=AFQjCNFgTfu5bnjPw6boHWttJpzQBtaNgw>
>>> They are quick reads. 
>>>
>>> thank you, seen them, 
>> they are about SQL tuning mostly:
>> I've just used neo4j strucutre to store a graph with same label on 4M 
>> topics (I MUST keep it with one label), index on topic(name) property and 
>> used cypher to query the db,
>> this is my data structure. 
>>
>> I've put a number of principles and principles in there, that you might 
>>> apply. 
>>> ie. Could you create the NEO4J equivalent of a temp table?
>>>
>>>
>>> Hope this helps.
>>>
>>>
>>> On Thursday, October 9, 2014 2:41:47 AM UTC-5, gg4u wrote:
>>>>
>>>> Hi Micheal, thank you.
>>>> sure I post my profile result here below !
>>>>
>>>>
>>>>>  -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected] <javascript:>.
>> For more options, visit https://groups.google.com/d/optout.
>>
>
>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/d/optout.

Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

Reply via email to