Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

Mark Findlater Thu, 16 Oct 2014 02:01:19 -0700

There is a lot of history here that I cannot follow, and Michael is clearly 
thinking about something, which suggests that the solution is not simple. But 
your profile (which reads bottom up) starts badly: its first step is a 
TraversalMatcher, so it isn't using your indexes. Unless I have missed some 
reason why you cannot do this, your very last query should perform (much) 
better if it begins with an index hit rather than a TraversalMatcher:


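// Look up both end nodes by name first, so the plan can begin with index hits
MATCH (n:Topic {name: "Topic66"}), (m:Topic {name: "Topic111"})
WITH n, m
MATCH p = (n)-[*..2]-(m)
// r (not n) as the reduce variable, so it does not shadow the start node n
RETURN p, reduce(totProximity = 0, r IN relationships(p) | totProximity + r.proximity) AS pathProximity
ORDER BY pathProximity;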
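
As a rough illustration of what the reduce(...) expression in the query above computes, here is a minimal Python sketch. Python is used only because it is easy to run outside the database; this is not how Neo4j evaluates the query. The proximity values are copied from result rows quoted further down in this thread, while the path_proximity helper and the shortened path labels are hypothetical names made up for the example.

```python
from functools import reduce

# Relationship proximities of two paths quoted later in this thread.
# The dictionary keys are shortened, made-up labels for those paths.
paths = {
    "Topic66-Topic113-Topic111": [189, 368],
    "Topic1-Topic3-Topic2": [47, 138],
}


def path_proximity(proximities):
    """Fold the relationship weights of one path into a single total."""
    return reduce(lambda tot, proximity: tot + proximity, proximities, 0)


totals = {label: path_proximity(weights) for label, weights in paths.items()}

# Ascending sort, mirroring "order by pathProximity" in the query.
ranked = sorted(totals.items(), key=lambda item: item[1])

for label, total in ranked:
    print(label, total)
```

The totals (557 and 185) match the pathProximity column in the quoted result rows, which is a quick sanity check that the fold sums one proximity per relationship in the path.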

Also, your assumption that the undirected pattern "would give unique results 
since paths are the same ... huh ?" is incorrect. The paths are not the 
same: the nodes in them may be, but the relationships (the traversal 
routes) are not. Is there any reason for you to duplicate all of your 
relationships, given that you can navigate them in either direction anyway?
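
To make the point about duplicates concrete, here is a small Python sketch of undirected path enumeration over a graph that stores a link in both directions. It is only a toy model under stated assumptions, not Neo4j's matcher: the relationship ids and proximities come from the two duplicate rows quoted below, but the directions of the relationships, the undirected_paths helper, and the tuple layout are all assumptions made for illustration.

```python
# Each relationship is (rel_id, start_node, end_node, proximity).
# Ids and proximities are taken from the quoted result rows; the
# directions are assumed, with 1113182 treated as the mirror of 68136874.
rels = [
    (68136903, "Topic66", "Topic113", 189),
    (68136874, "Topic113", "Topic111", 368),
    (1113182, "Topic111", "Topic113", 368),
]


def undirected_paths(rels, start, end, max_hops=2):
    """List paths from start to end as tuples of relationship ids.

    Direction is ignored, and a relationship is never used twice in
    the same path.
    """
    found = []

    def expand(node, used):
        if node == end and used:
            found.append(tuple(used))
        if len(used) == max_hops:
            return
        for rel_id, a, b, _ in rels:
            if rel_id in used:
                continue
            if a == node:
                expand(b, used + [rel_id])
            elif b == node:
                expand(a, used + [rel_id])

    expand(start, [])
    return found


with_mirror = undirected_paths(rels, "Topic66", "Topic111")
without_mirror = undirected_paths(rels[:2], "Topic66", "Topic111")

print(with_mirror)     # two paths through the same three nodes
print(without_mirror)  # one path once the mirrored relationship is gone
```

In this model both paths visit Topic66, Topic113 and Topic111, but they differ in the relationship used for the second hop, so they count as distinct paths. Storing each link once leaves a single path for an undirected pattern to find.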

Apologies if I have gone way off piste,

M


On Wednesday, 15 October 2014 23:12:35 UTC+1, gg4u wrote:
>
> Profile for the last query:
> profile MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and 
> m.name = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity order by 
> pathProximity;
>
> ==> 2411 rows
> ==> 
> ==> ColumnFilter(0)
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ColumnFilter(1)
> ==>         |
> ==>         +ExtractPath
> ==>           |
> ==>           +Filter
> ==>             |
> ==>             +TraversalMatcher
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |  ColumnFilter(0) |    2411 |       0 |             |                                     keep columns p, pathProximity |
> ==> |             Sort |    2411 |       0 |             |                                 Cached(pathProximity of type Any) |
> ==> |          Extract |    2411 |    9640 |             |                                                     pathProximity |
> ==> |  ColumnFilter(1) |    2411 |       0 |             |                                              keep columns p, n, m |
> ==> |      ExtractPath |    2411 |       0 |           p |                                                                   |
> ==> |           Filter |    2411 | 4910094 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1636698 | 1681810 |             |                                                 m,   UNNAMED19, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
>
> On Thursday, 16 October 2014 00:01:33 UTC+2, gg4u wrote:
>
> Sure, I tried three examples with (n), (n:Topic) and allShortestPaths(), 
> and also profiled them:
>
> 1.
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic1' and 
> m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity    order by 
> pathProximity DESC  LIMIT 6;
>
> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 185 |
> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 185 |
>
> ...
>
>
> *==> 6 rows*
> *==> 162423 ms*
>
>
> *profile* MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 
> 'Topic1' and m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity    order by 
> pathProximity DESC  LIMIT 6;
>
> ==> 6 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Top
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ExtractPath
> ==>         |
> ==>         +Filter
> ==>           |
> ==>           +TraversalMatcher
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
> ==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> ==> |          Extract |       9 |      36 |             |                                                     pathProximity |
> ==> |      ExtractPath |       9 |       0 |           p |                                                                   |
> ==> |           Filter |       9 | 3032385 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1010795 | 1024307 |             |                                                 m,   UNNAMED20, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> 
>
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic1' and m.name = 'Topic2' with p, n, m return p, reduce(totProximity 
> = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
> order by pathProximity;
>
> ==> 9 rows
> *==> 10111 ms*
>
>
> ==> 9 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ShortestPath
> ==>         |
> ==>         +SchemaIndex(0)
> ==>           |
> ==>           +SchemaIndex(1)
> ==> 
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |       Operator | Rows | DbHits | Identifiers |                             Other |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |   ColumnFilter |    9 |      0 |             |     keep columns p, pathProximity |
> ==> |           Sort |    9 |      0 |             | Cached(pathProximity of type Any) |
> ==> |        Extract |    9 |     36 |             |                     pathProximity |
> ==> |   ShortestPath |    9 |      0 |           p |                                   |
> ==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
> ==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
> ==> +----------------+------+--------+-------------+-----------------------------------+
>
>
> 2. 
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic44' and 
> m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity    order by 
> pathProximity DESC  LIMIT 6;
>
> ==> 6 rows
> *==> 906108 ms*
>
>
>
> ==> 6 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Top
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ExtractPath
> ==>         |
> ==>         +Filter
> ==>           |
> ==>           +TraversalMatcher
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
> ==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> ==> |          Extract |      67 |     268 |             |                                                     pathProximity |
> ==> |      ExtractPath |      67 |       0 |           p |                                                                   |
> ==> |           Filter |      67 | 3246003 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1082001 | 1097166 |             |                                                 m,   UNNAMED20, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
>
>
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic44' and m.name = 'Topic2' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
>
> magically, and for the first time:
> *146ms*
>
>
> so:
>
> profile MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where 
> n.name = 'Topic44' and m.name = 'Topic2' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
>
> ==> 67 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ShortestPath
> ==>         |
> ==>         +SchemaIndex(0)
> ==>           |
> ==>           +SchemaIndex(1)
> ==> 
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |       Operator | Rows | DbHits | Identifiers |                             Other |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |   ColumnFilter |   67 |      0 |             |     keep columns p, pathProximity |
> ==> |           Sort |   67 |      0 |             | Cached(pathProximity of type Any) |
> ==> |        Extract |   67 |    268 |             |                     pathProximity |
> ==> |   ShortestPath |   67 |      0 |           p |                                   |
> ==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
> ==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> 
>
>
>
>
> 3. 
> So I tried:
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic66' and m.name = 'Topic111' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
> 2 rows
> 34337 ms
>
> and 
>
> MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and m.name 
> = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity order by 
> pathProximity;
>
> *2411 rows*
> *3228423 ms !!*
>
> Please also note that each row has a duplicate. In my structure I do have 
> both (a:Topic)-[]->(b:Topic) and (b:Topic)-[]->(a:Topic), but I thought 
> that (a:Topic)-[]-(b:Topic) would give unique results since paths are the 
> same ... huh ?
> ...
> ==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[68136874]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}] | 557 |
> ==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[1113182]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}] | 557 |
>
>
>
>
> So it seems that **allShortestPaths()** is faster and gives **almost** the 
> wanted results, but **only** if similar searches were made previously 
> (cached). Could that be true?
> It would partially make sense: I expect graph algorithms to be faster than 
> retrieving general paths, but retrieving 67 rows of general paths cannot be 
> that slow... (906108 ms vs 146 ms, thousands of times slower than 
> allShortestPaths() ??)
>
> Would it make sense if I posted a Python script to generate a random 
> structure similar to the one I have, posted again the configuration files 
> used for my server and batch-importer, and posted the header I used for 
> loading the CSV with the batch importer, so that you could tell me whether 
> the response time can be under 1s (production time)?
> Could you try the same tests and post the results and a step-by-step guide?
>
>
>
>
>
> On Wednesday, 15 October 2014 21:56:01 UTC+2, Michael Hunger wrote:
>
> Can you just try this please?
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
>  where n.name = 'Topic1' and m.name = 'Topic2'  
>  return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity  
>  order by pathProximity DESC  LIMIT 6;
>
>
>
> On Wed, Oct 15, 2014 at 2:52 PM, gg4u <[email protected]> wrote:
>
> Hi Michael,
>
> sorry, I don't understand what that means.
> Can I help you to help me sort out the issue somehow? :)
>
> What could I check or correct?
> What is a pattern matcher, and can you teach me how to read the profile to 
> reach your conclusion?
> What are possible reasons for selecting the wrong pattern matcher, and how 
> can I correct it?
>
> thank you
>
> On Wednesday, 15 October 2014 14:04:57 UTC+2, Michael Hunger wrote:
>
> Hi,
>
> from the profiling it seems that Cypher selects the wrong pattern matcher 
> if we separate the node-lookup and path-match.
>
> profile
>  MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
>  where n.name = 'Topic1' and m.name = 'Topic2'  
>  return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity  
>  order by pathProximity DESC  LIMIT 6;
>
>
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
> |         Operator | Rows | DbHits | Identifiers |                                                             Other |
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
> |     ColumnFilter |    0 |      0 |             |                                     keep columns p, pathProximity |
> |              Top |    0 |      0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> |          Extract |    0 |      0 |             |                                                     pathProximity |
> |      ExtractPath |    0 |      0 |           p |                                                                   |
> |           Filter |    0 |      0 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> | TraversalMatcher |    0 |      1 |             |                                                 m,   UNNAMED20, m |
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
>
> On Wed, Oct 15, 2014 at 11:00 AM, gg4u <[email protected]> wrote:
>
> Hi Michael, 
>
> your aggregation was only on the same paths, so you get 9 different paths 
> but you didn't show the counts per path. 
>
>
> not clear to me yet; I am going to post results for each query you 
> suggested trying.
>
> Rodger, to summarize a description of this test:
> 4M nodes labeled 'Topic'
> 100M rels (weighted)
> Index on Topic(name), a string property on each node
> 'Topic' dominates the whole dataset, and this will be a subgraph of a larger 
> network (if I can get this to production response times, the next step will 
> be a graph of 85M nodes, ~2B rels, with the same type of structure, keeping 
> properties as node properties and not decoupling them into other nodes). So 
> this is a primary, real-case test to see whether using the Neo4j data 
> structure is feasible vs. NoSQL.
> And I'd love the answer to be yes :D
>
> Michael, here is another test with other topics (I think not cached):
>
> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic100' 
> and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);
>
> results:
> ==> +-----------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ---------------+
> ==> | p                                                                   
>                                                                             
>                                                                             
>                       | count(*) |
> ==> +-----------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ---------------+
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[10618620]{proximity:90},Node[3528892]{id:411782,name:"Topic101"},:P_Topic_Link[1025954]{proximity:68},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[2424845]{proximity:91},Node[3719110]{id:52502,name:"Topic102"},:P_Topic_Link[1025923]{proximity:85},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[100682940]{proximity:19},Node[3461206]{id:39782569,name:"Topic103"},:P_Topic_Link[100682931]{proximity:107},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
> ==> | [Node[4114904]{id:7955,name:"Topic100"},:P_Topic_Link[21653222]{proximity:82},Node[706102]{id:1551073,name:"Topic104"},:P_Topic_Link[21653218]{proximity:87},Node[1386672]{id:21245,name:"Topic2"}] | 1 |
>
> (.... results ...)
>  
> ==> +-----------------------------------------------------------<
>
> ...

