Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

gg4u Fri, 07 Nov 2014 07:36:49 -0800

Thank you Mark:

You're right, this thread has become hard to follow, and the issue is still open.
I will re-import everything, since I haven't found a solution:
maybe I am doing something wrong when importing and creating the indexes?


I have also put together a Python script that generates a random weighted graph with 
textual labels, as a test.
I'd love to hear what other people find out... :))

Here's my contribution:
https://groups.google.com/forum/#!topic/neo4j/UyqzNZwlKU4
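
For readers who just want the shape of such a test set, here is a minimal sketch of a generator. It is not the script linked above: the column headers, the weight range, the seed and the function names are assumptions made for illustration, and the headers would need adjusting to whatever your importer expects. Only the "TopicN" naming, the P_Topic_Link type and the proximity property come from this thread.

```python
import csv
import io
import random


def generate_graph(num_nodes, num_rels, max_weight=1000, seed=42):
    """Build a random weighted graph as two lists of rows.

    Node names follow the "TopicN" pattern used in this thread; the
    weight range and the seed are arbitrary assumptions.
    """
    rng = random.Random(seed)
    nodes = [(i, "Topic%d" % i) for i in range(1, num_nodes + 1)]
    rels = []
    for _ in range(num_rels):
        # sample() picks two distinct ids, so there are no self-loops
        start, end = rng.sample(range(1, num_nodes + 1), 2)
        rels.append((start, end, "P_Topic_Link", rng.randint(1, max_weight)))
    return nodes, rels


def write_rows(out, header, rows):
    """Write tab-separated rows to any file-like object."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)


nodes, rels = generate_graph(num_nodes=100, num_rels=500)

# StringIO keeps the example self-contained;
# open("nodes.csv", "w", newline="") works the same way for real files.
nodes_out, rels_out = io.StringIO(), io.StringIO()
write_rows(nodes_out, ["id", "name"], nodes)
write_rows(rels_out, ["start", "end", "type", "proximity"], rels)

print(nodes_out.getvalue().splitlines()[:3])
print(len(rels), "relationships generated")
```

Scaling num_nodes and num_rels up towards the 4M / 100M figures quoted below would call for streaming rows straight to disk instead of building lists in memory.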


On Thursday, 16 October 2014 11:23:03 UTC+2, Mark Findlater wrote:
>
> There is a lot of history here that I cannot follow, and Michael is 
> clearly thinking about something which means that the solution is not 
> simple, but your profile (which reads bottom up) does not start well and 
> isn't using your indexes. Unless I have missed something somewhere about 
> why you cannot do this your very last query should perform (much) better if 
> it begins with an Index hit rather than TraversalMatcher.
>
> MATCH (n:Topic{name:"Topic66"}), (m:Topic{name:"Topic111"})
> WITH n, m 
> MATCH p = (n)-[*..2]-(m)
> WITH p, n, m 
> RETURN p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity order by pathProximity;
>
> Also, your assertion "would give unique results since paths are the same 
> ... huh ?" is incorrect, because the paths are not the same, the nodes in 
> the paths may be but the relationships/traversal routes are not. Is there 
> any reason for you to duplicate all of your relationships (given you can 
> navigate them in either direction anyway)?
>
> Apologies if I have gone way off piste,
>
> M
>
> On Wednesday, 15 October 2014 23:12:35 UTC+1, gg4u wrote:
>
> Profile for the last query:
> profile MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and 
> m.name = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity order by 
> pathProximity;
>
> ==> 2411 rows
> ==> 
> ==> ColumnFilter(0)
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ColumnFilter(1)
> ==>         |
> ==>         +ExtractPath
> ==>           |
> ==>           +Filter
> ==>             |
> ==>             +TraversalMatcher
> ==> 
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |  ColumnFilter(0) |    2411 |       0 |             |                                     keep columns p, pathProximity |
> ==> |             Sort |    2411 |       0 |             |                                 Cached(pathProximity of type Any) |
> ==> |          Extract |    2411 |  *9640* |             |                                                     pathProximity |
> ==> |  ColumnFilter(1) |    2411 |       0 |             |                                              keep columns p, n, m |
> ==> |      ExtractPath |    2411 |       0 |           p |                                                                   |
> ==> |           Filter |    2411 | 4910094 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1636698 | 1681810 |             |                                                 m,   UNNAMED19, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
>
> On Thursday, 16 October 2014 00:01:33 UTC+2, gg4u wrote:
>
> Sure, I tried three examples with (n), (n:Topic) and allShortestPaths(), and 
> also profiled them:
>
> 1.
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 
> 'Topic1' and m.name = 'Topic2'    return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity    order by pathProximity DESC  LIMIT 6;
>
> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[5662565]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 185           |
> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[5662626]{proximity:47},Node[736816]{id:157427,name:"Topic3"},:P_Topic_Link[1025864]{proximity:138},Node[1386672]{id:21245,name:"Topic2"}] | 185           |
>
> ...
>
>
> *==> 6 rows*
> *==> 162423 ms*
>
>
> *profile* MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 
> 'Topic1' and m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity    order by 
> pathProximity DESC  LIMIT 6;
>
> ==> 6 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Top
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ExtractPath
> ==>         |
> ==>         +Filter
> ==>           |
> ==>           +TraversalMatcher
> ==> 
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
> ==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> ==> |          Extract |       9 |      36 |             |                                                     pathProximity |
> ==> |      ExtractPath |       9 |       0 |           p |                                                                   |
> ==> |           Filter |       9 | 3032385 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1010795 | 1024307 |             |                                                 m,   UNNAMED20, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> 
>
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic1' and m.name = 'Topic2' with p, n, m return p, reduce(totProximity 
> = 0, n IN relationships(p)| totProximity + n.proximity) AS pathProximity 
> order by pathProximity;
>
> ==> 9 rows
> *==> 10111 ms*
>
>
> ==> 9 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ShortestPath
> ==>         |
> ==>         +SchemaIndex(0)
> ==>           |
> ==>           +SchemaIndex(1)
> ==> 
> ==> 
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |       Operator | Rows | DbHits | Identifiers |                             Other |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |   ColumnFilter |    9 |      0 |             |     keep columns p, pathProximity |
> ==> |           Sort |    9 |      0 |             | Cached(pathProximity of type Any) |
> ==> |        Extract |    9 |     36 |             |                     pathProximity |
> ==> |   ShortestPath |    9 |      0 |           p |                                   |
> ==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
> ==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
> ==> +----------------+------+--------+-------------+-----------------------------------+
>
>
> 2. 
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic)   where n.name = 'Topic44' and 
> m.name = 'Topic2'    return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity    order by 
> pathProximity DESC  LIMIT 6;
>
> ==> 6 rows
> *==> 906108 ms*
>
>
>
> ==> 6 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Top
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ExtractPath
> ==>         |
> ==>         +Filter
> ==>           |
> ==>           +TraversalMatcher
> ==> 
> ==> 
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                             Other |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
> ==> |     ColumnFilter |       6 |       0 |             |                                     keep columns p, pathProximity |
> ==> |              Top |       6 |       0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> ==> |          Extract |      67 |     268 |             |                                                     pathProximity |
> ==> |      ExtractPath |      67 |       0 |           p |                                                                   |
> ==> |           Filter |      67 | 3246003 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> ==> | TraversalMatcher | 1082001 | 1097166 |             |                                                 m,   UNNAMED20, m |
> ==> +------------------+---------+---------+-------------+-------------------------------------------------------------------+
>
>
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic44' and m.name = 'Topic2' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
>
> magically, and for the first time:
> *146ms*
>
>
> so:
>
> profile MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where 
> n.name = 'Topic44' and m.name = 'Topic2' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
>
> ==> 67 rows
> ==> 
> ==> ColumnFilter
> ==>   |
> ==>   +Sort
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ShortestPath
> ==>         |
> ==>         +SchemaIndex(0)
> ==>           |
> ==>           +SchemaIndex(1)
> ==> 
> ==> 
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |       Operator | Rows | DbHits | Identifiers |                             Other |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> |   ColumnFilter |   67 |      0 |             |     keep columns p, pathProximity |
> ==> |           Sort |   67 |      0 |             | Cached(pathProximity of type Any) |
> ==> |        Extract |   67 |    268 |             |                     pathProximity |
> ==> |   ShortestPath |   67 |      0 |           p |                                   |
> ==> | SchemaIndex(0) |    1 |      2 |        m, m |     {  AUTOSTRING1}; :Topic(name) |
> ==> | SchemaIndex(1) |    1 |      2 |        n, n |     {  AUTOSTRING0}; :Topic(name) |
> ==> +----------------+------+--------+-------------+-----------------------------------+
> ==> 
>
>
>
>
> 3. 
> So I tried:
>
> MATCH p = *allShortestPaths*((n:Topic)-[*..2]-(m:Topic)) where n.name = 
> 'Topic66' and m.name = 'Topic111' with p, n, m return p, 
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity) 
> AS pathProximity order by pathProximity;
>
> 2 rows
> 34337 ms
>
> and 
>
> MATCH p = (n:Topic)-[*..2]-(m:Topic) where n.name = 'Topic66' and m.name 
> = 'Topic111' with p, n, m return p, reduce(totProximity = 0, n IN 
> relationships(p)| totProximity + n.proximity) AS pathProximity order by 
> pathProximity;
>
> *2411 rows*
> *3228423 ms !!*
>
> Please also note that each row has a duplicate
> (in my structure I do have (a:Topic)-[]->(b:Topic) and 
> (b:Topic)-[]->(a:Topic), but I thought that (a:Topic)-[]-(b:Topic) would 
> give unique results since paths are the same ... huh ?)
> ...
> ==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[68136874]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}] | 557           |
> ==> | [Node[1103460]{id:18831,name:"Topic66"},:P_Topic_Link[68136903]{proximity:189},Node[1198508]{id:19594028,name:"Topic113"},:P_Topic_Link[1113182]{proximity:368},Node[1603710]{id:22939,name:"Topic111"}] | 557           |
>
>
>
>
> So it seems that allShortestPaths() gives faster times and almost the 
> wanted results only if the same searches were made previously (cached). Can 
> that be true?
> It would partially make sense: I expect graph algorithms to be faster than 
> retrieving general paths, but retrieving 67 rows of general paths cannot 
> be that slow... (906108 ms vs 146 ms, more than three orders of magnitude 
> slower than allShortestPaths()?)
>
> Would it make sense if I posted a Python script that generates a random 
> structure similar to the one I have, together with the configuration files 
> used for my server and the batch importer and the header I used for loading 
> the CSV with the batch importer, so that you could tell me whether the 
> response time can get under 1 s (production time)?
> You could then try the same tests and post the results and a step-by-step guide?
>
>
>
>
>
> On Wednesday, 15 October 2014 21:56:01 UTC+2, Michael Hunger wrote:
>
> Can you just try this please?
>
> MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
>  where n.name = 'Topic1' and m.name = 'Topic2'  
>  return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity  
>  order by pathProximity DESC  LIMIT 6;
>
>
>
> On Wed, Oct 15, 2014 at 2:52 PM, gg4u <[email protected]> wrote:
>
> Hi Michael,
>
> sorry, I don't understand what that means.
> Can I somehow help you to help me sort out the issue? :)
>
> What could I check or correct?
> What is a pattern matcher, and can you teach me how to read the profile to 
> reach your conclusion?
> What are the possible reasons for selecting the wrong pattern matcher, and 
> how can I correct it?
>
> thank you
>
> On Wednesday, 15 October 2014 14:04:57 UTC+2, Michael Hunger wrote:
>
> Hi,
>
> from the profiling it seems that Cypher selects the wrong pattern matcher 
> if we separate the node-lookup and path-match.
>
> profile
>  MATCH  p = (n:Topic)-[*0..2]-(m:Topic) 
>  where n.name = 'Topic1' and m.name = 'Topic2'  
>  return p, reduce(totProximity = 0, n IN relationships(p)| totProximity + 
> n.proximity) AS pathProximity  
>  order by pathProximity DESC  LIMIT 6;
>
>
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
> |         Operator | Rows | DbHits | Identifiers |                                                             Other |
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
> |     ColumnFilter |    0 |      0 |             |                                     keep columns p, pathProximity |
> |              Top |    0 |      0 |             |                   {  AUTOINT3}; Cached(pathProximity of type Any) |
> |          Extract |    0 |      0 |             |                                                     pathProximity |
> |      ExtractPath |    0 |      0 |           p |                                                                   |
> |           Filter |    0 |      0 |             | (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
> | TraversalMatcher |    0 |      1 |             |                                                 m,   UNNAMED20, m |
> +------------------+------+--------+-------------+-------------------------------------------------------------------+
>
> On Wed, Oct 15, 2014 at 11:00 AM, gg4u <[email protected]> wrote:
>
> Hi Michael, 
>
> your aggregation was only on the same paths, so you get 9 different paths 
> but you didn't show the counts per path. 
>
>
> This is not clear to me yet; I am going to post the results for each query 
> you suggested trying.
>
> Rodger, to summarize this test:
> 4M nodes labeled 'Topic'
> 100M rels (weighted)
> Index on Topic(name), which is a string property on every node
> 'Topic' dominates the whole dataset, and this will be a subgraph of a larger 
> network (if I can get this working at production speed, the next step will 
> be a graph of 85M nodes and ~2B rels with the same kind of structure, 
> storing properties on the nodes themselves rather than decoupling them into 
> other nodes). So this is a first real-world test to see whether using the 
> Neo4j data structure is feasible compared with NoSQL.
> And I'd love the answer to be yes :D
>
> Michael, here is another test with other topics (not cached, I think):
>
> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 
> 'Topic100' and m.name = 'Topic2' with p, n, m return p, count(*) order 
> by count(*);
>
> results:
> ==> +-----------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ------------------------------------------------------------
> ---------------+
> ==> | p                                                                   
>                                                                             
>                                                                             
>                       | count(*) |
> ==> +-----------------------------------------------------------
> ----------------------------------------
>
> ...
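
On the duplicate rows discussed in the quoted thread: the small sketch below tries to reproduce the effect outside Neo4j. It is only an illustration under assumptions: the toy relationship ids and helper names are made up, and the traversal is a plain depth-first enumeration that never reuses a relationship within one path, not Cypher's actual matcher. When every link is stored once per direction, an undirected match sees two distinct relationships per hop, so the same node sequence comes back several times with the same pathProximity.

```python
from collections import defaultdict

# (rel_id, start, end, proximity): each link is stored once per direction,
# mirroring the (a)-[]->(b) plus (b)-[]->(a) structure described above.
# The ids are toy values; the weights are taken from the sample rows.
RELS = [
    (1, "Topic66", "Topic113", 189),
    (2, "Topic113", "Topic66", 189),
    (3, "Topic113", "Topic111", 368),
    (4, "Topic111", "Topic113", 368),
]


def undirected_adjacency(rels):
    """Index relationships by both endpoints, ignoring direction."""
    adj = defaultdict(list)
    for rel_id, start, end, weight in rels:
        adj[start].append((rel_id, end, weight))
        adj[end].append((rel_id, start, weight))
    return adj


def paths_up_to(rels, source, target, max_hops):
    """List (relationship ids, summed proximity) for each path found."""
    adj = undirected_adjacency(rels)
    found = []

    def walk(node, used, total):
        if node == target and used:
            found.append((tuple(used), total))
        if len(used) == max_hops:
            return
        for rel_id, nxt, weight in adj[node]:
            if rel_id not in used:  # a relationship is used once per path
                walk(nxt, used + [rel_id], total + weight)

    walk(source, [], 0)
    return found


def dedupe(rels):
    """Keep one relationship per unordered pair of nodes."""
    kept = {}
    for rel in rels:
        kept.setdefault(frozenset((rel[1], rel[2])), rel)
    return list(kept.values())


with_duplicates = paths_up_to(RELS, "Topic66", "Topic111", 2)
without_duplicates = paths_up_to(dedupe(RELS), "Topic66", "Topic111", 2)
print(len(with_duplicates), "paths with mirrored relationships")
print(len(without_duplicates), "path after keeping one relationship per pair")
```

With this toy data the mirrored structure yields four paths over the same three nodes, all with a summed proximity of 557, while the de-duplicated one yields a single path. This is the point Mark makes above: the nodes are the same, but the relationships traversed are not.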
