Re: [Neo4j] Traversing Large (weighted) graphs: performance, data structure, indexes

Michael Hunger Tue, 14 Oct 2014 13:55:44 -0700

How many rows does this return?

MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name = 'Topic1'
and m.name = 'Topic2' with p, n, m return p, count(*) order by count(*);


Your aggregation was only over identical paths, so you get 9 different paths,
but you didn't show the counts per path.
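
To get counts that can exceed 1, the grouping key has to be coarser than the
path itself. A minimal, untested sketch, assuming the goal is to count paths
per sequence of topic names. The aliases topics and paths are made up for
illustration, and extract() is assumed to be available as in Cypher 2.x:

```cypher
// Sketch only: group by the sequence of node names instead of by the whole
// path, so parallel relationships between the same nodes share one group.
MATCH (n:Topic), (m:Topic)
WHERE n.name = 'Topic1' AND m.name = 'Topic2'
MATCH p = (n)-[*0..2]-(m)
RETURN extract(x IN nodes(p) | x.name) AS topics, count(*) AS paths
ORDER BY paths DESC;
```

On the nine paths quoted below, this should fold the two Topic7 paths and the
two Topic9 paths into one row each, since they differ only in the second
relationship.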

and obtain 9 rows in 182799 ms
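
The two profiles quoted below differ mainly in their starting operator: the
single-MATCH form begins with a TraversalMatcher that produces 1010795 rows,
while the two-MATCH form starts from two SchemaIndex lookups and reaches the
same 9 paths with 36 db hits in Extract. A hedged sketch of the same split
applied to the first-neighbour query from this thread follows. It is untested,
and whether the planner actually uses the :Topic(name) index here would have
to be confirmed with profile:

```cypher
// Sketch only: resolve the start node in its own MATCH, then expand from it.
// The reduce variable is named r (an illustrative rename) so it does not
// shadow the node identifier n, as it does in the original query.
MATCH (n:Topic)
WHERE n.name = 'Topic44'
MATCH p = (n)-[*0..1]-(m)
RETURN p,
       reduce(totProximity = 0, r IN relationships(p) | totProximity + r.proximity) AS pathProximity
ORDER BY pathProximity DESC
LIMIT 6;
```

Note that [*0..1] also matches the zero-length path from n to itself, whose
pathProximity is 0.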

On Tue, Oct 14, 2014 at 10:59 AM, gg4u <[email protected]> wrote:

> Yes:
>
> neo4j-sh (?)$ profile  MATCH (n:Topic), (m:Topic) where n.name = 'Topic1'
> and m.name = 'Topic2'  MATCH  p = (n)-[*0..2]-(m) return p,
> reduce(totProximity = 0, n IN relationships(p)| totProximity + n.proximity)
> AS pathProximity  order by pathProximity DESC  LIMIT 6;
> ==>
> [...results...]
> ==> 6 rows
> ==>
> ==> ColumnFilter
> ==>   |
> ==>   +Top
> ==>     |
> ==>     +Extract
> ==>       |
> ==>       +ExtractPath
> ==>         |
> ==>         +PatternMatcher
> ==>           |
> ==>           +SchemaIndex(0)
> ==>             |
> ==>             +SchemaIndex(1)
> ==>
> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
> ==> |       Operator | Rows | DbHits |       Identifiers |                                           Other |
> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
> ==> |   ColumnFilter |    6 |      0 |                   |                   keep columns p, pathProximity |
> ==> |            Top |    6 |      0 |                   | {  AUTOINT3}; Cached(pathProximity of type Any) |
> ==> |        Extract |    9 |     36 |                   |                                   pathProximity |
> ==> |    ExtractPath |    9 |      0 |                 p |                                                 |
> ==> | PatternMatcher |    9 |      0 | n, m,   UNNAMED94 |                                                 |
> ==> | SchemaIndex(0) |    1 |      2 |              m, m |                   {  AUTOSTRING1}; :Topic(name) |
> ==> | SchemaIndex(1) |    1 |      2 |              n, n |                   {  AUTOSTRING0}; :Topic(name) |
> ==> +----------------+------+--------+-------------------+-------------------------------------------------+
> ==>
> neo4j-sh (?)$
>
>
>
> On Tuesday, 14 October 2014 10:00:29 UTC+2, Michael Hunger wrote:
>>
>> Can you try this:
>>
>> profile
>> MATCH (n:Topic), (m:Topic)
>>  where n.name = 'Topic1' and m.name = 'Topic2'
>> MATCH  p = (n)-[*0..2]-(m)
>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity +
>> n.proximity) AS pathProximity
>> order by pathProximity DESC
>> LIMIT 6
>>
>>
>>
>> On Tue, Oct 14, 2014 at 9:06 AM, gg4u <[email protected]> wrote:
>>
>>> Hi Rodger,
>>>
>>> thank you for your insights!
>>> please see comments below:
>>>
>>> On Monday, 13 October 2014 18:37:50 UTC+2, Rodger wrote:
>>>>
>>>> Hello,
>>>>
>>>> I've done a lot of RDBMS performance tuning.
>>>> Just a few quick thoughts.
>>>>
>>>>
>>>> Be sure to run the queries in the shell, if you are not already doing
>>>> so.
>>>>
>>>>
>>> Yes, they are run in the shell:
>>> http://localhost:7474/webadmin/#/console/
>>>
>>>
>>>> How many rows are returned? Just sorting, then returning many rows,
>>>> takes a long time to scroll them to output.
>>>>
>>>>
>>>>
>>> 9 rows
>>> In the answer above, I wrote 9 paths
>>>
>>>
>>>
>>>>
>>>> If you are getting duplicates, it may be the equivalent of a cartesian
>>>> product,
>>>> one of the worst things that can happen in RDBMS, and also one
>>>> of the least known. See my presentation on them here:
>>>> http://rodgersnotes.wordpress.com/2010/09/15/stamping-out-cartesian-products/
>>>>
>>>
>>> So I had a look at your PDF,
>>> http://rodgersnotes.files.wordpress.com/2010/09/cartprodwordpress.pdf
>>> page 11,
>>>
>>> and I think the idea you are suggesting is to avoid duplicates (you
>>> called them 'cartesian products') by enforcing conditions.
>>> However, since this is a graph db and not a relational one, it is not
>>> clear to me where this applies: in the graph db I don't have joined
>>> queries between tables, so the only conditions I have, at least in my
>>> case, are properties (with an index on them) and undirected rels.
>>>
>>>
>>>>
>>>>
>>>> Try:
>>>>
>>>> return p, count (*)
>>>> order by count(*)
>>>>
>>>
>>> I ran:
>>>
>>> profile MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name =
>>> 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by
>>> count(*);
>>>
>>> and I got the following (note there are also duplicate paths: is it
>>> because I have both (a)-[]->(b) and (a)<-[]-(b)?)
>>>
>>> ==> +-----------------------------------------------------------
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> ---------------------------------------------------------------------+
>>> ==> | p
>>>
>>>
>>>                   | count(*) |
>>> ==> +-----------------------------------------------------------
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> ---------------------------------------------------------------------+
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 71185298]{proximity:68},Node[1401899]{id:21375850,name:"
>>> Topic3"},:P_Topic_Link[71185313]{proximity:32},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                   | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 88675719]{proximity:28},Node[2594397]{id:31760062,name:"
>>> Topic4"},:P_Topic_Link[88675745]{proximity:23},Node[
>>> 1386672]{id:21245,name:"Topic2"}]           | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 30736000]{proximity:32},Node[2515502]{id:3106745,name:"
>>> Topic5"},:P_Topic_Link[30735974]{proximity:82},Node[
>>> 1386672]{id:21245,name:"Topic2"}] | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 68206383]{proximity:72},Node[1202629]{id:19635605,name:"
>>> Topic6"},:P_Topic_Link[68206440]{proximity:32},Node[
>>> 1386672]{id:21245,name:"Topic2"}]              | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 98898173]{proximity:23},Node[3329750]{id:38567205,name:"
>>> Topic7"},:P_Topic_Link[98898126]{proximity:124},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                        | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 58107755]{proximity:55},Node[506613]{id:13841207,name:"
>>> Topic8"},:P_Topic_Link[58107766]{proximity:27},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                             | 1
>>>  |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 98898173]{proximity:23},Node[3329750]{id:38567205,name:"
>>> Topic7"},:P_Topic_Link[1025873]{proximity:124},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                         | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 5662626]{proximity:47},Node[736816]{id:157427,name:"
>>> Topic9"},:P_Topic_Link[5662565]{proximity:138},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                  | 1        |
>>> ==> | [Node[103105]{id:1092923,name:"Topic1"},:P_Topic_Link[
>>> 5662626]{proximity:47},Node[736816]{id:157427,name:"
>>> Topic9"},:P_Topic_Link[1025864]{proximity:138},Node[
>>> 1386672]{id:21245,name:"Topic2"}]                  | 1        |
>>> ==> +-----------------------------------------------------------
>>> ------------------------------------------------------------
>>> ------------------------------------------------------------
>>> ---------------------------------------------------------------------+
>>> ==> 9 rows
>>> ==>
>>> ==> ColumnFilter(0)
>>> ==>   |
>>> ==>   +Sort
>>> ==>     |
>>> ==>     +EagerAggregation
>>> ==>       |
>>> ==>       +ColumnFilter(1)
>>> ==>         |
>>> ==>         +ExtractPath
>>> ==>           |
>>> ==>           +Filter
>>> ==>             |
>>> ==>             +TraversalMatcher
>>> ==>
>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>> ==> |         Operator |    Rows |  DbHits | Identifiers |                                                                            Other |
>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>> ==> |  ColumnFilter(0) |       9 |       0 |             |                                                         keep columns p, count(*) |
>>> ==> |             Sort |       9 |       0 |             | Cached(  INTERNAL_AGGREGATE931614f3-4def-4fc4-a80b-c6fca3839817 of type Integer) |
>>> ==> | EagerAggregation |       9 |       0 |             |                                                                                p |
>>> ==> |  ColumnFilter(1) |       9 |       0 |             |                                                             keep columns p, n, m |
>>> ==> |      ExtractPath |       9 |       0 |           p |                                                                                  |
>>> ==> |           Filter |       9 | 3032385 |             |                (hasLabel(m:Topic(0)) AND Property(m,name(1)) == {  AUTOSTRING1}) |
>>> ==> | TraversalMatcher | 1010795 | 1024307 |             |                                                                m,   UNNAMED36, m |
>>> ==> +------------------+---------+---------+-------------+----------------------------------------------------------------------------------+
>>> ==>
>>>
>>>>
>>>>
>>>>
>>>> Without me looking at the raw data, and the query result, you
>>>> seem to have many operations going on. So, you have a lot of rows in
>>>> the profile output.
>>>>
>>>
>>> Only 9
>>>
>>>
>>>>  As a general rule, the more rows there are in the
>>>> profile, the slower the response time is.
>>>> ie. the more complex the query, the slower it is.
>>>>
>>>>
>>>> If I were looking at this, I would try to isolate which part of
>>>> the query is the slow part.  The Return clause, or the Match clause?
>>>>
>>>>
>>>> You've already tried the response times with the data.
>>>> Try to simply:
>>>> return count(*) .
>>>>
>>>
>>> I ran:
>>> MATCH (n:Topic) , (m:Topic), p = (n)-[*0..2]-(m) where n.name =
>>> 'Topic1' and m.name = 'Topic2' with p, n, m return p, count(*) order by
>>> count(*);
>>>
>>> and obtain 9 rows in 182799 ms
>>>
>>> I ran:
>>> MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name =
>>> 'Topic2' with n, m return count(*);
>>>
>>> and obtain 856ms
>>>
>>>
>>> profile MATCH (n:Topic), (m:Topic) where n.name = 'Topic1' and m.name =
>>> 'Topic2' with n, m return count(*);
>>>
>>> results in:
>>>
>>>
>>> ==> ColumnFilter
>>> ==>   |
>>> ==>   +EagerAggregation
>>> ==>     |
>>> ==>     +SchemaIndex(0)
>>> ==>       |
>>> ==>       +SchemaIndex(1)
>>> ==>
>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>> ==> |         Operator | Rows | DbHits | Identifiers |                         Other |
>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>> ==> |     ColumnFilter |    1 |      0 |             |         keep columns count(*) |
>>> ==> | EagerAggregation |    1 |      0 |             |                               |
>>> ==> |   SchemaIndex(0) |    1 |      2 |        m, m | {  AUTOSTRING1}; :Topic(name) |
>>> ==> |   SchemaIndex(1) |    1 |      2 |        n, n | {  AUTOSTRING0}; :Topic(name) |
>>> ==> +------------------+------+--------+-------------+-------------------------------+
>>>
>>>
>>>> How many seconds response time is that, versus the original query?
>>>> What is the resulting profile?
>>>>
>>>>
>>>>
>>>
>>> So it looks like the time is actually spent traversing the graph, while
>>> matching a node by its full name string takes a reasonable ~900 ms.
>>>
>>> *Any idea for improving performance of traversal??*
>>>
>>> *It is a real problem: even fetching the first neighbors of a node hits
>>> the same issue, which currently makes this unfeasible for production.*
>>> *Does anyone have a real case with a graph of similar size and structure,
>>> running a similar query?*
>>>
>>> As an example, this query obtains the first neighbors of node Topic44:
>>>
>>> MATCH (n:Topic) , (m), p = (n)-[*0..1]-(m)
>>> where n.name = 'Topic44'
>>> with p, n, m
>>> return p, reduce(totProximity = 0, n IN relationships(p)| totProximity +
>>> n.proximity) AS pathProximity order by pathProximity DESC LIMIT 6
>>>
>>> returns
>>> 6 rows in ~65000 ms VS 6 rows in less than a second with a NoSQL.
>>>
>>> Any idea?
>>>
>>> thank you guys for helping!! Hope to find a solution soon..
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>> See also the tuning presentations I've done:
>>>> http://rodgersnotes.wordpress.com/2010/09/14/oracle-performance-tuning/
>>>> http://rodgersnotes.wordpress.com/2014/06/08/tuning-the-untunable-when-indexes-and-optimizer-dont-help-2/
>>>> They are quick reads.
>>>>
>>> Thank you, I have seen them;
>>> they are mostly about SQL tuning.
>>> I've just used the Neo4j structure to store a graph with the same label
>>> on 4M topics (I MUST keep it with one label), an index on the
>>> Topic(name) property, and Cypher to query the db.
>>> This is my data structure.
>>>
>>>> I've put a number of principles in there that you might
>>>> apply.
>>>> ie. Could you create the NEO4J equivalent of a temp table?
>>>>
>>>>
>>>> Hope this helps.
>>>>
>>>>
>>>> On Thursday, October 9, 2014 2:41:47 AM UTC-5, gg4u wrote:
>>>>>
>>>>> Hi Michael, thank you.
>>>>> Sure, I'll post my profile result here below!
>>>>>
>>>>>
>>>>>>  --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
>>>
>>
