Many thanks for your help, it has been invaluable!
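For the paged list I'm assuming I can just put ORDER BY / SKIP / LIMIT on top of your reordered summary query; something along these lines is what I have in mind (only a sketch, the page size of 25 is arbitrary and I haven't benchmarked it yet):

match (c:Company)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p:Company)
with distinct c, p
match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
with distinct c, p
match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
with distinct c, p
// first page of 25 companies, ordered by number of matches
return c.name as name, count(p) as matches
order by matches desc
skip 0 limit 25;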
Ziv

On Wed, Jan 29, 2014 at 3:10 AM, Michael Hunger <[email protected]> wrote:

> Neat.
>
> The idea is to limit the # of rows processed at a time as much as possible.
>
> In your case the subgraph at the beginning is quite limiting, and only then do you go further and match the rest (which expands again).
>
> I got your global query down to 30s by reordering the matches.
>
> Starting from the BusRels (fewest entries) (it's not a big difference, you could also start with c:Company), this order of matches proved most effective at pruning down the # of rows processed in between:
>
> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
> with distinct c,p
> match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> with distinct c,p
> match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
> with distinct c,p
> with c.name as name, count(p) as cnt
> return count(*);
>
> Hope it helps.
>
> Michael
>
> On 28.01.2014 at 16:12, Ziv Unger <[email protected]> wrote:
>
> Sorry for the email spam, but based on your suggestion I stripped the label lookups on the paths that can't be any other type of label, and distilled the query down to this:
>
> match (c:Company {id: 6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company),
>       (c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
> match (c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name));
>
> Which, incredibly, returns in 22ms! That's pretty damn impressive. I've implemented the same changes in the summary query, but still can't get it to return in a decent amount of time. I'll keep trying on this side; any more help in that regard is much appreciated.
>
> Thanks again!
>
> On Tuesday, 28 January 2014 16:55:35 UTC+2, Ziv Unger wrote:
>>
>> Completely missed your last line with the modified query!
>>
>> I just ran that in a console and it returns in around 33ms!
>>
>> I tried having all 3 matches comma-separated in one statement, which raised the time to over 700ms, and I had it as a continuation of the Cypher pattern from the first (p:Company), but never thought to just have two. I was working on the theory from the webinar in which you said that multiple patterns should be separated out into separate MATCH statements. How does one know what will make a difference like this? In my mind I'd expect the results to be similar.
>>
>> On Tuesday, 28 January 2014 14:17:08 UTC+2, Michael Hunger wrote:
>>>
>>> I think you already did a really good job.
>>>
>>> How did you measure? With the browser? As Wes said in the webinar, that's quite unreliable. So either measure with neo4j-shell or with your own, parametrized code.
>>>
>>> What kind of indexes/constraints do you have in your db? E.g. a unique constraint on :Company(id) or :Region(name)?
>>>
>>> You might drop label checks in places where, due to the relationship-types, there is only one label possible anyway.
>>>
>>> Did you try to pull more of the pattern into a single match?
>>>
>>> It's also probably sensible to look at the cardinalities in between and try to reduce them if they are getting too high (e.g. with WITH distinct c, p, regions.name, busrels.name).
>>>
>>> How do individual parts of the pattern perform?
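>>>
>>> For example, roughly (in the neo4j-shell, prefix a query with PROFILE and compare rows / db_hits for each leg on its own):
>>>
>>> profile match (c:Company {id: 6})-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company) return count(*);
>>> profile match (c:Company {id: 6})-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p:Company) return count(*);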
>>>
>>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company),
>>>       (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name))
>>>
>>> Any chance to get your db (privately) for testing?
>>>
>>> On 28.01.2014 at 12:18, Ziv Unger <[email protected]> wrote:
>>>
>>> Hi
>>>
>>> I have a relatively small database, consisting of the following elements, each of which is defined as a label:
>>>
>>> Company
>>> BusRel (business relationship)
>>> Region
>>> Sector
>>>
>>> Sectors form a parent/child hierarchy (max. 1 level deep), with a [:has_child] relationship between a parent and its children (approximately 122 nodes total).
>>>
>>> Regions are also a parent/child hierarchy, present for example as: (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region {name: 'Pretoria'}), i.e. continent all the way down to city, with a :has_child relationship from parent to child (approximately 50000 nodes total).
>>>
>>> BusRels are business relationships. There are matching pairs which are linked to each other (bi-directional links, i.e. a :looks_for relationship in each direction), for example: (:BusRel {name: 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to distribute'}) (approximately 18 nodes total).
>>>
>>> Companies aren't linked to each other directly, but rather via one or more of the labels above. A company can have multiple BusRels and Sectors, but only one Region at present (approximately 10000 nodes total).
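>>>
>>> To make that concrete, a tiny made-up slice of the data would look roughly like this (names and ids are invented; the unique constraint on Company id mentioned below was created along the same lines):
>>>
>>> create constraint on (c:Company) assert c.id is unique;
>>>
>>> create (sa:Region {name: 'South Africa'})-[:has_child]->(gp:Region {name: 'Gauteng'}),
>>>        (it:Sector {name: 'IT'})-[:has_child]->(sw:Sector {name: 'Software'}),
>>>        // a matching BusRel pair, linked in both directions
>>>        (dist:BusRel {name: 'Looking for a distributor'})-[:looks_for]->(supp:BusRel {name: 'Looking to distribute'}),
>>>        (supp)-[:looks_for]->(dist),
>>>        // two made-up companies sharing a region and a sector, with complementary objectives
>>>        (a:Company {id: 6, name: 'Acme'})-[:operates_in_region]->(gp),
>>>        (a)-[:operates_in_sector]->(sw),
>>>        (a)-[:has_objective]->(dist),
>>>        (b:Company {id: 7, name: 'Bolt'})-[:operates_in_region]->(gp),
>>>        (b)-[:operates_in_sector]->(sw),
>>>        (b)-[:has_objective]->(supp);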
>>>
>>> I've attached a screenshot from the web admin to illustrate an example, the nodes being:
>>>
>>> Blue: Company
>>> Grey: Region
>>> Yellow: BusRel
>>> Orange: Sector
>>>
>>> I've written a query to start at a specified company (there is a unique constraint on the Company property "id") and retrieve all other Companies which share (1) one or more corresponding business relationships, (2) the same region, and (3) one or more sectors. It looks like this:
>>>
>>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
>>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>>> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name))
>>>
>>> This is the product of some query tuning based on some reading and a recent video by Wes and Mark. The problem is that this query returns in about 100-120ms, which seems slow to me (please correct me if I'm wrong, but I've seen examples of far more complex dbs and queries returning in under 20ms).
>>>
>>> I've tried profiling it, which returns the following:
>>>
>>> ColumnFilter(symKeys=["p.name", " INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", " INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", " INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], returnItemNames=["p.name", "collect(distinct(busrels.name))", "collect(distinct(sectors.name))", "collect(distinct(regions.name))"], _rows=4, _db_hits=0)
>>> EagerAggregation(keys=["Cached(p.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f, Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))", "( INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457, Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))", "( INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f, Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"], _rows=4, _db_hits=48)
>>> Extract(symKeys=["sectors", " UNNAMED170", " UNNAMED178", " UNNAMED215", " UNNAMED150", " UNNAMED110", "regions", " UNNAMED274", "p", " UNNAMED25", "c", " UNNAMED65", " UNNAMED235", "busrels", " UNNAMED243"], exprKeys=["p.name"], _rows=10, _db_hits=10)
>>> Filter(pred="(((hasLabel( UNNAMED235:BusRel(3)) AND hasLabel( UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
>>> PatternMatch(g="(c)-[' UNNAMED215']-( UNNAMED235),(p)-[' UNNAMED274']-(busrels),( UNNAMED235)-[' UNNAMED243']-(busrels)", _rows=10, _db_hits=0)
>>> Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
>>> PatternMatch(g="(c)-[' UNNAMED110']-(sectors),(p)-[' UNNAMED178']-( UNNAMED170),( UNNAMED170)-[' UNNAMED150']-(sectors)", _rows=130, _db_hits=2733)
>>> Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
>>> TraversalMatcher(trail="(c)-[ UNNAMED25:operates_in_region WHERE (hasLabel(NodeIdentifier():Region(1)) AND hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[ UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) AND true]-(p)", _rows=584, _db_hits=586)
>>>
>>> There hardly seem to be enough db_hits or rows being processed to warrant a 100ms execution time. I'd be happy retrieving only the first matched region, sector and busrel, to optimize the query so that it would stop searching for more matches between two companies and return faster. I.e. I just need one match across the bridging nodes, not all of them, but I'm not sure if that's possible / how to do that.
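>>>
>>> One idea, untested: collapse each collection to a single value with head(), e.g.
>>>
>>> return p.name, head(collect(distinct busrels.name)), head(collect(distinct sectors.name)), head(collect(distinct regions.name))
>>>
>>> although I suspect the pattern is still fully expanded before the aggregation, so it may not actually make things faster.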
>>>
>>> This is running on my development machine, which is an i5 2500K with 8GB of RAM, and the db is on a fast SSD. I've tried increasing the JVM heap size from the default to 1GB, then 2GB and then 4GB, with the same results. Any help / tips would be appreciated.
>>>
>>> The second part of the question involves writing a "summary" query, to create a paged list of Companies and the total number of matches for each, using the same criteria. I've come up with this, sans paging:
>>>
>>> match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
>>> match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
>>> return c.name, count(p)
>>>
>>> But I suspect there is a serious flaw in the query, as this never returns (or at least it hasn't before hitting the 30s maximum execution time I've set).
>>>
>>> How would I go about constructing something like that?
>>>
>>> Many thanks in advance.
