Re: [Neo4j] Slow cypher query in relatively small data set, and help writing a related query

Michael Hunger Tue, 28 Jan 2014 17:10:34 -0800

Neat.

The idea is to keep the number of rows processed at each step as low as
possible.

In your case the subgraph at the beginning is quite limiting, and only then do
you go further and match the rest (which expands again).

I got your global query down to 30s by reordering the matches.

I started from the BusRels (fewest entries), although it makes little
difference and you can also start with c:Company. This order of matches proved
most effective at pruning the number of rows processed in between.

match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
with distinct c,p
match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
with distinct c,p
match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
with distinct c,p
with c.name as name, count(p) as cnt
return count(*);
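
A rough, unbenchmarked sketch of the paged summary asked about in the original
post: it keeps the same three pruned matches and replaces the final count(*)
with a grouped, ordered and paged RETURN. The {skip} and {limit} parameter
names and the ordering by cnt are assumptions made for illustration; {name} is
the Neo4j 2.0 parameter syntax.

```cypher
// Same match order as above; DISTINCT after each step keeps the
// intermediate row count down to one row per (c, p) pair.
MATCH (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
WITH DISTINCT c, p
MATCH (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
WITH DISTINCT c, p
MATCH (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
WITH DISTINCT c, p
// One row per company name with its number of matching companies.
RETURN c.name AS name, count(p) AS cnt
ORDER BY cnt DESC, name
// Hypothetical paging parameters supplied by the caller.
SKIP {skip} LIMIT {limit};
```

Grouping on c.name mirrors the query above, so two companies sharing a name
would be merged into one row; grouping on c itself would avoid that.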

Hope it helps.

Michael

On 28.01.2014 at 16:12, Ziv Unger <[email protected]> wrote:

> Sorry for the email spam, but based on your suggestion I stripped the Label 
> lookups on the paths that can't be any other type of label, and distilled the 
> query down to this:
> 
> match (c:Company {id: 6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company),
>       (c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
> match (c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> return p.name, collect(distinct(busrels.name)),
>        collect(distinct(sectors.name)), collect(distinct(regions.name));
> 
> Which, incredibly, returns in 22ms! That's pretty damn impressive. I've 
> implemented the same changes in the summary query, but still can't get it to 
> return in a decent amount of time. I'll keep trying on my side; any more help 
> in that regard is much appreciated.
> 
> Thanks again!
> 
> On Tuesday, 28 January 2014 16:55:35 UTC+2, Ziv Unger wrote:
> Completely missed your last line with the modified query!
> 
> I just ran that in a console and it returns in around 33ms!
> 
> I tried having all 3 matches comma-separated in one statement, which raised 
> the time to over 700ms, and also tried continuing the Cypher pattern on from 
> the first (p:Person), but never thought to use just two. I was working on the 
> theory from the webinar, in which you said that multiple patterns should be 
> split into separate MATCH statements. How does one know what will make a 
> difference like this? In my mind I'd expect the results to be similar.
> 
> On Tuesday, 28 January 2014 14:17:08 UTC+2, Michael Hunger wrote:
> 
> I think you already did a really good job.
> 
> How did you measure? With the browser? As Wes said in the webinar, that's 
> quite unreliable. So measure either with neo4j-shell or with your own 
> parameterized code.
> 
> What kind of indexes/constraints do you have in your db?
> e.g. unique constraint on :Company(id) or :Region(name) ?
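> 
> For reference, a sketch of how such a constraint and index are declared in 
> Neo4j 2.0 Cypher. Whether :Region(name) should be a plain index or a unique 
> constraint depends on the data and is an assumption here:
> 
> ```cypher
> // Unique constraint on Company.id (this also creates the backing index).
> CREATE CONSTRAINT ON (c:Company) ASSERT c.id IS UNIQUE;
> // Plain schema index on Region.name (assumes names need not be unique).
> CREATE INDEX ON :Region(name);
> ```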
> 
> You might drop label checks in places where due to the relationship-types 
> there is only one label possible anyway.
> 
> Did you try to pull more of the pattern in a single match?
> 
> It is also probably sensible to look at the cardinalities in between and try 
> to reduce them if they are getting too high (e.g. with WITH distinct 
> c,p,regions.name, busrels.name).
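> 
> A sketch of how that could look for the single-company query, assuming the 
> label checks implied by the relationship types are dropped. The aliases 
> region and busrel are made up for illustration, and this variant has not 
> been timed:
> 
> ```cypher
> // Start from the one company, then collapse duplicates after each step.
> MATCH (c:Company {id: 6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company)
> WITH DISTINCT c, p, regions.name AS region
> MATCH (c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
> WITH DISTINCT c, p, region, busrels.name AS busrel
> MATCH (c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> RETURN p.name, collect(DISTINCT busrel),
>        collect(DISTINCT sectors.name), collect(DISTINCT region);
> ```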
> 
> How do individual parts of the pattern perform?
> 
>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company),
>>       (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> return p.name, collect(distinct(busrels.name)),
>>        collect(distinct(sectors.name)), collect(distinct(regions.name))
> Any chance to get your db (privately) for testing?
> 
> On 28.01.2014 at 12:18, Ziv Unger <[email protected]> wrote:
> 
>> Hi
>> 
>> I have a relatively small database, consisting of the following elements, 
>> each of which is defined as a Label:
>> 
>> Company
>> BusRel (business relationship)
>> Region
>> Sector
>> 
>> Sectors are a parent/child hierarchy (max. 1 level deep), with a 
>> [:has_child] relationship between a parent and children. (approximately 122 
>> nodes total)
>> 
>> Regions are also a parent/child hierarchy, but are present for example as: 
>> (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South 
>> Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region 
>> {name: 'Pretoria'}), ie. Continent all the way down to City with a 
>> :has_child relationship from parent to child. (approximately 50000 nodes 
>> total)
>> 
>> BusRels are business relationships. There are matching pairs which are 
>> linked to each other (bi-directional links), for example: (:BusRel {name: 
>> 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to 
>> distribute'})  (approximately 18 nodes total)
>> 
>> Companies aren't linked to each other directly, but rather via one or more 
>> of the Labels above. A company could have multiple BusRels and Sectors, but 
>> only one Region (at present). (approximately 10000 nodes total)
>> 
>> I've attached a screenshot from the web admin to illustrate an example, the 
>> nodes being:
>> 
>> Blue: Company
>> Grey: Region
>> Yellow: BusRel
>> Orange: Sector
>> 
>> I've written a query to start at a specified company (there is a unique 
>> constraint on Company property "id") and retrieve all other Companies which 
>> share (1) one or more corresponding business relationship (2) the same 
>> region (3) one or more sectors. It looks like this:
>> 
>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>> return p.name, collect(distinct(busrels.name)),
>>        collect(distinct(sectors.name)), collect(distinct(regions.name))
>> 
>> This is the product of some query tuning based on some reading and a recent 
>> video by Wes and Mark. The problem is that this query returns in about 100 - 
>> 120ms, which seems slow to me. (please correct me if I'm wrong, but I've 
>> seen examples of far more complex db's and queries returning in under 20ms)
>> 
>> I've tried profiling it, which returns the following:
>> 
>> ColumnFilter(symKeys=["p.name", "  
>> INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", "  
>> INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", "  
>> INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], 
>> returnItemNames=["p.name", "collect(distinct(busrels.name))", 
>> "collect(distinct(sectors.name))", "collect(distinct(regions.name))"], 
>> _rows=4, _db_hits=0)
>> EagerAggregation(keys=["Cached(p.name of type Any)"], aggregates=["(  
>> INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f,Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))",
>>  "(  
>> INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457,Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))",
>>  "(  
>> INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f,Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"],
>>  _rows=4, _db_hits=48)
>>   Extract(symKeys=["sectors", "  UNNAMED170", "  UNNAMED178", "  
>> UNNAMED215", "  UNNAMED150", "  UNNAMED110", "regions", "  UNNAMED274", "p", 
>> "  UNNAMED25", "c", "  UNNAMED65", "  UNNAMED235", "busrels", "  
>> UNNAMED243"], exprKeys=["p.name"], _rows=10, _db_hits=10)
>>     Filter(pred="(((hasLabel(  UNNAMED235:BusRel(3)) AND hasLabel(  
>> UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND 
>> hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
>>       PatternMatch(g="(c)-['  UNNAMED215']-(  UNNAMED235),(p)-['  
>> UNNAMED274']-(busrels),(  UNNAMED235)-['  UNNAMED243']-(busrels)", _rows=10, 
>> _db_hits=0)
>>         Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel(  
>> UNNAMED170:Sector(0))) AND hasLabel(  UNNAMED170:Sector(0))) AND 
>> hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
>>           PatternMatch(g="(c)-['  UNNAMED110']-(sectors),(p)-['  
>> UNNAMED178']-(  UNNAMED170),(  UNNAMED170)-['  UNNAMED150']-(sectors)", 
>> _rows=130, _db_hits=2733)
>>             Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
>>               TraversalMatcher(trail="(c)-[  UNNAMED25:operates_in_region 
>> WHERE (hasLabel(NodeIdentifier():Region(1)) AND 
>> hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[  
>> UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) AND 
>> true]-(p)", _rows=584, _db_hits=586)
>> 
>> There hardly seem to be enough db_hits or rows being processed to warrant a 
>> 100ms execution time. I'd be happy to retrieve only the first matched region, 
>> sector and busrel, so that the query stops searching for further matches 
>> between two companies and returns faster. I.e. I just need one match across 
>> the bridging nodes, not all of them, but I'm not sure whether that's possible 
>> or how to do it.
>> 
>> This is running on my development machine, which is an i5 2500K with 8GB of 
>> RAM and the DB is on a fast SSD. I've tried increasing the JVM heap size 
>> from the default to 1GB, then 2GB and then 4GB with the same results. Any 
>> help / tips would be appreciated.
>> 
>> The second part of the question involves writing a "summary" query, to 
>> create a paged list of Companies and total number of matches, using the same 
>> criteria. I've come up with this, sans paging:
>> 
>> match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
>> match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
>> return c.name, count(p)
>> 
>> But I suspect there is a serious flaw in the query as this never returns. 
>> (or at least it hasn't before hitting the 30s maximum execution time I've 
>> set)
>> 
>> How would I go about constructing something like that?
>> 
>> Many thanks in advance.
>> 
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>> <debug.png>
> 
> 

