Re: [Neo4j] Slow cypher query in relatively small data set, and help writing a related query

Ziv Unger Tue, 28 Jan 2014 07:12:22 -0800

Sorry for the email spam, but based on your suggestion I stripped the Label 
lookups on the paths that can't be any other type of label, and distilled 
the query down to this:


match (c:Company {id: 
6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company),(c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
match 
(c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
return p.name, collect(distinct(busrels.name)), 
collect(distinct(sectors.name)), collect(distinct(regions.name));

Which is incredibly returning at 22ms! That's pretty damn impressive. I've 
implemented the same changes in the summary query, but still can't get it 
to return in decent amount of time. I'll keep trying this side, any more 
help in that regard is much appreciated.

Thanks again!

On Tuesday, 28 January 2014 16:55:35 UTC+2, Ziv Unger wrote:
>
> Completely missed your last line with the modified query!
>
> I just ran that in a console and it returns around 33ms!
>
> I tried having all 3 matches comma separated in one statement, which 
> raised the time to over 700ms, and had it as a continuation of the cypher 
> pattern from the first (p:Person), but never thought to just have two. I 
> was working on the theory from the Webinar in which you said that multiple 
> patterns should be separated out into separate MATCH statements. How  does 
> one know what will make a difference like this, as in my mind I'd expect 
> the results to be similar?
>
> On Tuesday, 28 January 2014 14:17:08 UTC+2, Michael Hunger wrote:
>>
>>
>> I think you already did a really good job.
>>
>> How did you measure? With browser? As Wes said in the webinar that's 
>> quite unreliable. So either measure with neo4j-shell or with your own, 
>> parametrized code.
>>
>> What kind of indexes/constraints do you have in your db?
>> e.g. unique constraint on :Company(id) or :Region(name) ?
>>
>> You might drop label checks in places where due to the relationship-types 
>> there is only one label possible anyway.
>>
>> Did you try to pull more of the pattern in a single match?
>>
>> Also probably sensible to look at the cardinalities in between and try to 
>> reduce them if they are getting too high? (e.g. with WITH distinct c,p,
>> regions.name, busrels.name
>>
>> How do individual parts of the pattern perform?
>>
>> match (c:Company {id: 
>> 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company),(c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>>
>> match 
>> (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> return p.name, collect(distinct(busrels.name)), collect(distinct(
>> sectors.name)), collect(distinct(regions.name)) 
>>
>> Any chance to get your db (privately) for testing?
>>
>> Am 28.01.2014 um 12:18 schrieb Ziv Unger <[email protected]>:
>>
>> Hi
>>
>> I have a relatively small database, consisting of the following elements, 
>> each of which is defined as a Label:
>>
>> Company
>> BusRel (business relationship)
>> Region
>> Sector
>>
>> Sectors are a parent/child hierarchy (max. 1 level deep), with a 
>> [:has_child] relationship between a parent and children. (approximately 122 
>> nodes total)
>>
>> Regions are also a parent/child hierarchy, but are present for example 
>> as: (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South 
>> Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region 
>> {name: 'Pretoria'}), ie. Continent all the way down to City with a 
>> :has_child relationship from parent to child. (approximately 50000 nodes 
>> total)
>>
>> BusRel are business relationships. There are matching pairs which are 
>> linked to each other (bi-directional links), for example: {:BusRel {name: 
>> 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to 
>> distribute'})  (approximately 18 nodes total)
>>
>> Companies aren't linked to each other directly, but rather via one or 
>> more of the Labels above. A company could have multiple BusRels and 
>> Sectors, but only one Region (at present). (approximately 10000 nodes total)
>>
>> I've attached a screenshot from the web admin to illustrate an example, 
>> the nodes being:
>>
>> Blue: Company
>> Grey: Region
>> Yellow: BusRel
>> Orange: Sector
>>
>> I've written a query to start at a specified company (there is a unique 
>> constraint on Company property "id") and retrieve all other Companies which 
>> share (1) one or more corresponding business relationship (2) the same 
>> region (3) one or more sectors. It looks like this:
>>
>> match (c:Company {id: 
>> 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
>> match 
>> (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> match 
>> (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>> return p.name, collect(distinct(busrels.name)), collect(distinct(
>> sectors.name)), collect(distinct(regions.name)) 
>>
>> This is the product of some query tuning based on some reading and a 
>> recent video by Wes and Mark. The problem is that this query returns in 
>> about 100 - 120ms, which seems slow to me. (please correct me if I'm wrong, 
>> but I've seen examples of far more complex db's and queries returning in 
>> under 20ms)
>>
>> I've tried profiling it, which returns the following:
>>
>> ColumnFilter(symKeys=["p.name", " 
>>  INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", " 
>>  INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", " 
>>  INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], 
>> returnItemNames=["p.name", "collect(distinct(busrels.name))", 
>> "collect(distinct(sectors.name))", "collect(distinct(regions.name))"], 
>> _rows=4, _db_hits=0)
>> EagerAggregation(keys=["Cached(p.name of type Any)"], aggregates=["( 
>>  
>> INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f,Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))",
>>  
>> "( 
>>  
>> INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457,Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))",
>>  
>> "( 
>>  
>> INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f,Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"],
>>  
>> _rows=4, _db_hits=48)
>>   Extract(symKeys=["sectors", "  UNNAMED170", "  UNNAMED178", " 
>>  UNNAMED215", "  UNNAMED150", "  UNNAMED110", "regions", "  UNNAMED274", 
>> "p", "  UNNAMED25", "c", "  UNNAMED65", "  UNNAMED235", "busrels", " 
>>  UNNAMED243"], exprKeys=["p.name"], _rows=10, _db_hits=10)
>>     Filter(pred="(((hasLabel(  UNNAMED235:BusRel(3)) AND hasLabel( 
>>  UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND 
>> hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
>>       PatternMatch(g="(c)-['  UNNAMED215']-(  UNNAMED235),(p)-[' 
>>  UNNAMED274']-(busrels),(  UNNAMED235)-['  UNNAMED243']-(busrels)", 
>> _rows=10, _db_hits=0)
>>         Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel( 
>>  UNNAMED170:Sector(0))) AND hasLabel(  UNNAMED170:Sector(0))) AND 
>> hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
>>           PatternMatch(g="(c)-['  UNNAMED110']-(sectors),(p)-[' 
>>  UNNAMED178']-(  UNNAMED170),(  UNNAMED170)-['  UNNAMED150']-(sectors)", 
>> _rows=130, _db_hits=2733)
>>             Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
>>               TraversalMatcher(trail="(c)-[  UNNAMED25:operates_in_region 
>> WHERE (hasLabel(NodeIdentifier():Region(1)) AND 
>> hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[ 
>>  UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) 
>> AND true]-(p)", _rows=584, _db_hits=586)
>>
>> There hardly seem to be enough db_hits or rows being processed to warrant 
>> a 100ms execution time. I'd be happy retrieving only the first matched 
>> region, sector and busrel to optimize the query so that it would stop 
>> searching for more matches between two companies and return faster. Ie. I 
>> just need one match across the bridging nodes, not all of them, but I'm not 
>> sure if that's possible / how to do that.
>>
>> This is running on my development machine, which is an i5 2500K with 8GB 
>> of RAM and the DB is on a fast SSD. I've tried increasing the JVM heap size 
>> from the default to 1GB, then 2GB and then 4GB with the same results. Any 
>> help / tips would be appreciated.
>>
>> The second part of the question involves writing a "summary" query, to 
>> create a paged list of Companies and total number of matches, using the 
>> same criteria. I've come up with this, sans paging:
>>
>> match 
>> (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
>> match 
>> (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>> match 
>> (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
>> return c.name, count(p)
>>
>> But I suspect there is a serious flaw in the query as this never returns. 
>> (or at least it hasn't before hitting the 30s maximum execution time I've 
>> set)
>>
>> How would I go about constructing something like that?
>>
>> Many thanks in advance.
>>
>> -- 
>> You received this message because you are subscribed to the Google Groups 
>> "Neo4j" group.
>> To unsubscribe from this group and stop receiving emails from it, send an 
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
>> <debug.png>
>>
>>
>>

-- 
You received this message because you are subscribed to the Google Groups 
"Neo4j" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.

Re: [Neo4j] Slow cypher query in relatively small data set, and help writing a related query

Reply via email to