Many thanks for your help, it has been invaluable!
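For the paged list I'm assuming I can just put ORDER BY / SKIP / LIMIT on top of your reordered summary query; something along these lines is what I have in mind (only a sketch, the page size of 25 is arbitrary and I haven't benchmarked it yet):

match (c:Company)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p:Company)
with distinct c, p
match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
with distinct c, p
match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
with distinct c, p
// first page of 25 companies, ordered by number of matches
return c.name as name, count(p) as matches
order by matches desc
skip 0 limit 25;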
Ziv

On Wed, Jan 29, 2014 at 3:10 AM, Michael Hunger <[email protected]> wrote:

> Neat.
>
> The idea is to limit the # of rows processed at a time as much as possible.
>
> In your case the subgraph at the beginning is quite limiting, and only then do you go further and match the rest (which expands again).
>
> I got your global query down to 30s by reordering the matches.
>
> Starting from the BusRels (fewest entries) (it's not a big difference, you could also start with c:Company), this order of matches proved most effective at pruning down the # of rows processed in between:
>
> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
> with distinct c,p
> match (c)-[:operates_in_sector]->()-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> with distinct c,p
> match (c)-[:operates_in_region]->()<-[:operates_in_region]-(p)
> with distinct c,p
> with c.name as name, count(p) as cnt
> return count(*);
>
> Hope it helps.
>
> Michael
>
> On 28.01.2014 at 16:12, Ziv Unger <[email protected]> wrote:
>
> Sorry for the email spam, but based on your suggestion I stripped the label lookups on the paths that can't be any other type of label, and distilled the query down to this:
>
> match (c:Company {id: 6})-[:operates_in_region]->(regions)<-[:operates_in_region]-(p:Company),
>       (c)-[:has_objective]->()-[:looks_for]->(busrels)<-[:has_objective]-(p)
> match (c)-[:operates_in_sector]->(sectors)-[:has_child*0..1]-()<-[:operates_in_sector]-(p)
> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name));
>
> Which, incredibly, returns in 22ms! That's pretty damn impressive. I've implemented the same changes in the summary query, but still can't get it to return in a decent amount of time. I'll keep trying on this side; any more help in that regard is much appreciated.
>
> Thanks again!
>
> On Tuesday, 28 January 2014 16:55:35 UTC+2, Ziv Unger wrote:
>>
>> Completely missed your last line with the modified query!
>>
>> I just ran that in a console and it returns in around 33ms!
>>
>> I tried having all 3 matches comma-separated in one statement, which raised the time to over 700ms, and I had it as a continuation of the Cypher pattern from the first (p:Company), but never thought to just have two. I was working on the theory from the webinar in which you said that multiple patterns should be separated out into separate MATCH statements. How does one know what will make a difference like this? In my mind I'd expect the results to be similar.
>>
>> On Tuesday, 28 January 2014 14:17:08 UTC+2, Michael Hunger wrote:
>>>
>>> I think you already did a really good job.
>>>
>>> How did you measure? With the browser? As Wes said in the webinar, that's quite unreliable. So either measure with neo4j-shell or with your own, parametrized code.
>>>
>>> What kind of indexes/constraints do you have in your db? E.g. a unique constraint on :Company(id) or :Region(name)?
>>>
>>> You might drop label checks in places where, due to the relationship-types, there is only one label possible anyway.
>>>
>>> Did you try to pull more of the pattern into a single match?
>>>
>>> It's also probably sensible to look at the cardinalities in between and try to reduce them if they are getting too high (e.g. with WITH distinct c, p, regions.name, busrels.name).
>>>
>>> How do individual parts of the pattern perform?
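>>>
>>> For example, roughly (in the neo4j-shell, prefix a query with PROFILE and compare rows / db_hits for each leg on its own):
>>>
>>> profile match (c:Company {id: 6})-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company) return count(*);
>>> profile match (c:Company {id: 6})-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p:Company) return count(*);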
>>>
>>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company),
>>>       (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name))
>>>
>>> Any chance to get your db (privately) for testing?
>>>
>>> On 28.01.2014 at 12:18, Ziv Unger <[email protected]> wrote:
>>>
>>> Hi
>>>
>>> I have a relatively small database, consisting of the following elements, each of which is defined as a label:
>>>
>>> Company
>>> BusRel (business relationship)
>>> Region
>>> Sector
>>>
>>> Sectors form a parent/child hierarchy (max. 1 level deep), with a [:has_child] relationship between a parent and its children (approximately 122 nodes total).
>>>
>>> Regions are also a parent/child hierarchy, present for example as: (:Region {name: 'Africa'})-[:has_child]->(:Region {name: 'South Africa'})-[:has_child]->(:Region {name: 'Gauteng'})-[:has_child]->(:Region {name: 'Pretoria'}), i.e. continent all the way down to city, with a :has_child relationship from parent to child (approximately 50000 nodes total).
>>>
>>> BusRels are business relationships. There are matching pairs which are linked to each other (bi-directional links, i.e. a :looks_for relationship in each direction), for example: (:BusRel {name: 'Looking for a distributor'})<-[:looks_for]->(:BusRel {name: 'Looking to distribute'}) (approximately 18 nodes total).
>>>
>>> Companies aren't linked to each other directly, but rather via one or more of the labels above. A company can have multiple BusRels and Sectors, but only one Region at present (approximately 10000 nodes total).
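>>>
>>> To make that concrete, a tiny made-up slice of the data would look roughly like this (names and ids are invented; the unique constraint on Company id mentioned below was created along the same lines):
>>>
>>> create constraint on (c:Company) assert c.id is unique;
>>>
>>> create (sa:Region {name: 'South Africa'})-[:has_child]->(gp:Region {name: 'Gauteng'}),
>>>        (it:Sector {name: 'IT'})-[:has_child]->(sw:Sector {name: 'Software'}),
>>>        // a matching BusRel pair, linked in both directions
>>>        (dist:BusRel {name: 'Looking for a distributor'})-[:looks_for]->(supp:BusRel {name: 'Looking to distribute'}),
>>>        (supp)-[:looks_for]->(dist),
>>>        // two made-up companies sharing a region and a sector, with complementary objectives
>>>        (a:Company {id: 6, name: 'Acme'})-[:operates_in_region]->(gp),
>>>        (a)-[:operates_in_sector]->(sw),
>>>        (a)-[:has_objective]->(dist),
>>>        (b:Company {id: 7, name: 'Bolt'})-[:operates_in_region]->(gp),
>>>        (b)-[:operates_in_sector]->(sw),
>>>        (b)-[:has_objective]->(supp);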
>>>
>>> I've attached a screenshot from the web admin to illustrate an example, the nodes being:
>>>
>>> Blue: Company
>>> Grey: Region
>>> Yellow: BusRel
>>> Orange: Sector
>>>
>>> I've written a query to start at a specified company (there is a unique constraint on the Company property "id") and retrieve all other Companies which share (1) one or more corresponding business relationships, (2) the same region, and (3) one or more sectors. It looks like this:
>>>
>>> match (c:Company {id: 6})-[:operates_in_region]->(regions:Region)<-[:operates_in_region]-(p:Company)
>>> match (c)-[:operates_in_sector]->(sectors:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(busrels:BusRel)<-[:has_objective]-(p)
>>> return p.name, collect(distinct(busrels.name)), collect(distinct(sectors.name)), collect(distinct(regions.name))
>>>
>>> This is the product of some query tuning based on some reading and a recent video by Wes and Mark. The problem is that this query returns in about 100-120ms, which seems slow to me (please correct me if I'm wrong, but I've seen examples of far more complex dbs and queries returning in under 20ms).
>>>
>>> I've tried profiling it, which returns the following:
>>>
>>> ColumnFilter(symKeys=["p.name", " INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f", " INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457", " INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f"], returnItemNames=["p.name", "collect(distinct(busrels.name))", "collect(distinct(sectors.name))", "collect(distinct(regions.name))"], _rows=4, _db_hits=0)
>>> EagerAggregation(keys=["Cached(p.name of type Any)"], aggregates=["( INTERNAL_AGGREGATEd6949509-3054-42f0-99dd-d0389f6b0e7f, Distinct(Collect(Property(busrels,name(1))),Property(busrels,name(1))))", "( INTERNAL_AGGREGATE7016f1ed-68e5-4260-bbaf-7c513953e457, Distinct(Collect(Property(sectors,name(1))),Property(sectors,name(1))))", "( INTERNAL_AGGREGATEaf0e7813-ba3e-48b4-a3bf-a2aa25085a4f, Distinct(Collect(Property(regions,name(1))),Property(regions,name(1))))"], _rows=4, _db_hits=48)
>>> Extract(symKeys=["sectors", " UNNAMED170", " UNNAMED178", " UNNAMED215", " UNNAMED150", " UNNAMED110", "regions", " UNNAMED274", "p", " UNNAMED25", "c", " UNNAMED65", " UNNAMED235", "busrels", " UNNAMED243"], exprKeys=["p.name"], _rows=10, _db_hits=10)
>>> Filter(pred="(((hasLabel( UNNAMED235:BusRel(3)) AND hasLabel( UNNAMED235:BusRel(3))) AND hasLabel(busrels:BusRel(3))) AND hasLabel(busrels:BusRel(3)))", _rows=10, _db_hits=0)
>>> PatternMatch(g="(c)-[' UNNAMED215']-( UNNAMED235),(p)-[' UNNAMED274']-(busrels),( UNNAMED235)-[' UNNAMED243']-(busrels)", _rows=10, _db_hits=0)
>>> Filter(pred="(((hasLabel(sectors:Sector(0)) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel( UNNAMED170:Sector(0))) AND hasLabel(sectors:Sector(0)))", _rows=130, _db_hits=0)
>>> PatternMatch(g="(c)-[' UNNAMED110']-(sectors),(p)-[' UNNAMED178']-( UNNAMED170),( UNNAMED170)-[' UNNAMED150']-(sectors)", _rows=130, _db_hits=2733)
>>> Filter(pred="hasLabel(p:Company(2))", _rows=584, _db_hits=0)
>>> TraversalMatcher(trail="(c)-[ UNNAMED25:operates_in_region WHERE (hasLabel(NodeIdentifier():Region(1)) AND hasLabel(NodeIdentifier():Region(1))) AND true]->(regions)<-[ UNNAMED65:operates_in_region WHERE hasLabel(NodeIdentifier():Company(2)) AND true]-(p)", _rows=584, _db_hits=586)
>>>
>>> There hardly seem to be enough db_hits or rows being processed to warrant a 100ms execution time. I'd be happy retrieving only the first matched region, sector and busrel, to optimize the query so that it would stop searching for more matches between two companies and return faster. I.e. I just need one match across the bridging nodes, not all of them, but I'm not sure if that's possible / how to do that.
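>>>
>>> One idea, untested: collapse each collection to a single value with head(), e.g.
>>>
>>> return p.name, head(collect(distinct busrels.name)), head(collect(distinct sectors.name)), head(collect(distinct regions.name))
>>>
>>> although I suspect the pattern is still fully expanded before the aggregation, so it may not actually make things faster.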
>>>
>>> This is running on my development machine, which is an i5 2500K with 8GB of RAM, and the db is on a fast SSD. I've tried increasing the JVM heap size from the default to 1GB, then 2GB and then 4GB, with the same results. Any help / tips would be appreciated.
>>>
>>> The second part of the question involves writing a "summary" query, to create a paged list of Companies and the total number of matches for each, using the same criteria. I've come up with this, sans paging:
>>>
>>> match (c:Company)-[:operates_in_region]->(:Region)<-[:operates_in_region]-(p:Company)
>>> match (c)-[:operates_in_sector]->(:Sector)-[:has_child*0..1]-(:Sector)<-[:operates_in_sector]-(p)
>>> match (c)-[:has_objective]->(:BusRel)-[:looks_for]->(:BusRel)<-[:has_objective]-(p)
>>> return c.name, count(p)
>>>
>>> But I suspect there is a serious flaw in the query, as this never returns (or at least it hasn't before hitting the 30s maximum execution time I've set).
>>>
>>> How would I go about constructing something like that?
>>>
>>> Many thanks in advance.
