CC'ing the Marks here. I don't know which endpoint Neo4p uses by default.
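For reference, the 2.0 transactional endpoint (`/db/data/transaction/commit`) accepts many parameterized statements in a single HTTP request. A minimal sketch of such a request body follows; it is in Python rather than Perl/Neo4p, and the URL, label, and property names are illustrative, not taken from the poster's actual script:

```python
import json

# Hedged sketch: build one request body for the Neo4j 2.0 transactional
# endpoint carrying several parameterized MERGE statements. The endpoint URL
# and the {"statements": [...]} payload shape follow the 2.0 REST API docs;
# the label/property names below are illustrative only.
TX_COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

def build_payload(rows):
    """One parameterized MERGE per input row, with no RETURN clause
    (you don't need results back when you're just creating data)."""
    statements = [
        {
            # {code} is the old 2.0-era Cypher parameter syntax
            "statement": "MERGE (ndt:NRLS_DATA_TYPE {code: {code}})",
            "parameters": {"code": row["code"]},
        }
        for row in rows
    ]
    return {"statements": statements}

if __name__ == "__main__":
    payload = build_payload([{"code": "IN05_lvl1"}, {"code": "RP07"}])
    print(json.dumps(payload, indent=2))
    # Actually POSTing it needs a running server, so only sketched here:
    # req = urllib.request.Request(TX_COMMIT_URL,
    #     data=json.dumps(payload).encode(),
    #     headers={"Content-Type": "application/json"})
    # urllib.request.urlopen(req)
```

The point is that one round trip can carry an arbitrary number of statements, which is what makes the batching advice below practical.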
As far as I can see, you also run just one query per HTTP request / transaction? Usually you want to batch a score of them in a single tx (e.g. 20k elements). To run your code we probably need the xls files as well :) Also, your query below shouldn't run so long, even if you have some 2 million entries in your db. Any chance to share your db with me?

On 02.01.2014 at 00:14, icenine <[email protected]> wrote:

> Hi Michael,
>
> I've posted the newest code here: http://pastebin.com/7PikkZRP
>
> I've switched all of my CREATE UNIQUE statements to MERGE. I'm still
> convinced, though, that the inserts are under-performing. It's taking about
> 5 minutes for all of the statements in that code to execute 500 times. Over
> time this gap grows to around 8 or even 12 minutes. My biggest bottleneck
> seems to be the incident nodes and their relationships:
>
> neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() RETURN DISTINCT labels(n),
> type(r), count(*) ORDER BY labels(n)[0], type(r);
> +------------------------------------------------------------------------+
> | labels(n)               | type(r)                          | count(*)  |
> +------------------------------------------------------------------------+
> | ["DEGREE_OF_HARM"]      | "HAS_INCIDENT_DEGREE_OF_HARM"    |  2120424  |
> | ["DEGREE_OF_HARM"]      | "HAS_NRLS_DATA_TYPE"             |        7  |
> | ["INCIDENT"]            | "HAS_INCIDENT_CATEGORY"          |  2120457  |
> | ["INCIDENT"]            | "HAS_INCIDENT_DEGREE_OF_HARM"    |  2120424  |
> | ["INCIDENT"]            | "HAS_INCIDENT_PATIENT"           |  2120432  |
> | ["INCIDENT"]            | "HAS_INCIDENT_REPORTER"          |  2120442  |
> | ["INCIDENT"]            | "HAS_INCIDENT_SPECIALITY"        |  2120450  |
> | ["INCIDENT"]            | "HAS_NRLS_DATA_TYPE"             |  2120486  |
> | ["INCIDENT"]            | "IS_NHS_TRUST_INCIDENT"          |  2120483  |
> | ["INCIDENT"]            | "IS_NHS_TRUST_LOCATION_INCIDENT" |  2114664  |
> | ["INCIDENT_CATEGORY"]   | "HAS_INCIDENT_CATEGORY"          |  2120457  |
> | ["INCIDENT_CATEGORY"]   | "HAS_NRLS_DATA_TYPE"             |       16  |
> | ["INCIDENT_REPORTER"]   | "HAS_INCIDENT_REPORTER"          |  2120442  |
> | ["INCIDENT_REPORTER"]   | "HAS_NRLS_DATA_TYPE"             |       12  |
> | ["INCIDENT_SPECIALITY"] | "HAS_INCIDENT_SPECIALITY"        |  2120450  |
> | ["INCIDENT_SPECIALITY"] | "HAS_NRLS_DATA_TYPE"             |       17  |
> | ["NHS_TRUST"]           | "HAS_NHS_TRUST_LOCATION"         |      480  |
> | ["NHS_TRUST"]           | "HAS_NRLS_DATA_TYPE"             |       63  |
> | ["NHS_TRUST"]           | "IS_NHS_TRUST_INCIDENT"          |  2120483  |
> | ["NHS_TRUST_LOCATION"]  | "HAS_NHS_TRUST_LOCATION"         |      480  |
> | ["NHS_TRUST_LOCATION"]  | "IS_NHS_TRUST_LOCATION_INCIDENT" |  2114664  |
> | ["NRLS_DATA_TYPE"]      | "HAS_NRLS_DATA_TYPE"             |  2123426  |
> | ["PATIENT"]             | "HAS_INCIDENT_PATIENT"           |  2120432  |
> | ["PATIENT"]             | "HAS_NRLS_DATA_TYPE"             |     2825  |
> +------------------------------------------------------------------------+
> 24 rows
> 247418 ms
>
> MERGE seems to be slightly more consistent in performance than CREATE
> UNIQUE, though not that much faster.
>
> I've tried the following to tune the instance (note I have 8 GB of RAM on
> the VM, and there's nothing else using it besides Neo4j and my extract
> process, which never takes up much more than 100M of RAM now that I've
> tuned it with MAJ's suggestions):
>
> cache_type=hpc
> node_cache_array_fraction=6
> relationship_cache_array_fraction=7
> #node_cache_size=1024
> relationship_cache_size=2G
>
> I haven't bothered tuning node_cache_size itself, since it's my
> relationship store that seems to be the biggest access point: accessing the
> node count takes 14 seconds, but accessing a relationship count takes
> around 2-3 minutes.
>
> Current heap usage after a restart, while the script is running and after
> processing 1000 rows, is ~500M.
>
> Current neostore sizes are:
>
> [root@miyu graph.db]# ls -l neostore* | awk '{printf("%10s %s\n", $5, $9)}'
>         63 neostore
>          9 neostore.id
>         55 neostore.labeltokenstore.db
>          9 neostore.labeltokenstore.db.id
>        456 neostore.labeltokenstore.db.names
>          9 neostore.labeltokenstore.db.names.id
>   29850422 neostore.nodestore.db
>          9 neostore.nodestore.db.id
>         68 neostore.nodestore.db.labels
>          9 neostore.nodestore.db.labels.id
>  177676780 neostore.propertystore.db
>        128 neostore.propertystore.db.arrays
>          9 neostore.propertystore.db.arrays.id
>          9 neostore.propertystore.db.id
>        162 neostore.propertystore.db.index
>          9 neostore.propertystore.db.index.id
>        722 neostore.propertystore.db.index.keys
>          9 neostore.propertystore.db.index.keys.id
>  679805312 neostore.propertystore.db.strings
>          9 neostore.propertystore.db.strings.id
>  560433951 neostore.relationshipstore.db
>          9 neostore.relationshipstore.db.id
>         45 neostore.relationshiptypestore.db
>          9 neostore.relationshiptypestore.db.id
>        380 neostore.relationshiptypestore.db.names
>          9 neostore.relationshiptypestore.db.names.id
>       1600 neostore.schemastore.db
>          9 neostore.schemastore.db.id
>
> Current cached mappings settings are:
>
> neostore.nodestore.db.mapped_memory=50M
> neostore.relationshipstore.db.mapped_memory=756M
> neostore.propertystore.db.mapped_memory=300M
> neostore.propertystore.db.strings.mapped_memory=756M
> neostore.propertystore.db.arrays.mapped_memory=50M
>
> Current initial heap settings are:
>
> # Initial Java Heap Size (in MB)
> wrapper.java.initmemory=2048
>
> # Maximum Java Heap Size (in MB)
> wrapper.java.maxmemory=5632
>
> Current schema:
>
> neo4j-sh (?)$ schema
> Welcome to the Neo4j Shell!
> Enter 'help' for a list of commands
> [Reconnected to server]
> Indexes
>   ON :DEGREE_OF_HARM(degree_of_harm)           ONLINE (for uniqueness constraint)
>   ON :INCIDENT(incident_description)           ONLINE
>   ON :INCIDENT(incident_timestamp)             ONLINE
>   ON :INCIDENT(incident_id)                    ONLINE (for uniqueness constraint)
>   ON :INCIDENT_CATEGORY(category_level_01)     ONLINE (for uniqueness constraint)
>   ON :INCIDENT_REPORTER(reporter_level_01)     ONLINE (for uniqueness constraint)
>   ON :INCIDENT_SPECIALITY(speciality_level_01) ONLINE (for uniqueness constraint)
>   ON :NHS_TRUST(name)                          ONLINE (for uniqueness constraint)
>   ON :NHS_TRUST_LOCATION(location_level_01)    ONLINE (for uniqueness constraint)
>   ON :NRLS_DATA_TYPE(code)                     ONLINE (for uniqueness constraint)
>   ON :PATIENT(patient_age)                     ONLINE
>   ON :PATIENT(patient_sex)                     ONLINE
>   ON :PATIENT(patient_ethnicity)               ONLINE
>
> Constraints
>   ON (nrls_data_type:NRLS_DATA_TYPE) ASSERT nrls_data_type.code IS UNIQUE
>   ON (nhs_trust:NHS_TRUST) ASSERT nhs_trust.name IS UNIQUE
>   ON (degree_of_harm:DEGREE_OF_HARM) ASSERT degree_of_harm.degree_of_harm IS UNIQUE
>   ON (incident:INCIDENT) ASSERT incident.incident_id IS UNIQUE
>   ON (nhs_trust_location:NHS_TRUST_LOCATION) ASSERT nhs_trust_location.location_level_01 IS UNIQUE
>   ON (incident_reporter:INCIDENT_REPORTER) ASSERT incident_reporter.reporter_level_01 IS UNIQUE
>   ON (incident_category:INCIDENT_CATEGORY) ASSERT incident_category.category_level_01 IS UNIQUE
>   ON (incident_speciality:INCIDENT_SPECIALITY) ASSERT incident_speciality.speciality_level_01 IS UNIQUE
>
> I'm going to keep trying to tweak, but since I can't use property index
> hints with my MERGE statements (which I think would help with the incident
> relationships), I'm just loading anyway so I can get this done, as I've
> been at it for a while.
>
> If you have any further suggestions (or anyone else does), I'd be glad to
> try them out.
>
> ~ icenine
>
> On Wednesday, January 1, 2014 10:33:33 PM UTC, Michael Hunger wrote:
>
> Great!
>
> Looks good.
>
> I think if you use parameters (Neo4p's Cypher support passes in
> perl-hashes for parameters) and the transactional endpoint with your
> import data, it shouldn't take too long to import your 2 million data
> points.
>
> #1 parameters
> #2 transactional endpoint
> #3 sensible batch-size (e.g. 20k per commit)
> #4 usually when just creating data you don't have to return anything
>
> Cheers
>
> Michael
>
> On 01.01.2014 at 19:48, JDS <[email protected]> wrote:
>
>> BTW, I love the simplicity of something like this:
>>
>> neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE { code : 'IN05_lvl1' })
>> > MERGE (ic:INCIDENT_CATEGORY { category_level_01 : 'FOOBAR' })-[r:HAS_NRLS_DATA_TYPE]->(ndt)
>> > RETURN ic, r;
>> +-----------------------------------------------------------------------------+
>> | ic                                        | r                               |
>> +-----------------------------------------------------------------------------+
>> | Node[2121668]{category_level_01:"FOOBAR"} | :HAS_NRLS_DATA_TYPE[16880045]{} |
>> +-----------------------------------------------------------------------------+
>> 1 row
>> Nodes created: 1
>> Relationships created: 1
>> Properties set: 1
>>
>> On Wednesday, January 1, 2014 6:36:04 PM UTC, Michael Hunger wrote:
>>
>> No worries, it's still early in the New Year :)
>>
>> But you definitely want to write a blog post about what you're doing
>> with Neo4j? Right?
>>
>> Happy New Year
>>
>> Michael
>>
>> On 01.01.2014 at 19:33, JDS <[email protected]> wrote:
>>
>>> Ugh *shame*
>>>
>>> Thanks Mike
>>>
>>> On Wednesday, January 1, 2014 6:32:16 PM UTC, Michael Hunger wrote:
>>>
>>> Typo:
>>>
>>> In query #4 you use "NRLS_DATA_TYPE"; in the previous ones you use
>>> "NLRS_DATA_TYPE".
>>>
>>> N_RL_S vs. N_LR_S
>>>
>>> HTH
>>>
>>> Michael
>>>
>>> On 01.01.2014 at 19:26, JDS <[email protected]> wrote:
>>>
>>>> Maybe I'm wrong, but I thought that the first three queries would all
>>>> return data, based on the data returned by the 1st and 4th query, so
>>>> I'm a little confused. Server is 2.0.0 enterprise stable.
>>>>
>>>> neo4j-sh (?)$ START n=node(*) WHERE HAS (n.code) AND n.code = 'IN05_lvl1'
>>>> RETURN n.code;
>>>> +-------------+
>>>> | n.code      |
>>>> +-------------+
>>>> | "IN05_lvl1" |
>>>> +-------------+
>>>> 1 row
>>>> 102405 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE) WHERE ndt.code = 'IN05_lvl1'
>>>> RETURN ndt.code;
>>>> +----------+
>>>> | ndt.code |
>>>> +----------+
>>>> +----------+
>>>> 0 row
>>>> 31 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : 'IN05_lvl1' }) RETURN
>>>> ndt.code;
>>>> +----------+
>>>> | ndt.code |
>>>> +----------+
>>>> +----------+
>>>> 0 row
>>>> 20 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE) RETURN ndt.code;
>>>> +-------------------+
>>>> | ndt.code          |
>>>> +-------------------+
>>>> | "RP07"            |
>>>> | "IN07"            |
>>>> | "Age_at_Incident" |
>>>> | "ST01_LVL1"       |
>>>> | "PD09"            |
>>>> | "PD05_lvl1"       |
>>>> | "IN05_lvl1"       |
>>>> | "IN03_lvl1"       |
>>>> | "IN07_01MMYY"     |
>>>> | "PD11"            |
>>>> | "IN02_A_01"       |
>>>> | "IN01"            |
>>>> | "PD02"            |
>>>> +-------------------+
>>>> 13 rows
>>>> 113 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : "IN05_lvl1" }) RETURN
>>>> ndt.code;
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "Neo4j" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an
>>>> email to [email protected].
>>>> For more options, visit https://groups.google.com/groups/opt_out.
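Pulling the thread's advice together (parameters, the transactional endpoint, roughly 20k elements per commit, and MERGE keyed only on the property that carries the uniqueness constraint, so the constraint's index should resolve the lookup), the batching side of such an import loop might be sketched like this. This is an illustrative Python sketch, not the pastebin code; the label, property, and function names are all assumptions:

```python
# Hedged sketch of a batched import: chunk the source rows and emit one
# list of parameterized statements per chunk, where each list is meant to be
# sent as a single transaction/commit to the transactional endpoint.
BATCH_SIZE = 20000  # "sensible batch-size (e.g. 20k per commit)"

def chunks(rows, size=BATCH_SIZE):
    """Yield successive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# MERGE keyed only on the unique-constrained property (incident_id here),
# so the uniqueness-constraint index can resolve the node; any other
# properties are applied with SET rather than being part of the MERGE key.
# {param} is the old 2.0-era parameter syntax. No RETURN clause (tip #4).
INCIDENT_MERGE = (
    "MERGE (i:INCIDENT {incident_id: {incident_id}}) "
    "SET i.incident_timestamp = {ts}"
)

def batch_statements(rows):
    """One parameterized statement per row, grouped per batch/commit."""
    for batch in chunks(rows):
        yield [
            {"statement": INCIDENT_MERGE,
             "parameters": {"incident_id": r["incident_id"], "ts": r["ts"]}}
            for r in batch
        ]
```

Each yielded list would become one `{"statements": [...]}` POST body, i.e. one commit per ~20k rows instead of one HTTP round trip per statement.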
