CC'ing the Marks here. I don't know which endpoint Neo4p uses by default.
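For reference, the 2.0 transactional endpoint (`/db/data/transaction/commit`) accepts many parameterized statements in a single HTTP request. A minimal sketch of such a request body follows; it is in Python rather than Perl/Neo4p, and the URL, label, and property names are illustrative, not taken from the poster's actual script:

```python
import json

# Hedged sketch: build one request body for the Neo4j 2.0 transactional
# endpoint carrying several parameterized MERGE statements. The endpoint URL
# and the {"statements": [...]} payload shape follow the 2.0 REST API docs;
# the label/property names below are illustrative only.
TX_COMMIT_URL = "http://localhost:7474/db/data/transaction/commit"

def build_payload(rows):
    """One parameterized MERGE per input row, with no RETURN clause
    (you don't need results back when you're just creating data)."""
    statements = [
        {
            # {code} is the old 2.0-era Cypher parameter syntax
            "statement": "MERGE (ndt:NRLS_DATA_TYPE {code: {code}})",
            "parameters": {"code": row["code"]},
        }
        for row in rows
    ]
    return {"statements": statements}

if __name__ == "__main__":
    payload = build_payload([{"code": "IN05_lvl1"}, {"code": "RP07"}])
    print(json.dumps(payload, indent=2))
    # Actually POSTing it needs a running server, so only sketched here:
    # req = urllib.request.Request(TX_COMMIT_URL,
    #     data=json.dumps(payload).encode(),
    #     headers={"Content-Type": "application/json"})
    # urllib.request.urlopen(req)
```

The point is that one round trip can carry an arbitrary number of statements, which is what makes the batching advice below practical.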
As far as I can see, you also run just one query per HTTP request / transaction? Usually you want to batch a score of them in a single tx (e.g. 20k elements). To run your code we probably need the xls files as well :) Also, your query below shouldn't run so long, even if you have some 2 million entries in your db. Any chance to share your db with me?

On 02.01.2014 at 00:14, icenine <[email protected]> wrote:

> Hi Michael,
>
> I've posted the newest code here: http://pastebin.com/7PikkZRP
>
> I've switched all of my CREATE UNIQUE statements to MERGE. I'm still
> convinced, though, that the inserts are under-performing. It's taking about
> 5 minutes for all of the statements in that code to execute 500 times. Over
> time this gap grows to around 8 or even 12 minutes. My biggest bottleneck
> seems to be the incident nodes and their relationships:
>
> neo4j-sh (?)$ START n=node(*) MATCH (n)-[r]-() RETURN DISTINCT labels(n),
> type(r), count(*) ORDER BY labels(n)[0], type(r);
> +------------------------------------------------------------------------+
> | labels(n)               | type(r)                          | count(*)  |
> +------------------------------------------------------------------------+
> | ["DEGREE_OF_HARM"]      | "HAS_INCIDENT_DEGREE_OF_HARM"    |  2120424  |
> | ["DEGREE_OF_HARM"]      | "HAS_NRLS_DATA_TYPE"             |        7  |
> | ["INCIDENT"]            | "HAS_INCIDENT_CATEGORY"          |  2120457  |
> | ["INCIDENT"]            | "HAS_INCIDENT_DEGREE_OF_HARM"    |  2120424  |
> | ["INCIDENT"]            | "HAS_INCIDENT_PATIENT"           |  2120432  |
> | ["INCIDENT"]            | "HAS_INCIDENT_REPORTER"          |  2120442  |
> | ["INCIDENT"]            | "HAS_INCIDENT_SPECIALITY"        |  2120450  |
> | ["INCIDENT"]            | "HAS_NRLS_DATA_TYPE"             |  2120486  |
> | ["INCIDENT"]            | "IS_NHS_TRUST_INCIDENT"          |  2120483  |
> | ["INCIDENT"]            | "IS_NHS_TRUST_LOCATION_INCIDENT" |  2114664  |
> | ["INCIDENT_CATEGORY"]   | "HAS_INCIDENT_CATEGORY"          |  2120457  |
> | ["INCIDENT_CATEGORY"]   | "HAS_NRLS_DATA_TYPE"             |       16  |
> | ["INCIDENT_REPORTER"]   | "HAS_INCIDENT_REPORTER"          |  2120442  |
> | ["INCIDENT_REPORTER"]   | "HAS_NRLS_DATA_TYPE"             |       12  |
> | ["INCIDENT_SPECIALITY"] | "HAS_INCIDENT_SPECIALITY"        |  2120450  |
> | ["INCIDENT_SPECIALITY"] | "HAS_NRLS_DATA_TYPE"             |       17  |
> | ["NHS_TRUST"]           | "HAS_NHS_TRUST_LOCATION"         |      480  |
> | ["NHS_TRUST"]           | "HAS_NRLS_DATA_TYPE"             |       63  |
> | ["NHS_TRUST"]           | "IS_NHS_TRUST_INCIDENT"          |  2120483  |
> | ["NHS_TRUST_LOCATION"]  | "HAS_NHS_TRUST_LOCATION"         |      480  |
> | ["NHS_TRUST_LOCATION"]  | "IS_NHS_TRUST_LOCATION_INCIDENT" |  2114664  |
> | ["NRLS_DATA_TYPE"]      | "HAS_NRLS_DATA_TYPE"             |  2123426  |
> | ["PATIENT"]             | "HAS_INCIDENT_PATIENT"           |  2120432  |
> | ["PATIENT"]             | "HAS_NRLS_DATA_TYPE"             |     2825  |
> +------------------------------------------------------------------------+
> 24 rows
> 247418 ms
>
> MERGE seems to be slightly more consistent in performance than CREATE
> UNIQUE, though not that much faster.
>
> I've tried the following to tune the instance (note I have 8 GB of RAM on
> the VM, and there's nothing else using it besides Neo4j and my extract
> process, which never takes up much more than 100M of RAM now that I've
> tuned it with MAJ's suggestions):
>
> cache_type=hpc
> node_cache_array_fraction=6
> relationship_cache_array_fraction=7
> #node_cache_size=1024
> relationship_cache_size=2G
>
> I haven't bothered tuning node_cache_size itself, since it's my
> relationship store that seems to be the biggest access point: accessing the
> node count takes 14 seconds, but accessing a relationship count takes
> around 2-3 minutes.
>
> Current heap usage after a restart, while the script is running and after
> processing 1000 rows, is ~500M.
>
> Current neostore sizes are:
>
> [root@miyu graph.db]# ls -l neostore* | awk '{printf("%10s %s\n", $5, $9)}'
>         63 neostore
>          9 neostore.id
>         55 neostore.labeltokenstore.db
>          9 neostore.labeltokenstore.db.id
>        456 neostore.labeltokenstore.db.names
>          9 neostore.labeltokenstore.db.names.id
>   29850422 neostore.nodestore.db
>          9 neostore.nodestore.db.id
>         68 neostore.nodestore.db.labels
>          9 neostore.nodestore.db.labels.id
>  177676780 neostore.propertystore.db
>        128 neostore.propertystore.db.arrays
>          9 neostore.propertystore.db.arrays.id
>          9 neostore.propertystore.db.id
>        162 neostore.propertystore.db.index
>          9 neostore.propertystore.db.index.id
>        722 neostore.propertystore.db.index.keys
>          9 neostore.propertystore.db.index.keys.id
>  679805312 neostore.propertystore.db.strings
>          9 neostore.propertystore.db.strings.id
>  560433951 neostore.relationshipstore.db
>          9 neostore.relationshipstore.db.id
>         45 neostore.relationshiptypestore.db
>          9 neostore.relationshiptypestore.db.id
>        380 neostore.relationshiptypestore.db.names
>          9 neostore.relationshiptypestore.db.names.id
>       1600 neostore.schemastore.db
>          9 neostore.schemastore.db.id
>
> Current cached mappings settings are:
>
> neostore.nodestore.db.mapped_memory=50M
> neostore.relationshipstore.db.mapped_memory=756M
> neostore.propertystore.db.mapped_memory=300M
> neostore.propertystore.db.strings.mapped_memory=756M
> neostore.propertystore.db.arrays.mapped_memory=50M
>
> Current initial heap settings are:
>
> # Initial Java Heap Size (in MB)
> wrapper.java.initmemory=2048
>
> # Maximum Java Heap Size (in MB)
> wrapper.java.maxmemory=5632
>
> Current schema:
>
> neo4j-sh (?)$ schema
> Welcome to the Neo4j Shell!
> Enter 'help' for a list of commands
> [Reconnected to server]
> Indexes
>   ON :DEGREE_OF_HARM(degree_of_harm)           ONLINE (for uniqueness constraint)
>   ON :INCIDENT(incident_description)           ONLINE
>   ON :INCIDENT(incident_timestamp)             ONLINE
>   ON :INCIDENT(incident_id)                    ONLINE (for uniqueness constraint)
>   ON :INCIDENT_CATEGORY(category_level_01)     ONLINE (for uniqueness constraint)
>   ON :INCIDENT_REPORTER(reporter_level_01)     ONLINE (for uniqueness constraint)
>   ON :INCIDENT_SPECIALITY(speciality_level_01) ONLINE (for uniqueness constraint)
>   ON :NHS_TRUST(name)                          ONLINE (for uniqueness constraint)
>   ON :NHS_TRUST_LOCATION(location_level_01)    ONLINE (for uniqueness constraint)
>   ON :NRLS_DATA_TYPE(code)                     ONLINE (for uniqueness constraint)
>   ON :PATIENT(patient_age)                     ONLINE
>   ON :PATIENT(patient_sex)                     ONLINE
>   ON :PATIENT(patient_ethnicity)               ONLINE
>
> Constraints
>   ON (nrls_data_type:NRLS_DATA_TYPE) ASSERT nrls_data_type.code IS UNIQUE
>   ON (nhs_trust:NHS_TRUST) ASSERT nhs_trust.name IS UNIQUE
>   ON (degree_of_harm:DEGREE_OF_HARM) ASSERT degree_of_harm.degree_of_harm IS UNIQUE
>   ON (incident:INCIDENT) ASSERT incident.incident_id IS UNIQUE
>   ON (nhs_trust_location:NHS_TRUST_LOCATION) ASSERT nhs_trust_location.location_level_01 IS UNIQUE
>   ON (incident_reporter:INCIDENT_REPORTER) ASSERT incident_reporter.reporter_level_01 IS UNIQUE
>   ON (incident_category:INCIDENT_CATEGORY) ASSERT incident_category.category_level_01 IS UNIQUE
>   ON (incident_speciality:INCIDENT_SPECIALITY) ASSERT incident_speciality.speciality_level_01 IS UNIQUE
>
> I'm going to keep trying to tweak, but since I can't use property index
> hints with my MERGE statements (which I think would help with the incident
> relationships), I'm just loading anyway so I can get this done, as I've
> been at it for a while.
>
> If you have any further suggestions (or anyone else does), I'd be glad to
> try them out.
>
> ~ icenine
>
> On Wednesday, January 1, 2014 10:33:33 PM UTC, Michael Hunger wrote:
>
> Great!
>
> Looks good.
>
> I think if you use parameters (Neo4p's Cypher support passes in
> perl-hashes for parameters) and the transactional endpoint with your
> import data, it shouldn't take too long to import your 2 million data
> points.
>
> #1 parameters
> #2 transactional endpoint
> #3 sensible batch-size (e.g. 20k per commit)
> #4 usually when just creating data you don't have to return anything
>
> Cheers
>
> Michael
>
> On 01.01.2014 at 19:48, JDS <[email protected]> wrote:
>
>> BTW, I love the simplicity of something like this:
>>
>> neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE { code : 'IN05_lvl1' })
>> > MERGE (ic:INCIDENT_CATEGORY { category_level_01 : 'FOOBAR' })-[r:HAS_NRLS_DATA_TYPE]->(ndt)
>> > RETURN ic, r;
>> +-----------------------------------------------------------------------------+
>> | ic                                        | r                               |
>> +-----------------------------------------------------------------------------+
>> | Node[2121668]{category_level_01:"FOOBAR"} | :HAS_NRLS_DATA_TYPE[16880045]{} |
>> +-----------------------------------------------------------------------------+
>> 1 row
>> Nodes created: 1
>> Relationships created: 1
>> Properties set: 1
>>
>> On Wednesday, January 1, 2014 6:36:04 PM UTC, Michael Hunger wrote:
>>
>> No worries, it's still early in the New Year :)
>>
>> But you definitely want to write a blog post about what you're doing
>> with Neo4j? Right?
>>
>> Happy New Year
>>
>> Michael
>>
>> On 01.01.2014 at 19:33, JDS <[email protected]> wrote:
>>
>>> Ugh *shame*
>>>
>>> Thanks Mike
>>>
>>> On Wednesday, January 1, 2014 6:32:16 PM UTC, Michael Hunger wrote:
>>>
>>> Typo:
>>>
>>> In query #4 you use "NRLS_DATA_TYPE"; in the previous ones you use
>>> "NLRS_DATA_TYPE".
>>>
>>> N_RL_S vs. N_LR_S
>>>
>>> HTH
>>>
>>> Michael
>>>
>>> On 01.01.2014 at 19:26, JDS <[email protected]> wrote:
>>>
>>>> Maybe I'm wrong, but I thought that the first three queries would all
>>>> return data, based on the data returned by the 1st and 4th query, so
>>>> I'm a little confused. Server is 2.0.0 enterprise stable.
>>>>
>>>> neo4j-sh (?)$ START n=node(*) WHERE HAS (n.code) AND n.code = 'IN05_lvl1'
>>>> RETURN n.code;
>>>> +-------------+
>>>> | n.code      |
>>>> +-------------+
>>>> | "IN05_lvl1" |
>>>> +-------------+
>>>> 1 row
>>>> 102405 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE) WHERE ndt.code = 'IN05_lvl1'
>>>> RETURN ndt.code;
>>>> +----------+
>>>> | ndt.code |
>>>> +----------+
>>>> +----------+
>>>> 0 row
>>>> 31 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : 'IN05_lvl1' }) RETURN
>>>> ndt.code;
>>>> +----------+
>>>> | ndt.code |
>>>> +----------+
>>>> +----------+
>>>> 0 row
>>>> 20 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NRLS_DATA_TYPE) RETURN ndt.code;
>>>> +-------------------+
>>>> | ndt.code          |
>>>> +-------------------+
>>>> | "RP07"            |
>>>> | "IN07"            |
>>>> | "Age_at_Incident" |
>>>> | "ST01_LVL1"       |
>>>> | "PD09"            |
>>>> | "PD05_lvl1"       |
>>>> | "IN05_lvl1"       |
>>>> | "IN03_lvl1"       |
>>>> | "IN07_01MMYY"     |
>>>> | "PD11"            |
>>>> | "IN02_A_01"       |
>>>> | "IN01"            |
>>>> | "PD02"            |
>>>> +-------------------+
>>>> 13 rows
>>>> 113 ms
>>>>
>>>> neo4j-sh (?)$ MATCH (ndt:NLRS_DATA_TYPE { code : "IN05_lvl1" }) RETURN
>>>> ndt.code;
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google Groups
>>>> "Neo4j" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send an
>>>> email to [email protected].
>>>> For more options, visit https://groups.google.com/groups/opt_out.
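Pulling the thread's advice together (parameters, the transactional endpoint, roughly 20k elements per commit, and MERGE keyed only on the property that carries the uniqueness constraint, so the constraint's index should resolve the lookup), the batching side of such an import loop might be sketched like this. This is an illustrative Python sketch, not the pastebin code; the label, property, and function names are all assumptions:

```python
# Hedged sketch of a batched import: chunk the source rows and emit one
# list of parameterized statements per chunk, where each list is meant to be
# sent as a single transaction/commit to the transactional endpoint.
BATCH_SIZE = 20000  # "sensible batch-size (e.g. 20k per commit)"

def chunks(rows, size=BATCH_SIZE):
    """Yield successive batches of at most `size` rows."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]

# MERGE keyed only on the unique-constrained property (incident_id here),
# so the uniqueness-constraint index can resolve the node; any other
# properties are applied with SET rather than being part of the MERGE key.
# {param} is the old 2.0-era parameter syntax. No RETURN clause (tip #4).
INCIDENT_MERGE = (
    "MERGE (i:INCIDENT {incident_id: {incident_id}}) "
    "SET i.incident_timestamp = {ts}"
)

def batch_statements(rows):
    """One parameterized statement per row, grouped per batch/commit."""
    for batch in chunks(rows):
        yield [
            {"statement": INCIDENT_MERGE,
             "parameters": {"incident_id": r["incident_id"], "ts": r["ts"]}}
            for r in batch
        ]
```

Each yielded list would become one `{"statements": [...]}` POST body, i.e. one commit per ~20k rows instead of one HTTP round trip per statement.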
