I've updated the database dump on Amazon S3 following Michael's suggestion. I will rerun the tests as soon as Michael has finished his investigation.
Best,
Frank

On Thursday, 11 June 2015 17:01:20 UTC+2, Frank Celler wrote:
>
> It worked perfectly.
>
> On Thursday, 11 June 2015 15:36:30 UTC+2, Michael Hunger wrote:
>>
>> you forgot --id-type integer
>>
>> the script actually takes care of it
>>
>> On 11.06.2015 at 14:55, Michael Hunger <[email protected]> wrote:
>>
>> I used both 2.2.2 and 2.3-M02 and 2.3-SNAPSHOT for the import.
>>
>> I can also provide you with the freshly imported databases. Let me know.
>>
>> Michael
>>
>> On 11.06.2015 at 14:45, Frank Celler <[email protected]> wrote:
>>
>> Hi Michael,
>>
>> thanks a lot for the import script. I'm currently trying to generate a
>> new database dump (with Neo4J 2.2.2 Community). But I get the following
>> error:
>>
>> $ bash -x ./import.sh
>> ...
>> + rm -rf pokec.db
>> + ./bin/neo4j-import --into pokec.db --id-type --delimiter TAB --quote Ö
>> --nodes:PROFILES
>> profiles_header.txt,soc-pokec-profiles_no_null_sorted.txt.gz
>> --relationships:RELATION
>> relationships_header.txt,soc-pokec-relationships.txt.gz
>> Exception in thread "main" java.lang.NullPointerException
>>     at org.neo4j.tooling.ImportTool$6.apply(ImportTool.java:575)
>>     at org.neo4j.tooling.ImportTool$6.apply(ImportTool.java:571)
>>     at org.neo4j.helpers.Args.interpretOption(Args.java:490)
>>     at org.neo4j.tooling.ImportTool.main(ImportTool.java:282)
>>     at org.neo4j.tooling.ImportTool.main(ImportTool.java:244)
>>
>> My java is
>>
>> java version "1.8.0_45"
>> Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
>> Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
>>
>> Do I need 2.3 for the import?
>>
>> Thanks
>> Frank
>>
>> On Thursday, 11 June 2015 13:40:55 UTC+2, Michael Hunger wrote:
>>>
>>> I created an import script which I added to my repository.
>>> On my machine it imports the data in 35 seconds.
>>>
>>> It uses more sensible types for the fields and also skips all the
>>> null values.
>>>
>>> It also uses a numeric id for the primary key, which makes more sense to
>>> me.
>>>
>>> If you optimized the dataset for Neo4j, you could even use the node-id as
>>> the primary key, since the input data has a sane, incrementing id; then it
>>> would be way faster.
>>>
>>> I also added a neo4j-pokec directory with queries to use that numeric id
>>> as input (probably should also use an input.json file that doesn't contain
>>> "Pxxx" strings, not sure what the perf impact is of converting those
>>> strings).
>>>
>>> Cheers, Michael
>>>
>>> https://github.com/jexp/nosql-tests/tree/my-import
>>>
>>> I did some preliminary testing
>>>
>>> Neo4j 2.2
>>>
>>> node benchmark.js neo4j-pokec -t shortest,neighbors,neighbors2,aggregation,singleRead
>>> INFO using server address 127.0.0.1
>>> INFO start
>>> INFO executing shortest path for 19 paths
>>> INFO total paths length: 104
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *shortest* path, 19 items
>>> INFO Total Time for 19 requests: 85 ms
>>> INFO Average: *4.47 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing neighbors for 500 elements
>>> INFO total number of neighbors found: 9102
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *neighbors*, 500 items
>>> INFO Total Time for 500 requests: 428 ms
>>> INFO Average: *0.86 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing neighbors 2nd degree for 500 elements
>>> INFO total number of neighbors2 found: 545530
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *neighbors2*, 500 items
>>> INFO Total Time for 500 requests: 4850 ms
>>> INFO Average: *9.7 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing aggregation
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *aggregate*, 1 items
>>> INFO Total Time for 1 requests: 14036 ms
>>> INFO Average: *14036 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing single read with 100000 documents
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *single reads*, 100000 items
>>> INFO Total Time for 100000 requests: 83473 ms
>>> INFO Average: *0.83 ms*
>>> INFO -----------------------------------------------------------------------------
>>>
>>>
>>> Neo4j 2.3
>>>
>>> node benchmark.js neo4j-pokec -t shortest,neighbors,neighbors2,aggregation,singleRead
>>> INFO using server address 127.0.0.1
>>> INFO start
>>> INFO executing shortest path for 19 paths
>>> INFO total paths length: 104
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *shortest* path, 19 items
>>> INFO Total Time for 19 requests: 69 ms
>>> INFO Average: *3.63 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing neighbors for 500 elements
>>> INFO total number of neighbors found: 9102
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *neighbors*, 500 items
>>> INFO Total Time for 500 requests: 431 ms
>>> INFO Average: *0.86 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing neighbors 2nd degree for 500 elements
>>> INFO total number of neighbors2 found: 545530
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *neighbors2*, 500 items
>>> INFO Total Time for 500 requests: 3441 ms
>>> INFO Average: *6.88 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing aggregation
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *aggregate*, 1 items
>>> INFO Total Time for 1 requests: 2848 ms
>>> INFO Average: *2848 ms*
>>> INFO -----------------------------------------------------------------------------
>>> INFO executing single read with 100000 documents
>>> INFO -----------------------------------------------------------------------------
>>> INFO Neo4J: *single reads*, 100000 items
>>> INFO Total Time for 100000 requests: 77760 ms
>>> INFO Average: *0.78 ms*
>>> INFO -----------------------------------------------------------------------------
>>> DONE
>>>
>>>
>>> On 10.06.2015 at 18:55, Frank Celler <[email protected]> wrote:
>>>
>>> Hi Michael,
>>>
>>> thanks for sharing your preliminary findings. I'll incorporate them into
>>> the benchmark suite and rerun the tests. I've seen that there is a 30-day
>>> trial for the enterprise edition, so I can test that as well.
>>>
>>> Is it possible to upload the database where you changed the AGE
>>> attribute? Or is there any easy Cypher command to change the type?
>>>
>>> Thanks
>>> Frank
>>>
>>>
>>> On Wednesday, 10 June 2015 17:27:05 UTC+2, Michael Hunger wrote:
>>>>
>>>> I also did some experiments but didn't have the time to finish yet,
>>>> here are my observations so far:
>>>>
>>>> *Arangodb Measurement*
>>>>
>>>> - index -> constraint `CREATE CONSTRAINT ON (p:PROFILES) ASSERT p._key IS UNIQUE;`
>>>> - seraph -> replace with node-neo4j 2.0.RC1
>>>> - uses 2 year old /cypher api, doesn't send X-Stream:true header
>>>> - does not do efficient auth (encode creds on every call)
>>>> - doesn't do pooling
>>>> - suboptimal queries
>>>> - make sure the concurrency level is adequate for the setup (utilize
>>>> all cores but don't flood, use e.g. async.eachWithLimit)
>>>> - warmup with nodes and rels `MATCH ()--() return count(*);`
>>>> - enterprise with better vertical read/write scalability vs. community
>>>> - Use 12G-24G heap, 2G new gen (-Xmn2G)
>>>> - pagecache to 2.5G + growth (e.g. another 2.5G)
>>>> - in 2.2 set cache_type = soft or cache_type=none depending on
>>>> available heap
>>>> - fix property encoding, e.g. AGE as int not string, don't store "null" !!
>>>> -> affects esp. aggregate query
>>>> - don't re-run the benchmark on the same store, start at the initial one
>>>> -> creating and deleting the additional PROFILES_TEMP nodes affects
>>>> repeatability of results
>>>>
>>>> correct datatypes:
>>>>
>>>> * "null" should *never be stored*
>>>> * int: public, gender, completion_percentage, AGE,
>>>> * long/time: last_login, registration
>>>> * optionally as label: gender, public
>>>>
>>>> -> test repository (WIP): with changes in *description.js and
>>>> benchmark.js*
>>>>
>>>> https://github.com/jexp/nosql-tests/tree/node-neo4j
>>>>
>>>> queries for neo4j-shell:
>>>>
>>>> export from="P/P1"
>>>> export to="P/P277"
>>>>
>>>> export key="P/P1"
>>>>
>>>> // warmup
>>>> MATCH ()--() return count(*);
>>>> // 61.245.128 rows
>>>>
>>>> MATCH (s:PROFILES) return count(*);
>>>> // 1.632.803 profiles
>>>> // 1.15 s
>>>>
>>>> profile
>>>>
>>>> MATCH (s:PROFILES {_key:{key}})-[*1..2]->(n:PROFILES) RETURN DISTINCT n._key;
>>>> // 295 rows 5 ms
>>>>
>>>>
>>>> // 1st degree neighbours
>>>> MATCH (:PROFILES {_key:{key}})-->(n) RETURN n._key;
>>>> // 14 rows 1ms
>>>>
>>>> // 2nd degree neighbours
>>>> MATCH (s:PROFILES {_key:{key}})-->(x)
>>>> MATCH (x)-->(n:PROFILES)
>>>> RETURN DISTINCT n._key;
>>>> // 283 rows 6 ms
>>>>
>>>> // shortest path
>>>> MATCH (s:PROFILES {_key:{from}}),(t:PROFILES {_key:{to}}),
>>>> p = shortestPath((s)-[*..15]->(t)) RETURN [x in nodes(p) | x._key] as path;
>>>> // 1 ms, don't return the full data only keys like in the other db's
>>>>
>>>> // aggregation
>>>> MATCH (f:PROFILES) RETURN f.AGE, count(*);
>>>> // 22s -> should be rather 1.5s
>>>>
>>>> // single read
>>>> MATCH (f:PROFILES) WHERE f._key = {key} RETURN f;
>>>> // or
>>>> MATCH (s:PROFILES {_key:{key}}) RETURN s;
>>>> // 1 row with 59 properties 1 ms
>>>>
>>>> // single writes
>>>> CREATE (s:PROFILES_TEMP {data}) RETURN id(s);
>>>>
>>>> // delete all nodes with a certain label
>>>> // loop until returns 0
>>>> MATCH (n:PROFILES_TEMP) WITH n LIMIT 5000 OPTIONAL MATCH (n)-[r]-()
>>>> DELETE n,r RETURN count(*) as deleted
>>>> ----
>>>>
>>>> MATCH (s:PROFILES {_key:{key}})-[*1..2]->(n:PROFILES) WITH DISTINCT n._key as key RETURN count(*);
>>>> // 295 count 5-6ms
>>>>
>>>> MATCH (f:PROFILES) return id(f) % 140, count(*);
>>>> // 140 rows -> 1502 ms that's how it should be
>>>>
>>>> sample data:
>>>>
>>>> _key:"P/P1",
>>>> public:"1",
>>>> completion_percentage:"14",
>>>> gender:"1",
>>>> region:"zilinsky kraj, zilina",
>>>> last_login:"2012-05-25 11:20:00.0",
>>>> registration:"2005-04-03 00:00:00.0",
>>>> AGE:26,
>>>> body:"185 cm, 90 kg",
>>>> I_am_working_in_field:"it",
>>>> spoken_languages:"anglicky",
>>>> hobbies:"sportovanie, spanie, kino, jedlo, pocuvanie hudby, priatelia, divadlo",
>>>> I_most_enjoy_good_food:"v dobrej restauracii",
>>>> pets:"mam psa",
>>>> body_type:"null",
>>>> my_eyesight:"null",
>>>> eye_color:"null",
>>>> hair_color:"null",
>>>> hair_type:"null",
>>>> completed_level_of_education:"null",
>>>> favourite_color:"null",
>>>> relation_to_smoking:"null",
>>>> relation_to_alcohol:"null",
>>>> sign_in_zodiac:"null",
>>>> on_pokec_i_am_looking_for:"null",
>>>> love_is_for_me:"null",
>>>> relation_to_casual_sex:"null",
>>>> my_partner_should_be:"null",
>>>> marital_status:"null",
>>>> children:"null",
>>>> relation_to_children:"null",
>>>> I_like_movies:"null",
>>>> I_like_watching_movie:"null",
>>>> I_like_music:"null",
>>>> I_mostly_like_listening_to_music:"null",
>>>> the_idea_of_good_evening:"null",
>>>> I_like_specialties_from_kitchen:"null",
>>>> fun:"null",
>>>> I_am_going_to_concerts:"null",
>>>> my_active_sports:"null",
>>>> my_passive_sports:"null",
>>>> profession:"null",
>>>> I_like_books:"null",
>>>> life_style:"null",
>>>> music:"null",
>>>> cars:"null",
>>>> politics:"null",
>>>> relationships:"null",
>>>> art_culture:"null",
>>>> hobbies_interests:"null",
>>>> science_technologies:"null",
>>>> computers_internet:"null",
>>>> education:"null",
>>>> sport:"null",
>>>> movies:"null",
>>>> travelling:"null",
>>>> health:"null",
>>>> companies_brands:"null",
>>>> more:"null"
>>>>
>>>>
>>>> neo4j-server.properties:
>>>> org.neo4j.server.database.location=/Users/mh/support/arangodb/db/data
>>>> org.neo4j.server.webserver.port=8474
>>>> dbms.security.auth_enabled=false
>>>>
>>>>
>>>> neo4j-wrapper.conf:
>>>> wrapper.java.initmemory=8000
>>>> wrapper.java.maxmemory=8000
>>>> wrapper.java.additional=-Xmn2G
>>>>
>>>> neo4j.properties:
>>>> dbms.pagecache.memory=5G
>>>> keep_logical_logs=false
>>>> remote_shell_enabled=false
>>>> cache_type=soft
>>>> online_backup_enabled=false
>>>>
>>>>
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "Neo4j" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/d/optout.
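
For anyone following along, here is a rough sketch of the "correct datatypes" cleanup from Michael's list in the thread (never store "null", ints for public/gender/completion_percentage/AGE, long/time for last_login/registration). It is written in JavaScript only because benchmark.js is Node; it is not taken from the import script or the repository. The names `cleanProfile` and `parseTimestamp` are hypothetical, and treating the timestamps as UTC is an assumption, since the thread does not state the dataset's time zone.

```javascript
// Fields listed in the thread as integers and as long/time values.
const INT_FIELDS = ["public", "gender", "completion_percentage", "AGE"];
const TIME_FIELDS = ["last_login", "registration"];

// Parse "YYYY-MM-DD hh:mm:ss.f" into epoch milliseconds.
// Assumption: the value is interpreted as UTC.
function parseTimestamp(text) {
  const m = /^(\d{4})-(\d{2})-(\d{2}) (\d{2}):(\d{2}):(\d{2})/.exec(text);
  if (!m) return undefined;
  const [, y, mo, d, h, mi, s] = m.map(Number);
  return Date.UTC(y, mo - 1, d, h, mi, s);
}

// Return a copy of a raw profile with "null" strings dropped and the
// integer and time fields converted; every other field is copied unchanged.
function cleanProfile(raw) {
  const out = {};
  for (const [key, value] of Object.entries(raw)) {
    // Skip missing values instead of storing the string "null".
    if (value === null || value === undefined || value === "null") continue;
    if (INT_FIELDS.includes(key)) {
      const n = parseInt(value, 10);
      if (!Number.isNaN(n)) out[key] = n;
    } else if (TIME_FIELDS.includes(key)) {
      const t = parseTimestamp(String(value));
      if (t !== undefined) out[key] = t;
    } else {
      out[key] = value;
    }
  }
  return out;
}
```

Applied to the sample record quoted in the thread, this would keep only the populated fields, with `public`, `gender`, `completion_percentage` and `AGE` as numbers and the two timestamps as epoch milliseconds.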
