Hey Martin,

What was the outcome of this? Did you manage to run the benchmarks on Neo4j?

Cheers,
Manish

On Friday, June 28, 2013 at 10:45:40 PM UTC+10, Michael Hunger wrote:
>
> Hi Martin,
>
> that sounds great, please keep us in the loop about your progress, happy to
> support you whenever.
>
> Michael
>
>
> On Thu, Jun 27, 2013 at 7:04 PM, Martin Bravenboer <[email protected]> wrote:
>
>> Hi Michael,
>>
>> Thanks a lot for your help!
>>
>> The idea for a 'type' index worked very well. I was able to tune the
>> slowest queries (which included q2) to be about 10x faster. The nested
>> MATCH did not help in this particular case, but did help for other
>> queries.
>>
>> I'm working on moving the benchmark into a public repository, and do
>> also have some follow-up questions to understand the execution plans.
>> I briefly wanted to get back to you already to thank you for your help
>> while I'm working on that.
>>
>> Cheers,
>> Martin
>>
>>
>> On Wed, Jun 26, 2013 at 12:16 AM, Michael Hunger <[email protected]> wrote:
>> > Thanks so much Martin for working on that,
>> >
>> > I have no knowledge of LUBM so please excuse my ignorance.
>> >
>> > Can you say something about the data model and cardinalities in general?
>> >
>> > It would be great if you could share the imported zipped database, for
>> > instance on Dropbox, so we could take a stab at optimizing the queries.
>> >
>> >
>> > Query 2 is an OLAP query, something Neo4j is not optimized for per se.
>> >
>> > Especially for query 2 it might be interesting to index nodes by type
>> > too.
>> >
>> > So it would become:
>> >
>> >   START
>> >     student = node:types(type = 'GraduateStudent')
>> >   MATCH
>> >     student-[:memberOf]->dept,
>> >     dept-[:subOrganizationOf]->univ,
>> >     student-[:undergraduateDegreeFrom]->univ
>> >   RETURN
>> >     student, dept, univ
>> >
>> >
>> > I would probably rewrite the query to start at the universities, which
>> > are fewer starting nodes:
>> >
>> >   START
>> >     univ = node:types(type = 'University')
>> >   MATCH
>> >     student-[:memberOf]->dept,
>> >     dept-[:subOrganizationOf]->univ,
>> >     student-[:undergraduateDegreeFrom]->univ
>> >   RETURN
>> >     student, dept, univ
>> >
>> > Something else one might try is to break down the match into individual
>> > matches and reduce the amount of data processed in flight.
>> >
>> > Something like this:
>> >
>> >   START
>> >     univ = node:types(type = 'University')
>> >   MATCH
>> >     dept-[:subOrganizationOf]->univ
>> >   WITH univ, collect(dept) as depts
>> >
>> >   MATCH
>> >     student-[:undergraduateDegreeFrom]->univ
>> >   WHERE ANY(dept in depts : student-[:memberOf]->dept)
>> >   RETURN
>> >     student, univ
>> >
>> > Regarding your import code, it looks good, very clean.
>> > As stated, I would try to use an index to index per type.
>> > You should batch transactions in groups of 20-30k elements.
>> >
>> >
>> >
>> > On Tue, Jun 25, 2013 at 5:51 AM, Martin Bravenboer <[email protected]> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> To better understand the capabilities of graph databases, I'm working
>> >> on porting the LUBM benchmark (
>> >> http://swat.cse.lehigh.edu/projects/lubm/ ) to Neo4j. Because I'm not
>> >> yet very familiar with Neo4j, I'm looking for some general advice on
>> >> whether the approach I'm following seems wise (which is why I used the
>> >> mailing list instead of Stack Overflow).
>> >>
>> >> Initially, I tried to import the generated RDF data using
>> >> TinkerPop/Blueprints and run SPARQL queries using OpenRDF. This didn't
>> >> work out that well: the import of the large volume of RDF data
>> >> performed very poorly, to the extent that I really could not populate
>> >> a database with a reasonable scale. For this reason, I switched to
>> >> importing CSV files that we generate from the RDF data. The attached
>> >> program (Main.java) is a preliminary version of this import tool. I
>> >> also felt that the SPARQL approach would limit the tuning we can do on
>> >> the queries, and the RDF graph is so specific to RDF that it seems
>> >> hard to query it using Java or Cypher. The CSV data is imported in a
>> >> way that more closely resembles what seems to be a typical Neo4j
>> >> schema, for example basic properties like 'email' become properties of
>> >> the node, rather than separate nodes and edges. Because the LUBM
>> >> benchmark depends on some basic OWL inference capabilities, I'm also
>> >> adding some ad-hoc code to 'fix' the graph to manually do this
>> >> inference. You can see an example of this in the attached Main.java,
>> >> which is creating the proper edges for super-classes. This was also
>> >> needed in the original RDF version. This tool performs pretty nicely
>> >> now.
>> >>
>> >> Some questions I have here in this first cut of the benchmark:
>> >>
>> >> 1) I found this:
>> >> https://svn.neo4j.org/laboratory/users/johan/lubm/trunk/ , but the
>> >> implementation seemed very out of date, both in the import code and
>> >> in the Java-based queries. The implementation of some queries
>> >> also didn't seem very efficient, which you can see from the
>> >> spreadsheet in that repository. Are there any other LUBM
>> >> implementations around that I perhaps did not find?
>> >>
>> >> 2) Have other people also observed that importing RDF via the
>> >> Blueprints API performs significantly worse than importing a more
>> >> barebone graph using the Neo4j API directly? Is it a well-known thing
>> >> that querying via SPARQL/RDF is not the best demonstration of Neo4j's
>> >> abilities?
>> >>
>> >> 3) Do you see any bad/poorly performing patterns in the import code?
>> >> We're trying to first generate all nodes, and then separately create
>> >> the edges, to avoid having to do this for all edge data. Is that a
>> >> good pattern to follow?
>> >>
>> >> 4) I've included two Cypher queries (q1 and q2 from LUBM) that seem
>> >> like fairly faithful translations of the original queries. In q2, I had
>> >> some difficulty deciding what a good start node is. There really
>> >> isn't one, because it relates a whole bunch of nodes. This query
>> >> currently does not perform that well, do you have any suggestions to
>> >> tune it? In general I've had difficulty getting matches that relate
>> >> several nodes to perform well. Perhaps there is a better way to write
>> >> these?
>> >>
>> >> q1
>> >> ----------------
>> >> START course = node:ids(id =
>> >> 'http://www.Department0.University0.edu/GraduateCourse0')
>> >> MATCH course<-[:takesCourse]-x-[:type]->t
>> >> WHERE t.id = 'GraduateStudent'
>> >> RETURN x
>> >> ----------------
>> >>
>> >> q2
>> >> ----------------
>> >> START
>> >>   grad = node:ids(id = 'GraduateStudent')
>> >>   //univ = node:ids(id = 'University'),
>> >>   //dept = node:ids(id = 'Department')
>> >> MATCH
>> >>   x-[:type]->grad,
>> >>   //y-[:type]->univ,
>> >>   //z-[:type]->dept,
>> >>   x-[:memberOf]->z,
>> >>   z-[:subOrganizationOf]->y,
>> >>   x-[:undergraduateDegreeFrom]->y
>> >> RETURN
>> >>   x, y, z
>> >> ----------------
>> >>
>> >> For reference, the Java implementation of q2 in the link above was:
>> >>
>> >> https://svn.neo4j.org/laboratory/users/johan/lubm/trunk/src/main/java/org/neo4j/lubm/barebone/Query2.java
>> >>
>> >> I very much appreciate any advice the group can share!
>> >>
>> >> Once we have some more queries running, we'll be very happy to share
>> >> the implementation of the LUBM suite so that people can review the
>> >> implementation, or perhaps even use it in the future.
>> >>
>> >> Thanks,
>> >> Martin
>> >>
>> >> --
>> >> You received this message because you are subscribed to the Google
>> >> Groups "Neo4j" group.
>> >> To unsubscribe from this group and stop receiving emails from it, send
>> >> an email to [email protected].
>> >> For more options, visit https://groups.google.com/groups/opt_out.
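
A note on the batching advice in the thread: Michael recommends committing the import in groups of 20-30k elements rather than one transaction per element or one transaction for everything. Below is a minimal sketch of that pattern in plain Java. It does not use the Neo4j API and is not taken from Martin's Main.java, which is not reproduced in the thread. The names `Tx`, `TxFactory`, `RowHandler`, and `importInBatches` are hypothetical stand-ins for whatever opens a transaction and creates a node or edge in a real import tool.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedImportSketch {

    /** Hypothetical stand-in for a database transaction (NOT the Neo4j API). */
    public interface Tx {
        void commit();
    }

    /** Hypothetical stand-in for whatever opens a new transaction. */
    public interface TxFactory {
        Tx begin();
    }

    /** Hypothetical per-row action, e.g. creating one node or one edge. */
    public interface RowHandler {
        void handle(String row);
    }

    /**
     * Handles every row, committing once per full batch and once more for
     * a trailing partial batch. Returns the number of commits performed.
     */
    public static int importInBatches(List<String> rows, int batchSize,
                                      TxFactory factory, RowHandler handler) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int commits = 0;
        int inBatch = 0;
        Tx tx = null;
        for (String row : rows) {
            if (tx == null) {
                tx = factory.begin();   // open a transaction lazily per batch
            }
            handler.handle(row);
            inBatch++;
            if (inBatch == batchSize) { // batch is full: commit and reset
                tx.commit();
                commits++;
                tx = null;
                inBatch = 0;
            }
        }
        if (tx != null) {               // commit the trailing partial batch
            tx.commit();
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 50000; i++) {
            rows.add("row" + i);
        }
        final int[] handled = {0};
        // No-op transaction and a counting handler, purely for illustration.
        int commits = importInBatches(rows, 20000, () -> () -> { }, row -> handled[0]++);
        System.out.println(handled[0] + " rows in " + commits + " commits");
    }
}
```

With 50,000 rows and a batch size of 20,000, the sketch commits three times: two full batches and one partial batch. A real import would also need to roll back or close the open transaction if a row fails, which is left out here to keep the batching logic visible.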
