Hi Michael, thanks for your hints and sorry for my delayed response.
In the meantime (shortly after your response) my colleague found an error in his index definitions (prompted by your remarks), which made searching on existing labels much faster. However, he does not run into the same difficulties (slow performance) when writing new nodes that I have; I still have to look into the reason for these differences.

By the way: are you planning to offer a workshop on Neo4j in June around Berlin Buzzwords, as you did last year? I would be interested in taking part. The week before, I will meet people from SLUB in Dresden.

Günter

On Tuesday, March 22, 2016 at 8:16:32 AM UTC+1, Michael Hunger wrote:
>
> Hi Guenter,
>
> On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <[email protected]> wrote:
>
> Hi,
>
> we are running our first steps with Neo4j and have used various alternatives to create an initial database.
>
> 1) We used the Java API with an embedded database. Here
> https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
> a transaction is closed which surrounds 20,000 nodes with relationships to around 40,000 other nodes.
> We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.
>
> It depends on your disk performance, I hope you're not using a spinning disk?
>
> So you have 20k + 40k + 40k++ records that you write (plus properties)? Then you'd need 4G heap and a fast disk to write them away quickly.
>
> An option is to reduce the batch size, e.g. to 10k per tx in total. If your domain allows it, you can also parallelize node creation and relationship creation (watch out for writing to the same nodes, though).
>
> I have some comments on your code below; especially the tx handling for the schema creation has to be fixed.
>
> For the initial import, neo4j-import should work well for you.
>
> You only made the mistake of using your first file twice on the command line; you probably wanted to use the second file in the second place:
>
> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>
> You can also provide the overarching label for the nodes on the command line.
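For example, labels on the command line could look like this (just a sketch: ls.csv is an assumed name for the localsignature node file, which is never named in this thread; with the labels given on the command line, the :LABEL columns can be dropped from the CSV files):

./neo4j-import --into [path-to-db]/test.db/ \
    --nodes:LOCALSIGNATURE files/ls.csv \
    --nodes:BIBLIOGRAPHICRESOURCE files/br.csv \
    --relationships:SIGNATUREOF files/signatureof.csv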
>
> Cheers, Michael
>
> 2) Then I wanted to compare my results with the neo4j-import script provided by the Neo4j server.
> Using this method I have difficulties with the format of the csv files.
>
> My small examples:
>
> first node file:
> lsId:ID(localsignature),:LABEL
> "NEBIS/002527587",LOCALSIGNATURE
> "OCoLC/637556711",LOCALSIGNATURE
>
> second node file:
> brId:ID(bibliographicresource),active,:LABEL
> 146404300,true,BIBLIOGRAPHICRESOURCE
>
> relationship file:
> :START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
> 146404300,"NEBIS/002527587",SIGNATUREOF
> 146404300,"OCoLC/637556711",SIGNATUREOF
>
> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>
> which throws the exception:
>
> Done in 191ms
> Prepare node index
> Exception in thread "Thread-3" org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
> Id '146404300' is defined more than once in bibliographicresource, at least at
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
>     at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
>     at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
> Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual:
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> Caused by: Id '146404300' is defined more than once in bibliographicresource, at least at
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>
> I can't see any difference from the documentation at
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> because, as far as I can tell, I used the ID space notation correctly.
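(For what it's worth, the id-space notation in the headers above looks correct; the DuplicateInputIdException is simply the consequence of the mistake Michael points out above: br.csv is passed twice, so every brId enters the bibliographicresource id space twice. A corrected call could look like this, again assuming ls.csv as the name of the localsignature file:)

./neo4j-import --into [path-to-db]/test.db/ \
    --nodes files/ls.csv \
    --nodes files/br.csv \
    --relationships:SIGNATUREOF files/signatureof.csv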
>
> Code comments:
>
> package org.swissbib.linked.mf.writer;
>
> @Description("Transforms documents to a Neo4j graph.")
> @In(StreamReceiver.class)
> @Out(Void.class)
> public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {
>
>     private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
>     GraphDatabaseService graphDb;
>     File dbDir;
>     Node mainNode;
>     Transaction tx;
>     int batchSize;
>     int counter = 0;
>     boolean firstRecord = true;
>
>     public void setBatchSize(String batchSize) {
>         this.batchSize = Integer.parseInt(batchSize);
>     }
>
>     public void setDbDir(String dbDir) {
>         this.dbDir = new File(dbDir);
>     }
>
>     @Override
>     public void startRecord(String identifier) {
>
>         if (firstRecord) {
>             // is there no explicit onStartStream method in your API?
>             // otherwise it might be better to pass the graphDB in via the constructor or a setter
>             graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
>             tx = graphDb.beginTx();
>             graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
>             graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
>             graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
>             graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
>             tx.success();
>             // missing tx.close(): this is a schema tx, which can't be mixed with a data tx
>             // also, if this tx is not finished, the indexes and constraints will not be in place, so your lookups will be slow
>             firstRecord = false;
>             // create a new tx after tx.close()
>         }
>
>         counter += 1;
>         LOG.debug("Working on record {}", identifier);
>         if (identifier.contains("person")) {
>             mainNode = createNode(lsbLabels.PERSON, identifier, false);
>         } else if (identifier.contains("organisation")) {
>             mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
>         } else {
>             mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
>         }
>     }
>
>     @Override
>     public void endRecord() {
>         tx.success();
>         if (counter % batchSize == 0) {
>             LOG.info("Commit batch upload ({} records processed so far)", counter);
>             tx.close();
>             tx = graphDb.beginTx();
>         }
>         super.endRecord();
>     }
>
>     @Override
>     public void literal(String name, String value) {
>         Node node;
>
>         switch (name) {
>             case "br":
>                 node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
>                 mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
>                 break;
>             case "bf:local":
>                 node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
>                 node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
>                 break;
>             case "item":
>                 node = createNode(lsbLabels.ITEM, value, false);
>                 node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
>                 break;
>         }
>     }
>
>     // naming of variables!
>     // you might consider using an :Active label instead, which is more efficient than a property
>     // but good that you only set the property for the true value
>     private Node createNode(Label l, String v, boolean a) {
>         Node n = graphDb.createNode(l);
>         n.setProperty("name", v);
>         if (a)
>             n.setProperty("active", "true");
>         return n;
>     }
>
>     @Override
>     protected void onCloseStream() {
>         LOG.info("Cleaning up (altogether {} records processed)", counter);
>         // does this always happen after an endRecord? otherwise you need a tx.success() here
>         tx.close();
>     }
>
>     private enum lsbLabels implements Label {
>         BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
>     }
>
>     public enum lsbRelations implements RelationshipType {
>         CONTRIBUTOR, ITEMOF, SIGNATUREOF
>     }
> }
>
> Thanks for any hints!
>
> Günter
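Putting Michael's code comments together, here is a minimal sketch of what the corrected setup could look like (not a drop-in patch for NeoIndexer: the class name SchemaSetupSketch and the ACTIVE enum value are invented for the sketch). Note the .create() calls, which the listing above also misses; without them the index and constraint definitions are never registered. And since the uniqueness constraints are index-backed, the separate indexFor() calls on the same label/property combinations would be redundant anyway:

import java.io.File;
import java.util.concurrent.TimeUnit;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class SchemaSetupSketch {

    // same labels as in NeoIndexer, plus a hypothetical ACTIVE marker label
    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE, ACTIVE
    }

    private final GraphDatabaseService graphDb;

    SchemaSetupSketch(File dbDir) {
        graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
    }

    // run once, before the first data transaction is opened
    void createSchema() {
        try (Transaction tx = graphDb.beginTx()) {
            for (Label label : lsbLabels.values()) {
                if (label == lsbLabels.ACTIVE) continue; // marker label, has no "name"
                graphDb.schema().constraintFor(label)
                        .assertPropertyIsUnique("name")
                        .create(); // without create() nothing is registered
            }
            tx.success();
        } // try-with-resources closes the schema tx here, before any data tx

        // block until the constraint-backed indexes are populated,
        // so lookups are fast from the first batch on
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.schema().awaitIndexesOnline(10, TimeUnit.MINUTES);
            tx.success();
        }
    }

    // to be called inside a data transaction; uses an :ACTIVE label
    // instead of an "active" property, as suggested above
    Node createNode(Label label, String name, boolean active) {
        Node node = graphDb.createNode(label);
        node.setProperty("name", name);
        if (active)
            node.addLabel(lsbLabels.ACTIVE);
        return node;
    }
}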
