Yes, it was a nice day in the C-Base Raumstation. I'm located in Basel, Switzerland, and I'm working on the swissbib project (https://www.swissbib.ch/), which is why I'm going to meet people from SLUB to get more familiar with their D:Swarm solution. So in general Dresden is far away... but Buzzwords would be a possibility.
Very best wishes,
Günter

On Sunday, April 10, 2016 at 8:46:14 PM UTC+2, Michael Hunger wrote:

Last year we had a hackathon the Sunday before, and I presented on graph compute with Neo4j.

You can also meet me in Dresden when I'm around. Please let me know if you want to meet. Where are you originally located?

On Sun, Apr 10, 2016 at 7:42 PM, 'Guenter Hipler' via Neo4j <[email protected]> wrote:

Hi Michael,

thanks for your hints and sorry for my delayed response.

In the meantime (shortly after your response):
- my colleague found an error in his index definitions (because of your remarks), which made searching on existing labels much faster.
- he does not see the same difficulties (slow performance) when writing new nodes that I have. I still have to find out the reason for this difference.

By the way: do you plan to offer a workshop on Neo4j around Berlin Buzzwords in June, as you did last year? I would be interested in taking part.

The week before that I will meet people from SLUB in Dresden.

Günter

On Tuesday, March 22, 2016 at 8:16:32 AM UTC+1, Michael Hunger wrote:

Hi Guenter,

On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <[email protected]> wrote:

Hi,

we are taking our first steps with Neo4j and used various alternatives to create an initial database.

1) We used the Java API with an embedded database. Here

https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76

a transaction is closed which surrounds 20,000 nodes with relationships to around 40,000 other nodes. We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.

It depends on your disk performance; I hope you're not using a spinning disk?

So you have 20k + 40k + 40k++ records that you write (plus properties)? Then you'd need 4G heap and a fast disk to write them away quickly.

An option is to reduce the batch size to e.g. 10k per tx in total. If your domain allows it you can also parallelize node creation and relationship creation (watch out for writing to the same nodes, though).

I have some comments on your code below; especially the tx handling for the schema creation has to be fixed.

For the initial import neo4j-import should work well for you. You only made the mistake of using your first file twice on the command line; you probably wanted to use the second file in the second place.

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv

You can also provide the overarching label for the nodes on the command line.
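[A corrected invocation along the lines Michael describes could look like the following; the file name files/ls.csv for the localsignature node file is an assumption (only its header appears further down in the thread), and the labels on --nodes simply illustrate passing the overarching label on the command line:

./neo4j-import --into [path-to-db]/test.db/ --nodes:LOCALSIGNATURE files/ls.csv --nodes:BIBLIOGRAPHICRESOURCE files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
]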
Cheers,
Michael

2) Then I wanted to compare my results with the neo4j-import script provided by the Neo4j server. Using this method I have difficulties with the format of the CSV files.

My small examples:

first node file:
lsId:ID(localsignature),:LABEL
"NEBIS/002527587",LOCALSIGNATURE
"OCoLC/637556711",LOCALSIGNATURE

second node file:
brId:ID(bibliographicresource),active,:LABEL
146404300,true,BIBLIOGRAPHICRESOURCE

relationship file:
:START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
146404300,"NEBIS/002527587",SIGNATUREOF
146404300,"OCoLC/637556711",SIGNATUREOF

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv

which throws the exception

Done in 191ms
Prepare node index
Exception in thread "Thread-3" org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
    at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
Caused by: Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2

I can't see where my files differ from the documentation at http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces, because I did try to use the ID space notation (as far as I can see...).
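[A side note on the id spaces: the headers above already put the two node files into separate id spaces, localsignature and bibliographicresource, so the same raw id value may appear once in each group without clashing; ids only have to be unique within a group. The exception is therefore not an id-space problem but a consequence of br.csv being listed twice, which defines every id in the bibliographicresource group a second time. A minimal, hypothetical illustration of two rows that would not clash:

brId:ID(bibliographicresource),active,:LABEL
146404300,true,BIBLIOGRAPHICRESOURCE

lsId:ID(localsignature),:LABEL
146404300,LOCALSIGNATURE
]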
Code comments:

package org.swissbib.linked.mf.writer;

@Description("Transforms documents to a Neo4j graph.")
@In(StreamReceiver.class)
@Out(Void.class)
public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {

    private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
    GraphDatabaseService graphDb;
    File dbDir;
    Node mainNode;
    Transaction tx;
    int batchSize;
    int counter = 0;
    boolean firstRecord = true;

    public void setBatchSize(String batchSize) {
        this.batchSize = Integer.parseInt(batchSize);
    }

    public void setDbDir(String dbDir) {
        this.dbDir = new File(dbDir);
    }

    @Override
    public void startRecord(String identifier) {

        if (firstRecord) {
            // is there no explicit onStartStream method in your API?
            // otherwise it might be better to pass the graphDb in to the constructor or via a setter
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            tx = graphDb.beginTx();
            graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
            graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
            graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
            graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
            tx.success();
            // misses tx.close(), as this is a schema tx which can't be mixed with a data tx
            // also, if this tx is not finished, the indexes and constraints will not be in place, so your lookups will be slow
            firstRecord = false;
            // create new tx after tx.close()
        }

        counter += 1;
        LOG.debug("Working on record {}", identifier);
        if (identifier.contains("person")) {
            mainNode = createNode(lsbLabels.PERSON, identifier, false);
        } else if (identifier.contains("organisation")) {
            mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
        } else {
            mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
        }
    }

    @Override
    public void endRecord() {
        tx.success();
        if (counter % batchSize == 0) {
            LOG.info("Commit batch upload ({} records processed so far)", counter);
            tx.close();
            tx = graphDb.beginTx();
        }
        super.endRecord();
    }

    @Override
    public void literal(String name, String value) {
        Node node;

        switch (name) {
            case "br":
                node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
                mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
                break;
            case "bf:local":
                node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
                break;
            case "item":
                node = createNode(lsbLabels.ITEM, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
                break;
        }
    }

    // naming of variables!
    // you might consider using an :Active label instead, which is more efficient than a property
    // but good that you only set the property for the true value
    private Node createNode(Label l, String v, boolean a) {
        Node n = graphDb.createNode(l);
        n.setProperty("name", v);
        if (a)
            n.setProperty("active", "true");
        return n;
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
        // does this always happen after an endRecord? otherwise you need a tx.success() here
        tx.close();
    }

    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
    }

    public enum lsbRelations implements RelationshipType {
        CONTRIBUTOR, ITEMOF, SIGNATUREOF
    }
}

Thanks for any hints!

Günter
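[Pulling Michael's inline comments together, a minimal sketch of how the transaction handling in NeoIndexer could be restructured follows. It reuses the fields and labels from the code above and is only an illustration of the suggested fix, not the project's actual change. Note that the schema builder calls also need create(), and that a uniqueness constraint already creates a backing index on "name", so the separate indexFor(...) calls for those labels are omitted here:

    @Override
    public void startRecord(String identifier) {
        if (firstRecord) {
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            // schema changes live in their own transaction and must be
            // closed before any data transaction is opened
            try (Transaction schemaTx = graphDb.beginTx()) {
                graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name").create();
                schemaTx.success();
            }
            // only now begin the first data transaction
            tx = graphDb.beginTx();
            firstRecord = false;
        }
        counter += 1;
        // ... node creation as in the original startRecord ...
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
        tx.success();   // mark the last, possibly partial, batch as successful
        tx.close();
    }
]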
