Hi Michael, thanks for your hints and sorry for my delayed response.
In the meantime (shortly after your response) my colleague found an error in his index definitions (prompted by your remarks), which made searching on existing labels much faster. However, he does not run into the same difficulties (slow performance) when writing new nodes that I have; I still have to look into the reason for these differences.

By the way: are you planning to offer a workshop on Neo4j in June around Berlin Buzzwords, as you did last year? I would be interested in taking part. The week before, I will meet people from SLUB in Dresden.

Günter

On Tuesday, March 22, 2016 at 8:16:32 AM UTC+1, Michael Hunger wrote:
>
> Hi Guenter,
>
> On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <[email protected]> wrote:
>
> Hi,
>
> we are running our first steps with Neo4j and have used various alternatives to create an initial database.
>
> 1) We used the Java API with an embedded database. Here
> https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
> a transaction is closed which surrounds 20,000 nodes with relationships to around 40,000 other nodes.
> We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.
>
> It depends on your disk performance, I hope you're not using a spinning disk?
>
> So you have 20k + 40k + 40k++ records that you write (plus properties)? Then you'd need 4G heap and a fast disk to write them away quickly.
>
> An option is to reduce the batch size, e.g. to 10k per tx in total. If your domain allows it, you can also parallelize node creation and relationship creation (watch out for writing to the same nodes, though).
>
> I have some comments on your code below; especially the tx handling for the schema creation has to be fixed.
>
> For the initial import, neo4j-import should work well for you.
>
> You only made the mistake of using your first file twice on the command line; you probably wanted to use the second file in the second place:
>
> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>
> You can also provide the overarching label for the nodes on the command line.
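For example, labels on the command line could look like this (just a sketch: ls.csv is an assumed name for the localsignature node file, which is never named in this thread; with the labels given on the command line, the :LABEL columns can be dropped from the CSV files):

./neo4j-import --into [path-to-db]/test.db/ \
    --nodes:LOCALSIGNATURE files/ls.csv \
    --nodes:BIBLIOGRAPHICRESOURCE files/br.csv \
    --relationships:SIGNATUREOF files/signatureof.csv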
>
> Cheers, Michael
>
> 2) Then I wanted to compare my results with the neo4j-import script provided by the Neo4j server.
> Using this method I have difficulties with the format of the csv files.
>
> My small examples:
>
> first node file:
> lsId:ID(localsignature),:LABEL
> "NEBIS/002527587",LOCALSIGNATURE
> "OCoLC/637556711",LOCALSIGNATURE
>
> second node file:
> brId:ID(bibliographicresource),active,:LABEL
> 146404300,true,BIBLIOGRAPHICRESOURCE
>
> relationship file:
> :START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
> 146404300,"NEBIS/002527587",SIGNATUREOF
> 146404300,"OCoLC/637556711",SIGNATUREOF
>
> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>
> which throws the exception:
>
> Done in 191ms
> Prepare node index
> Exception in thread "Thread-3" org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
> Id '146404300' is defined more than once in bibliographicresource, at least at
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
>     at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
>     at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
> Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual:
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> Caused by: Id '146404300' is defined more than once in bibliographicresource, at least at
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>
> I can't see any difference from the documentation at
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> because, as far as I can tell, I used the ID space notation correctly.
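(For what it's worth, the id-space notation in the headers above looks correct; the DuplicateInputIdException is simply the consequence of the mistake Michael points out above: br.csv is passed twice, so every brId enters the bibliographicresource id space twice. A corrected call could look like this, again assuming ls.csv as the name of the localsignature file:)

./neo4j-import --into [path-to-db]/test.db/ \
    --nodes files/ls.csv \
    --nodes files/br.csv \
    --relationships:SIGNATUREOF files/signatureof.csv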
>
> Code comments:
>
> package org.swissbib.linked.mf.writer;
>
> @Description("Transforms documents to a Neo4j graph.")
> @In(StreamReceiver.class)
> @Out(Void.class)
> public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {
>
>     private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
>     GraphDatabaseService graphDb;
>     File dbDir;
>     Node mainNode;
>     Transaction tx;
>     int batchSize;
>     int counter = 0;
>     boolean firstRecord = true;
>
>     public void setBatchSize(String batchSize) {
>         this.batchSize = Integer.parseInt(batchSize);
>     }
>
>     public void setDbDir(String dbDir) {
>         this.dbDir = new File(dbDir);
>     }
>
>     @Override
>     public void startRecord(String identifier) {
>
>         if (firstRecord) {
>             // is there no explicit onStartStream method in your API?
>             // otherwise it might be better to pass the graphDB in via the constructor or a setter
>             graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
>             tx = graphDb.beginTx();
>             graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
>             graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
>             graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
>             graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
>             graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
>             tx.success();
>             // missing tx.close(): this is a schema tx, which can't be mixed with a data tx
>             // also, if this tx is not finished, the indexes and constraints will not be in place, so your lookups will be slow
>             firstRecord = false;
>             // create a new tx after tx.close()
>         }
>
>         counter += 1;
>         LOG.debug("Working on record {}", identifier);
>         if (identifier.contains("person")) {
>             mainNode = createNode(lsbLabels.PERSON, identifier, false);
>         } else if (identifier.contains("organisation")) {
>             mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
>         } else {
>             mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
>         }
>     }
>
>     @Override
>     public void endRecord() {
>         tx.success();
>         if (counter % batchSize == 0) {
>             LOG.info("Commit batch upload ({} records processed so far)", counter);
>             tx.close();
>             tx = graphDb.beginTx();
>         }
>         super.endRecord();
>     }
>
>     @Override
>     public void literal(String name, String value) {
>         Node node;
>
>         switch (name) {
>             case "br":
>                 node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
>                 mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
>                 break;
>             case "bf:local":
>                 node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
>                 node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
>                 break;
>             case "item":
>                 node = createNode(lsbLabels.ITEM, value, false);
>                 node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
>                 break;
>         }
>     }
>
>     // naming of variables!
>     // you might consider using an :Active label instead, which is more efficient than a property
>     // but good that you only set the property for the true value
>     private Node createNode(Label l, String v, boolean a) {
>         Node n = graphDb.createNode(l);
>         n.setProperty("name", v);
>         if (a)
>             n.setProperty("active", "true");
>         return n;
>     }
>
>     @Override
>     protected void onCloseStream() {
>         LOG.info("Cleaning up (altogether {} records processed)", counter);
>         // does this always happen after an endRecord? otherwise you need a tx.success() here
>         tx.close();
>     }
>
>     private enum lsbLabels implements Label {
>         BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
>     }
>
>     public enum lsbRelations implements RelationshipType {
>         CONTRIBUTOR, ITEMOF, SIGNATUREOF
>     }
> }
>
> Thanks for any hints!
>
> Günter
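Putting Michael's code comments together, here is a minimal sketch of what the corrected setup could look like (not a drop-in patch for NeoIndexer: the class name SchemaSetupSketch and the ACTIVE enum value are invented for the sketch). Note the .create() calls, which the listing above also misses; without them the index and constraint definitions are never registered. And since the uniqueness constraints are index-backed, the separate indexFor() calls on the same label/property combinations would be redundant anyway:

import java.io.File;
import java.util.concurrent.TimeUnit;

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class SchemaSetupSketch {

    // same labels as in NeoIndexer, plus a hypothetical ACTIVE marker label
    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE, ACTIVE
    }

    private final GraphDatabaseService graphDb;

    SchemaSetupSketch(File dbDir) {
        graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
    }

    // run once, before the first data transaction is opened
    void createSchema() {
        try (Transaction tx = graphDb.beginTx()) {
            for (Label label : lsbLabels.values()) {
                if (label == lsbLabels.ACTIVE) continue; // marker label, has no "name"
                graphDb.schema().constraintFor(label)
                        .assertPropertyIsUnique("name")
                        .create(); // without create() nothing is registered
            }
            tx.success();
        } // try-with-resources closes the schema tx here, before any data tx

        // block until the constraint-backed indexes are populated,
        // so lookups are fast from the first batch on
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.schema().awaitIndexesOnline(10, TimeUnit.MINUTES);
            tx.success();
        }
    }

    // to be called inside a data transaction; uses an :ACTIVE label
    // instead of an "active" property, as suggested above
    Node createNode(Label label, String name, boolean active) {
        Node node = graphDb.createNode(label);
        node.setProperty("name", name);
        if (active)
            node.addLabel(lsbLabels.ACTIVE);
        return node;
    }
}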
