Last year we had a hackathon on the Sunday before the conference, and I
presented on graph compute with Neo4j.

You can also meet me in Dresden when I'm around.

Please let me know if you'd like to meet. Where are you originally based?



On Sun, Apr 10, 2016 at 7:42 PM, 'Guenter Hipler' via Neo4j <
[email protected]> wrote:

> Hi Michael,
>
> thanks for your hints and sorry for my delayed response.
>
> In the meantime (shortly after your response):
> - my colleague found an error in his index definitions (prompted by your
> remarks), which made searches on existing labels way faster.
> - he doesn't run into the same difficulties (slow performance) when writing
> new nodes that I do. I still have to check the reason for this difference.
>
> By the way: are you planning to offer a workshop on Neo4j around Berlin
> Buzzwords in June, as you did last year? I would be interested in taking
> part.
>
> The week before, I will meet people from SLUB in Dresden.
>
> Günter
>
>
>
> On Tuesday, March 22, 2016 at 8:16:32 AM UTC+1, Michael Hunger wrote:
>>
>> Hi Guenter,
>>
>>
>> On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <
>> [email protected]> wrote:
>>
>> Hi
>>
>> we are taking our first steps with Neo4j and have used various alternatives
>> to create an initial database:
>>
>> 1) we used the Java API with an embedded database.
>> Here
>>
>> https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
>> a transaction is closed which covers 20,000 nodes with relationships
>> to around 40,000 other nodes.
>> We are surprised that the Transaction.close() method needs up to 30 seconds
>> to write these nodes to disk.
>>
>>
>> It depends on your disk performance; I hope you're not using a spinning
>> disk?
>>
>> So you have 20k + 40k + 40k++ records that you write (plus properties)?
>> Then you'd need a 4G heap and a fast disk to write them out quickly.
>>
>> An option is to reduce the batch size, e.g. to 10k per tx in total. If your
>> domain allows it, you can also parallelize node creation and relationship
>> creation (watch out for concurrent writes to the same nodes, though).
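>>
>> For the embedded case the heap is that of your own JVM; as a sketch (the
>> jar and main class names here are only placeholders, not from your project):
>>
>> java -Xmx4g -cp your-importer.jar your.main.ImporterClass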
>>
>> I have some comments on your code below; especially the tx handling for the
>> schema creation has to be fixed.
>>
>> For the initial import neo4j-import should work well for you.
>>
>> You only made the mistake of using your first file twice on the command
>> line; you probably wanted to use the second file in the second place. Your
>> call was:
>>
>> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv
>> --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>>
>>
>> You can also provide the overarching label for the nodes on the command
>> line, as in the sketch below.
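>>
>> For example (a sketch of the corrected call; the name files/ls.csv for your
>> first node file is only a guess, since the mail doesn't show its real name):
>>
>> ./neo4j-import --into [path-to-db]/test.db/ \
>>     --nodes:LOCALSIGNATURE files/ls.csv \
>>     --nodes:BIBLIOGRAPHICRESOURCE files/br.csv \
>>     --relationships:SIGNATUREOF files/signatureof.csv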
>>
>>
>> Cheers, Michael
>>
>>
>> 2) then I wanted to compare my results with the neo4j-import script
>> provided by the Neo4j server.
>> Using this method I ran into difficulties with the format of the CSV files.
>>
>> My small examples:
>> first node file:
>> lsId:ID(localsignature),:LABEL
>> "NEBIS/002527587",LOCALSIGNATURE
>> "OCoLC/637556711",LOCALSIGNATURE
>>
>>
>>
>> second node file:
>> brId:ID(bibliographicresource),active,:LABEL
>> 146404300,true,BIBLIOGRAPHICRESOURCE
>>
>>
>> relationship file
>> :START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
>> 146404300,"NEBIS/002527587",SIGNATUREOF
>> 146404300,"OCoLC/637556711",SIGNATUREOF
>>
>> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes
>> files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
>> which throws the exception
>>
>> Done in 191ms
>> Prepare node index
>> Exception in thread "Thread-3"
>> org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
>> Id '146404300' is defined more than once in bibliographicresource, at least at
>> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
>> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
>>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
>>     at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
>>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
>>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
>>     at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
>>     at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
>>     at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
>> Duplicate input ids that would otherwise clash can be put into separate
>> id spaces; read more about how to use id spaces in the manual:
>> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
>> Caused by: Id '146404300' is defined more than once in
>> bibliographicresource, at least at
>> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and
>> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>>
>>
>> I can't see what I am doing differently from the documentation at
>>
>> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
>> because I tried to use the ID space notation (as far as I can see...)
>>
>>
>>
>> Code comments:
>>
>> package org.swissbib.linked.mf.writer;
>>
>>
>> @Description("Transforms documents to a Neo4j graph.")
>> @In(StreamReceiver.class)
>> @Out(Void.class)
>> public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {
>>
>>     private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
>>     GraphDatabaseService graphDb;
>>     File dbDir;
>>     Node mainNode;
>>     Transaction tx;
>>     int batchSize;
>>     int counter = 0;
>>     boolean firstRecord = true;
>>
>>
>>     public void setBatchSize(String batchSize) {
>>         this.batchSize = Integer.parseInt(batchSize);
>>     }
>>
>>     public void setDbDir(String dbDir) {
>>         this.dbDir = new File(dbDir);
>>     }
>>
>>     @Override
>>     public void startRecord(String identifier) {
>>
>>         if (firstRecord) {
>> // is there no explicit onStartStream method in your API?
>>
>> // otherwise it might be better to pass the graphDb in via the constructor
>> // or via a setter
>>
>>             graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
>>             tx = graphDb.beginTx();
>>             graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
>>             graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
>>             graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
>>             graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
>>             graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
>>             graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
>>             graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
>>             graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
>>             tx.success();
>>
>> // missing tx.close(): this is a schema tx, which can't be mixed with a data tx
>>
>> // also, if this tx is not finished, the indexes and constraints will not be
>> // in place, so your lookups will be slow
>>             firstRecord = false;
>> // create new tx after tx.close()
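>>
>> // e.g., a minimal sketch of what the comments above describe, using the
>> // fields already defined in this class (as an aside: the IndexCreator and
>> // ConstraintCreator chains above also need a trailing .create() call to
>> // take effect):
>> //
>> //     tx.close();                        // commit the schema transaction
>> //     graphDb.schema().awaitIndexesOnline(10, java.util.concurrent.TimeUnit.SECONDS);
>> //     tx = graphDb.beginTx();            // fresh transaction for the data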
>>
>>         }
>>
>>         counter += 1;
>>         LOG.debug("Working on record {}", identifier);
>>         if (identifier.contains("person")) {
>>             mainNode = createNode(lsbLabels.PERSON, identifier, false);
>>         } else if (identifier.contains("organisation")) {
>>             mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
>>         } else {
>>             mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
>>         }
>>
>>     }
>>
>>     @Override
>>     public void endRecord() {
>>         tx.success();
>>         if (counter % batchSize == 0) {
>>             LOG.info("Commit batch upload ({} records processed so far)", counter);
>>             tx.close();
>>             tx = graphDb.beginTx();
>>         }
>>         super.endRecord();
>>     }
>>
>>     @Override
>>     public void literal(String name, String value) {
>>         Node node;
>>
>>         switch (name) {
>>             case "br":
>>                 node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
>>                 mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
>>                 break;
>>             case "bf:local":
>>                 node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
>>                 node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
>>                 break;
>>             case "item":
>>                 node = createNode(lsbLabels.ITEM, value, false);
>>                 node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
>>                 break;
>>         }
>>     }
>>
>> // naming of variables!
>>
>> // you might consider using an :Active label instead, which is more
>> // efficient than a property
>>
>> // but good that you only set the property for the true value
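>>
>> // e.g., a sketch (ACTIVE would be a new, hypothetical value in the
>> // lsbLabels enum below; it is not in the original code):
>> //
>> //     if (a) n.addLabel(lsbLabels.ACTIVE);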
>>
>>     private Node createNode(Label l, String v, boolean a) {
>>         Node n = graphDb.createNode(l);
>>         n.setProperty("name", v);
>>         if (a)
>>             n.setProperty("active", "true");
>>         return n;
>>     }
>>
>>     @Override
>>     protected void onCloseStream() {
>>         LOG.info("Cleaning up (altogether {} records processed)", counter);
>> // does this always happen after an endRecord? otherwise you need a
>> // tx.success() here
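>>
>> // e.g., a defensive sketch, in case the stream can end mid-record,
>> // placed just before the tx.close() below:
>> //
>> //     tx.success();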
>>
>>         tx.close();
>>     }
>>
>>     private enum lsbLabels implements Label {
>>         BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
>>     }
>>
>>     public enum lsbRelations implements RelationshipType {
>>         CONTRIBUTOR, ITEMOF, SIGNATUREOF
>>     }
>> }
>>
>>
>>
>> Thanks for any hints!
>>
>> Günter
>>
>>
