Yes, it was a nice day in the C-Base Raumstation. I'm located in Basel, Switzerland, and I'm working on the swissbib project (https://www.swissbib.ch/), which is why I'm going to meet people from SLUB to get more familiar with their D:Swarm solution. So in general Dresden is far away... but Buzzwords would be a possibility.
Very best wishes,
Günter

On Sunday, April 10, 2016 at 8:46:14 PM UTC+2, Michael Hunger wrote:

Last year we had a hackathon the Sunday before, and I presented on graph compute with Neo4j.

You can also meet me in Dresden when I'm around. Please let me know if you want to meet. Where are you originally located?

On Sun, Apr 10, 2016 at 7:42 PM, 'Guenter Hipler' via Neo4j <[email protected]> wrote:

Hi Michael,

thanks for your hints and sorry for my delayed response.

In the meantime (shortly after your response):
- my colleague found an error in his index definitions (because of your remarks), which made searching on existing labels much faster.
- he does not see the same difficulties (slow performance) when writing new nodes that I have. I still have to find out the reason for this difference.

By the way: do you plan to offer a workshop on Neo4j around Berlin Buzzwords in June, as you did last year? I would be interested in taking part.

The week before that I will meet people from SLUB in Dresden.

Günter

On Tuesday, March 22, 2016 at 8:16:32 AM UTC+1, Michael Hunger wrote:

Hi Guenter,

On 21.03.2016 at 23:43, 'Guenter Hipler' via Neo4j <[email protected]> wrote:

Hi,

we are taking our first steps with Neo4j and used various alternatives to create an initial database.

1) We used the Java API with an embedded database. Here

https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76

a transaction is closed which surrounds 20,000 nodes with relationships to around 40,000 other nodes. We are surprised that the Transaction.close() method needs up to 30 seconds to write these nodes to disk.

It depends on your disk performance; I hope you're not using a spinning disk?

So you have 20k + 40k + 40k++ records that you write (plus properties)? Then you'd need 4G heap and a fast disk to write them away quickly.

An option is to reduce the batch size to e.g. 10k per tx in total. If your domain allows it you can also parallelize node creation and relationship creation (watch out for writing to the same nodes, though).

I have some comments on your code below; especially the tx handling for the schema creation has to be fixed.

For the initial import neo4j-import should work well for you. You only made the mistake of using your first file twice on the command line; you probably wanted to use the second file in the second place.

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv

You can also provide the overarching label for the nodes on the command line.
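[A corrected invocation along the lines Michael describes could look like the following; the file name files/ls.csv for the localsignature node file is an assumption (only its header appears further down in the thread), and the labels on --nodes simply illustrate passing the overarching label on the command line:

./neo4j-import --into [path-to-db]/test.db/ --nodes:LOCALSIGNATURE files/ls.csv --nodes:BIBLIOGRAPHICRESOURCE files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
]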
Cheers,
Michael

2) Then I wanted to compare my results with the neo4j-import script provided by the Neo4j server. Using this method I have difficulties with the format of the CSV files.

My small examples:

first node file:
lsId:ID(localsignature),:LABEL
"NEBIS/002527587",LOCALSIGNATURE
"OCoLC/637556711",LOCALSIGNATURE

second node file:
brId:ID(bibliographicresource),active,:LABEL
146404300,true,BIBLIOGRAPHICRESOURCE

relationship file:
:START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
146404300,"NEBIS/002527587",SIGNATUREOF
146404300,"OCoLC/637556711",SIGNATUREOF

./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes files/br.csv --relationships:SIGNATUREOF files/signatureof.csv

which throws the exception

Done in 191ms
Prepare node index
Exception in thread "Thread-3" org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
    at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
    at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
    at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
    at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
Duplicate input ids that would otherwise clash can be put into separate id space, read more about how to use id spaces in the manual: http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
Caused by: Id '146404300' is defined more than once in bibliographicresource, at least at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2

I can't see where my files differ from the documentation at http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces, because I did try to use the ID space notation (as far as I can see...).
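[A side note on the id spaces: the headers above already put the two node files into separate id spaces, localsignature and bibliographicresource, so the same raw id value may appear once in each group without clashing; ids only have to be unique within a group. The exception is therefore not an id-space problem but a consequence of br.csv being listed twice, which defines every id in the bibliographicresource group a second time. A minimal, hypothetical illustration of two rows that would not clash:

brId:ID(bibliographicresource),active,:LABEL
146404300,true,BIBLIOGRAPHICRESOURCE

lsId:ID(localsignature),:LABEL
146404300,LOCALSIGNATURE
]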
Code comments:

package org.swissbib.linked.mf.writer;

@Description("Transforms documents to a Neo4j graph.")
@In(StreamReceiver.class)
@Out(Void.class)
public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {

    private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
    GraphDatabaseService graphDb;
    File dbDir;
    Node mainNode;
    Transaction tx;
    int batchSize;
    int counter = 0;
    boolean firstRecord = true;

    public void setBatchSize(String batchSize) {
        this.batchSize = Integer.parseInt(batchSize);
    }

    public void setDbDir(String dbDir) {
        this.dbDir = new File(dbDir);
    }

    @Override
    public void startRecord(String identifier) {

        if (firstRecord) {
            // is there no explicit onStartStream method in your API?
            // otherwise it might be better to pass the graphDb in to the constructor or via a setter
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            tx = graphDb.beginTx();
            graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
            graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
            graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
            graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
            tx.success();
            // misses tx.close(), as this is a schema tx which can't be mixed with a data tx
            // also, if this tx is not finished, the indexes and constraints will not be in place, so your lookups will be slow
            firstRecord = false;
            // create new tx after tx.close()
        }

        counter += 1;
        LOG.debug("Working on record {}", identifier);
        if (identifier.contains("person")) {
            mainNode = createNode(lsbLabels.PERSON, identifier, false);
        } else if (identifier.contains("organisation")) {
            mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
        } else {
            mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
        }
    }

    @Override
    public void endRecord() {
        tx.success();
        if (counter % batchSize == 0) {
            LOG.info("Commit batch upload ({} records processed so far)", counter);
            tx.close();
            tx = graphDb.beginTx();
        }
        super.endRecord();
    }

    @Override
    public void literal(String name, String value) {
        Node node;

        switch (name) {
            case "br":
                node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
                mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
                break;
            case "bf:local":
                node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
                break;
            case "item":
                node = createNode(lsbLabels.ITEM, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
                break;
        }
    }

    // naming of variables!
    // you might consider using an :Active label instead, which is more efficient than a property
    // but good that you only set the property for the true value
    private Node createNode(Label l, String v, boolean a) {
        Node n = graphDb.createNode(l);
        n.setProperty("name", v);
        if (a)
            n.setProperty("active", "true");
        return n;
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
        // does this always happen after an endRecord? otherwise you need a tx.success() here
        tx.close();
    }

    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
    }

    public enum lsbRelations implements RelationshipType {
        CONTRIBUTOR, ITEMOF, SIGNATUREOF
    }
}

Thanks for any hints!

Günter
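[Pulling Michael's inline comments together, a minimal sketch of how the transaction handling in NeoIndexer could be restructured follows. It reuses the fields and labels from the code above and is only an illustration of the suggested fix, not the project's actual change. Note that the schema builder calls also need create(), and that a uniqueness constraint already creates a backing index on "name", so the separate indexFor(...) calls for those labels are omitted here:

    @Override
    public void startRecord(String identifier) {
        if (firstRecord) {
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            // schema changes live in their own transaction and must be
            // closed before any data transaction is opened
            try (Transaction schemaTx = graphDb.beginTx()) {
                graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name").create();
                graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name").create();
                schemaTx.success();
            }
            // only now begin the first data transaction
            tx = graphDb.beginTx();
            firstRecord = false;
        }
        counter += 1;
        // ... node creation as in the original startRecord ...
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
        tx.success();   // mark the last, possibly partial, batch as successful
        tx.close();
    }
]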
