Hi Guenter,

> On 21 Mar 2016, at 23:43, 'Guenter Hipler' via Neo4j 
> <[email protected]> wrote:
> 
> Hi
> 
> we are taking our first steps with Neo4j and have used various alternatives to 
> create an initial database
> 
> 1) we used the Java API with an embedded database
> here 
> https://github.com/linked-swissbib/swissbib-metafacture-commands/blob/neo4j-tests/src/main/java/org/swissbib/linked/mf/writer/NeoIndexer.java#L76
> a transaction is closed which wraps 20,000 nodes with relationships to 
> around 40,000 other nodes. 
> We are surprised that the Transaction.close() method needs up to 30 seconds to 
> write these nodes to disk
> 

That depends on your disk performance; I hope you're not using a spinning disk?

So you are writing roughly 20k + 40k nodes plus 40k or more relationships (and their properties) in a single transaction?
Then you'd want about 4G of heap for the importing JVM (e.g. -Xmx4g) and a fast disk to write them out quickly.

One option is to reduce the batch size to e.g. 10k records per transaction in total. If your domain 
allows it, you can also parallelize node creation and relationship creation separately (watch 
out for concurrent writes to the same nodes though).
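
Purely to illustrate the batching pattern (this is only a sketch, not your metafacture pipeline; the label, property and record count are placeholders):

import java.io.File;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class BatchedImport {

    private static final int BATCH_SIZE = 10_000; // commit every 10k records

    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("test.db"));
        Label label = DynamicLabel.label("BIBLIOGRAPHICRESOURCE"); // placeholder label
        Transaction tx = db.beginTx();
        try {
            for (int i = 0; i < 100_000; i++) {      // placeholder record count
                Node n = db.createNode(label);
                n.setProperty("name", "record-" + i);
                if ((i + 1) % BATCH_SIZE == 0) {     // close the tx every BATCH_SIZE records
                    tx.success();
                    tx.close();
                    tx = db.beginTx();
                }
            }
            tx.success();
        } finally {
            tx.close();
            db.shutdown();
        }
    }
}

That is essentially the pattern your endRecord() below already uses, just with a smaller, fixed batch size.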

I have some comments on your code below; in particular, the transaction handling for the 
schema creation has to be fixed.

For the initial import neo4j-import should work well for you.

The only mistake is that you passed your first node file (br.csv) twice on the command line; 
the second --nodes argument should point to your second node file instead. That is exactly what 
the DuplicateInputIdException below is complaining about: the same rows of br.csv end up twice 
in the bibliographicresource id space.

> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes 
> files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
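
Assuming your local-signature node file is called something like files/ls.csv (the actual file name isn't given in your mail), the corrected call would be:

    ./neo4j-import --into [path-to-db]/test.db/ \
        --nodes files/br.csv --nodes files/ls.csv \
        --relationships:SIGNATUREOF files/signatureof.csv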

You can also provide the overarching label for the nodes on the command line.
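
For example (ls.csv is again just a placeholder name; something like this should work with the 2.3 import tool):

    ./neo4j-import --into [path-to-db]/test.db/ \
        --nodes:BIBLIOGRAPHICRESOURCE files/br.csv \
        --nodes:LOCALSIGNATURE files/ls.csv \
        --relationships:SIGNATUREOF files/signatureof.csv

With the labels given on the command line you should be able to drop the :LABEL column from the CSV headers.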


Cheers, Michael

> 
> 2) then I wanted to compare my results with the neo4j-import script provided 
> by the Neo4j server.
> Using this method I have difficulties with the format of the CSV files.
> 
> My small examples:
> first node file:
> lsId:ID(localsignature),:LABEL
> "NEBIS/002527587",LOCALSIGNATURE
> "OCoLC/637556711",LOCALSIGNATURE
> 
> 
> 
> second node file:
> brId:ID(bibliographicresource),active,:LABEL
> 146404300,true,BIBLIOGRAPHICRESOURCE
> 
> 
> relationship file
> :START_ID(bibliographicresource),:END_ID(localsignature),:TYPE
> 146404300,"NEBIS/002527587",SIGNATUREOF
> 146404300,"OCoLC/637556711",SIGNATUREOF
> 
> ./neo4j-import --into [path-to-db]/test.db/ --nodes files/br.csv --nodes 
> files/br.csv --relationships:SIGNATUREOF files/signatureof.csv
> which throws the exception 
> 
> Done in 191ms
> Prepare node index
> Exception in thread "Thread-3" 
> org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.DuplicateInputIdException:
>  Id '146404300' is defined more than once in bibliographicresource, at least 
> at /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 
> and /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
>     at 
> org.neo4j.unsafe.impl.batchimport.input.BadCollector$2.exception(BadCollector.java:107)
>     at 
> org.neo4j.unsafe.impl.batchimport.input.BadCollector.checkTolerance(BadCollector.java:176)
>     at 
> org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:96)
>     at 
> org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:590)
>     at 
> org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:494)
>     at 
> org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:282)
>     at 
> org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
>     at 
> org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
> Duplicate input ids that would otherwise clash can be put into separate id 
> space, read more about how to use id spaces in the manual: 
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> Caused by:Id '146404300' is defined more than once in bibliographicresource, 
> at least at 
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2 and 
> /home/swissbib/environment/tools/neo4j-community-2.3.2/bin/files/br.csv:2
> 
> 
> I can't see what I'm doing differently from the documentation at
> http://neo4j.com/docs/2.3.2/import-tool-header-format.html#import-tool-id-spaces
> because I did try to use the ID space notation (as far as I can see...)
> 


Code comments:

package org.swissbib.linked.mf.writer;


@Description("Transforms documents to a Neo4j graph.")
@In(StreamReceiver.class)
@Out(Void.class)
public class NeoIndexer extends DefaultStreamPipe<ObjectReceiver<String>> {

    private final static Logger LOG = LoggerFactory.getLogger(NeoIndexer.class);
    GraphDatabaseService graphDb;
    File dbDir;
    Node mainNode;
    Transaction tx;
    int batchSize;
    int counter = 0;
    boolean firstRecord = true;


    public void setBatchSize(String batchSize) {
        this.batchSize = Integer.parseInt(batchSize);
    }

    public void setDbDir(String dbDir) {
        this.dbDir = new File(dbDir);
    }

    @Override
    public void startRecord(String identifier) {

        if (firstRecord) {
// Is there no explicit onStartStream method in your API?
// Otherwise it might be better to pass the graphDb in via the constructor or a setter.
            graphDb = new GraphDatabaseFactory().newEmbeddedDatabase(dbDir);
            tx = graphDb.beginTx();
            graphDb.schema().indexFor(lsbLabels.PERSON).on("name");
            graphDb.schema().indexFor(lsbLabels.ORGANISATION).on("name");
            graphDb.schema().indexFor(lsbLabels.BIBLIOGRAPHICRESOURCE).on("name");
            graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name");
            graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name");
            tx.success();
// Misses a tx.close(): this is a schema tx, which can't be mixed with data transactions.
// Also, as long as this tx is not finished, the indexes and constraints will not be
// in place, so your lookups will be slow.
            firstRecord = false;
// Create a new tx after tx.close() (see the sketch after the class below).
        }


        counter += 1;
        LOG.debug("Working on record {}", identifier);
        if (identifier.contains("person")) {
            mainNode = createNode(lsbLabels.PERSON, identifier, false);
        } else if (identifier.contains("organisation")) {
            mainNode = createNode(lsbLabels.ORGANISATION, identifier, false);
        } else {
            mainNode = createNode(lsbLabels.BIBLIOGRAPHICRESOURCE, identifier, true);
        }

    }

    @Override
    public void endRecord() {
        tx.success();
        if (counter % batchSize == 0) {
            LOG.info("Commit batch upload ({} records processed so far)", 
counter);
            tx.close();
            tx = graphDb.beginTx();
        }
        super.endRecord();
    }

    @Override
    public void literal(String name, String value) {
        Node node;

        switch (name) {
            case "br":
                node = graphDb.findNode(lsbLabels.BIBLIOGRAPHICRESOURCE, "name", value);
                mainNode.createRelationshipTo(node, lsbRelations.CONTRIBUTOR);
                break;
            case "bf:local":
                node = createNode(lsbLabels.LOCALSIGNATURE, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.SIGNATUREOF);
                break;
            case "item":
                node = createNode(lsbLabels.ITEM, value, false);
                node.createRelationshipTo(mainNode, lsbRelations.ITEMOF);
                break;
        }
    }
// Naming of variables!
// You might consider using an :Active label instead, which is more efficient than a property;
// but it is good that you only set the property for the true value.

    private Node createNode(Label l, String v, boolean a) {
        Node n = graphDb.createNode(l);
        n.setProperty("name", v);
        if (a)
            n.setProperty("active", "true");
        return n;
    }

    @Override
    protected void onCloseStream() {
        LOG.info("Cleaning up (altogether {} records processed)", counter);
// Does this always happen after an endRecord()? Otherwise you need a tx.success() here.
        tx.close();
    }

    private enum lsbLabels implements Label {
        BIBLIOGRAPHICRESOURCE, PERSON, ORGANISATION, ITEM, LOCALSIGNATURE
    }

    public enum lsbRelations implements RelationshipType {
        CONTRIBUTOR, ITEMOF, SIGNATUREOF
    }
}
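
To make the schema comments concrete, here is a minimal sketch of how the schema setup could look when it is kept in its own transaction (it assumes the same graphDb field and lsbLabels enum as in your class, plus an import of java.util.concurrent.TimeUnit). Note that indexFor(...).on(...) and constraintFor(...).assertPropertyIsUnique(...) only build creators; you have to call .create() on them, and since a uniqueness constraint is backed by an index anyway, the separate indexFor calls on the same label/property can be dropped:

    private void initSchema() {
        // schema operations must live in their own tx, separate from any data tx
        try (Transaction schemaTx = graphDb.beginTx()) {
            graphDb.schema().constraintFor(lsbLabels.PERSON).assertPropertyIsUnique("name").create();
            graphDb.schema().constraintFor(lsbLabels.ORGANISATION).assertPropertyIsUnique("name").create();
            graphDb.schema().constraintFor(lsbLabels.BIBLIOGRAPHICRESOURCE).assertPropertyIsUnique("name").create();
            graphDb.schema().constraintFor(lsbLabels.ITEM).assertPropertyIsUnique("name").create();
            graphDb.schema().constraintFor(lsbLabels.LOCALSIGNATURE).assertPropertyIsUnique("name").create();
            schemaTx.success();
        } // the schema tx is closed here, before the first data tx is opened

        // optionally block until the backing indexes are online, so lookups are fast from the start
        try (Transaction waitTx = graphDb.beginTx()) {
            graphDb.schema().awaitIndexesOnline(1, TimeUnit.MINUTES);
            waitTx.success();
        }
    }

In startRecord() you would then call initSchema() once for the first record and only afterwards open the data tx with tx = graphDb.beginTx().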


> Thanks for any hints!
> 
> Günter
> 
> 
