[orientdb] De-duplication is very slow?

Sven Hodapp Tue, 21 Apr 2015 08:04:02 -0700

Hi all,

For my project I want to import a lot of data into the database, but the 
data should be de-duplicated.


First of all, without de-duplication the insertion takes about 500 ms (for a 
minimal test set):

    docelem = graph.addVertex("class:DocElem", "uri", uri, "type", type, "model", model);
    docelem.setProperties(attrs);
    graph.commit();


Now I *don't* want the same data to be uploaded twice, so I check whether 
it's already in the database like this:


    Iterable<Vertex> iter = graph.query()
        .has("uri", Compare.EQUAL, uri)
        .limit(1)
        .vertices();


Then I check with iter.iterator().hasNext() whether the vertex is already 
in the database. But this is dead slow (even though uri is indexed, unless 
I've made a mistake)! Now the insertion takes about 15 s.


Can you suggest a better solution? Ideally I wouldn't have to query the 
database myself: the database would recognize that the requested uri is 
already inserted and just update the entry, or something like that!
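
To illustrate the "don't call the database" idea: within a single import run, 
the lookup could be replaced by a client-side set of URIs that were already 
handled. This is only a minimal sketch in plain Java, assuming all URIs of one 
run fit in memory; UriDeduplicator and firstSeen are hypothetical names, not 
part of OrientDB or Blueprints.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: remembers which URIs were already handled in this import run. */
class UriDeduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true only the first time a given uri is passed in. */
    boolean firstSeen(String uri) {
        // Set.add returns false if the element was already present.
        return seen.add(uri);
    }

    /** Number of distinct URIs seen so far. */
    int size() {
        return seen.size();
    }
}
```

The import loop would call firstSeen(uri) and only run graph.addVertex(...) 
when it returns true. This catches duplicates inside one run only; vertices 
that are already stored from an earlier import would still need a check on the 
database side.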


Note 1: With println instead of the db insert, the code needs about 50 ms to 
fetch/create the data. Is it possible to go faster?

Note 2: I'm using OrientDB 2.1-rc1 with a remote connection (on the same 
host).


Regards,

Sven
