[orientdb] Duplicate Documents Inserted by OrientDB

Jake Tue, 19 May 2015 11:00:08 -0700

Hello,

For context: a while ago I was testing OrientDB for my workplace by 
inserting 10 million documents into a database. Each document contained a 
'hash' field of type String. At first I had a unique index on that 'hash' 
field, but I changed it to a nonunique index when I ran into an odd 
situation.


Assumptions:
- The client used a remote connection to the server
- Only 1 server was used, but it was configured for a distributed cluster

The first time I ran this test, with the unique index on the 'hash' field, 
I got some collision exceptions. I knew that shouldn't have been possible, 
since I was using SHA256 hashes and the inputs to the hash algorithm were 
unique. After switching to a nonunique index on the 'hash' field, I saw 
several quorum exceptions thrown during the insertions. My write quorum is 
set to 1, though, and there is only 1 node in the cluster.

After using the console to look at the database, I noticed there were 7 
extra Hash documents in it. I inserted 10 million Hash documents starting 
with an empty database, but 10,000,007 were listed when I used the 'info' 
command. The number of duplicate documents varied each time I ran this 
test. I later added the input to the SHA256 hash as a field on the 
documents and wrote some Java code to check whether these extra documents 
were duplicates. They were. I know for sure that I did not insert them 
into the database.
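
A duplicate check of the kind described above can be sketched roughly as 
follows. This is not the code that was actually run; it assumes the 'index' 
values have already been read back from the database into a list, and the 
class and method names (DuplicateCheck, findDuplicates) are made up for 
illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DuplicateCheck {
    // Returns every value that occurs more than once, mapped to how many
    // times it occurs. An empty result means no duplicates were found.
    public static Map<Integer, Integer> findDuplicates(List<Integer> indexValues) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (Integer value : indexValues) {
            counts.merge(value, 1, Integer::sum);
        }
        // TreeMap keeps the report sorted by index value.
        Map<Integer, Integer> duplicates = new TreeMap<>();
        for (Map.Entry<Integer, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > 1) {
                duplicates.put(entry.getKey(), entry.getValue());
            }
        }
        return duplicates;
    }

    public static void main(String[] args) {
        // Stand-in for values read back from the Hash class:
        // 0..9 inserted once each, with 3 and 7 showing up a second time.
        List<Integer> readBack = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            readBack.add(i);
        }
        readBack.add(3);
        readBack.add(7);
        System.out.println(findDuplicates(readBack)); // prints {3=2, 7=2}
    }
}
```

For 10 million documents the counts map fits in memory on a typical 
workstation, but the values would need to be streamed from the database 
rather than loaded as one list.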

Does anybody have any idea what the problem could be? I'm thinking it has 
to do with the fact that my server is set up to be distributed but doesn't 
have any other servers in the cluster to talk to. I'm not entirely sure, 
though. All I know is that my boss doesn't like that OrientDB is inserting 
duplicate data into the database. I'm sure a unique index would solve the 
problem, but it makes him feel like he can't trust OrientDB to keep an 
accurate record of the data we insert. The data we're planning on storing 
in OrientDB will have to be accurate, and duplicate data might be bad for 
our application.

Thanks for reading!

My OrientDB server uses the following distributed configuration file:

{
    "autoDeploy": true,
    "hotAlignment": false,
    "executionMode": "undefined",
    "readQuorum": 1,
    "writeQuorum": 1,
    "failureAvailableNodesLessQuorum": false,
    "readYourWrites": true,
    "clusters": {
        "internal": {
        },
        "index": {
        },
        "*": {
            "servers" : ["<NEW_NODE>"]
        }
    }
}

Here's a simplified version of my program that inserted the documents:

//This created the schema for the Hash document
String database = "remote:127.0.0.1/hashdb";
OPartitionedDatabasePool pool = new OPartitionedDatabasePool(database, "root", "asdf123$");
ODatabaseDocumentTx db                  = pool.acquire();

try {
    OSchemaProxy schema = db.getMetadata().getSchema();
    OClass strHash = schema.createClass("Hash");
    strHash.createProperty("hash", OType.STRING).setMandatory(true).setNotNull(true);
    strHash.createProperty("index", OType.INTEGER).setMandatory(true).setNotNull(true);
    strHash.createIndex("hash", OClass.INDEX_TYPE.NOTUNIQUE, "hash");

    /*
    for (int i=1; i<16; i++) {
        strHash.createProperty(String.format("field%d", i), OType.STRING);
    }*/

    schema.save();
} catch(Exception e) {
    e.printStackTrace();
}
db.close();
pool.close();

//This part was used for the insertion of the Hash documents.
OPartitionedDatabasePool pool = new OPartitionedDatabasePool("remote:127.0.0.1/hashdb", "root", "asdf123$");
  
for (int i=0; i<TOTAL_HASHES; i++) {
  ODatabaseDocumentTx db = pool.acquire();
  try {
      ODocument hash = new ODocument("Hash");

      String hashInput = getSHAHash(i);

      hash.field("hash", hashInput);
      hash.field("index", i);
      hash.save();
  } catch (Exception e) {
      e.printStackTrace();
      db.rollback();
      continue;
  } finally {
      db.commit();
      db.close();
  }
}

pool.close();
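
The getSHAHash helper called in the loop is not shown in this simplified 
version. A hypothetical stand-in is sketched below, assuming the input is 
simply the decimal string of the loop counter and the result is a lowercase 
hex string; the real helper may have differed. The class name ShaHashSketch 
is made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class ShaHashSketch {
    // Hypothetical stand-in for getSHAHash(int): hashes the decimal string
    // form of i with SHA-256 and returns 64 lowercase hex characters.
    public static String getSHAHash(int i) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(Integer.toString(i).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(getSHAHash(0));
        // Distinct inputs give distinct hashes.
        System.out.println(getSHAHash(0).equals(getSHAHash(1))); // prints false
    }
}
```

With unique inputs like these, a collision on a unique index would point to 
the same document being written twice rather than to two inputs sharing a 
hash.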
