I have an OrientDB application using the Java (Tinkerpop) API against OrientDB 2.2.3 running on Linux. While importing data from MediaWiki XML dumps (e.g., Wikipedia), I need to do an INSERT-IF-NOT-EXISTS type operation. I ran into an issue with this earlier, and after some help from Luca it appeared to be working beautifully.
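To make the operation concrete, here is a minimal sketch of the INSERT-IF-NOT-EXISTS pattern. It is not the actual import code: an in-memory map stands in for the unique index on the identifier, and the names VertexRegistry and getOrCreate are hypothetical, chosen only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the graph plus its unique index on "identifier".
// The real application uses the Tinkerpop API against OrientDB instead.
class VertexRegistry {
    // identifier (a URI) -> id of the vertex created for it
    private final Map<String, Integer> index = new HashMap<>();
    private int nextId = 0;

    // Returns the id of the existing vertex for this identifier,
    // or creates a new vertex if none exists yet.
    int getOrCreate(String identifier) {
        Integer existing = index.get(identifier);
        if (existing != null) {
            return existing;
        }
        int id = nextId++;
        index.put(identifier, id);
        return id;
    }

    // Number of distinct vertices created so far.
    int size() {
        return index.size();
    }

    public static void main(String[] args) {
        VertexRegistry registry = new VertexRegistry();
        int first = registry.getOrCreate("https://en.wikipedia.org/wiki/OrientDB");
        int again = registry.getOrCreate("https://en.wikipedia.org/wiki/OrientDB");
        // The second call finds the existing entry instead of inserting a duplicate.
        System.out.println(first == again);
        System.out.println(registry.size());
    }
}
```

The lookup and the insert here compare keys in exactly the same way (case-sensitively, in this sketch), so a duplicate can never be inserted.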
After working with the data for a little while, I discovered that my identifiers (which are actually URIs, like https://en.wikipedia.org/wiki/OrientDB) need to be indexed in a case-insensitive manner, because MediaWiki is inconsistent (or "flexible," depending on your perspective) about casing in URIs. So, I modified the creation of my UNIQUE_HASH_MAP index to include the "collation=ci" parameter.

Now, as I am parsing the MediaWiki XML and loading the pages (as vertices) and links (as edges) into an OrientDB-based graph, I get an ORecordDuplicatedException:

com.orientechnologies.orient.core.storage.ORecordDuplicatedException: Cannot index record #124:14: found duplicated key 'https://en.wikipedia.org/wiki/fédération_anarchiste' in index 'Identifier.identifier' previously assigned to the record #109:13
    DB name="kb"
    INDEX=Identifier.identifier RID=#109:13
    at com.orientechnologies.orient.core.index.OIndexUnique.put(OIndexUnique.java:64)
    at com.orientechnologies.orient.core.index.OIndexUnique.put(OIndexUnique.java:34)
    at com.orientechnologies.orient.core.index.OIndexAbstract.putInSnapshot(OIndexAbstract.java:911)
    at com.orientechnologies.orient.core.index.OIndexAbstract.applyIndexTxEntry(OIndexAbstract.java:756)
    at com.orientechnologies.orient.core.index.OIndexAbstract.addTxOperation(OIndexAbstract.java:729)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commitIndexes(OAbstractPaginatedStorage.java:1387)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1348)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:555)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:109)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2665)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.executeOutsideTx(OrientBaseGraph.java:1824)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1481)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1424)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1395)
    at com.tinkerpop.blueprints.impls.orient.OrientGraph.addEdgeInternal(OrientGraph.java:318)
    at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:717)
    at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:656)

Here are some experiments that I have conducted and their results:

1) If I remove the "collation=ci" parameter from the index, the exception does not occur. Of course, the identifiers then become case-sensitive and I end up with multiple vertices for different casings of the same URI.

2) Although my example above shows a URI with accented characters, that appears to be coincidental. It is just the first URI in my data set that this problem happens to occur with. I have written unit tests around URIs containing accented characters (technically IRIs) and they all pass.

3) The results are exactly the same if I try this with the SB-Tree index (index type=UNIQUE instead of UNIQUE_HASH_MAP, collation=ci).

I would strongly prefer not to simply case-fold the URIs, because https://en.wikipedia.org/wiki/OrientDB is the correct, canonical English Wikipedia URI for OrientDB; https://en.wikipedia.org/orientdb is not.

Any suggestions on this? Am I doing something wrong? Is it a bug? Is there a work-around?

-- John
