I have an OrientDB application using the Java (Tinkerpop) API against 
OrientDB 2.2.3 running on Linux. While importing data from MediaWiki XML 
dumps (e.g., Wikipedia), I need to do an INSERT-IF-NOT-EXISTS type 
operation. I ran into an issue with this earlier and, after some help from 
Luca, it appeared to be working beautifully.

After working with the data for a little while, I discovered that my 
identifiers (which are actually URIs, like 
https://en.wikipedia.org/wiki/OrientDB) need to be indexed in a 
case-insensitive manner, because MediaWiki is inconsistent (or "flexible," 
depending on your perspective) about casing in URIs.
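
To make the casing problem concrete, here is a minimal Java sketch. It does 
not touch OrientDB at all: the class and method names are made up for 
illustration, and String.CASE_INSENSITIVE_ORDER is only a stand-in for 
whatever comparison a case-insensitive index actually performs.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of my application or of OrientDB.
final class CasingDemo {
    // Number of distinct keys when casing is significant.
    static int distinctCaseSensitive(List<String> uris) {
        return new HashSet<>(uris).size();
    }

    // Number of distinct keys when casing is ignored.
    static int distinctCaseInsensitive(List<String> uris) {
        Set<String> keys = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
        keys.addAll(uris);
        return keys.size();
    }

    public static void main(String[] args) {
        List<String> uris = Arrays.asList(
            "https://en.wikipedia.org/wiki/OrientDB",
            "https://en.wikipedia.org/wiki/orientdb");
        System.out.println(distinctCaseSensitive(uris));   // prints 2
        System.out.println(distinctCaseInsensitive(uris)); // prints 1
    }
}
```

With case-sensitive keys, the two spellings would become two vertices; with 
case-insensitive keys they collapse into one, which is the behavior I want.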

So, I modified the creation of my UNIQUE_HASH_INDEX index to include the 
"collation=ci" parameter.

Now, as I am parsing the MediaWiki XML and loading the pages (as vertices) 
and links (as edges) into an OrientDB-based graph, I get an 
ORecordDuplicatedException:

com.orientechnologies.orient.core.storage.ORecordDuplicatedException: 
Cannot index record #124:14: found duplicated key 
'https://en.wikipedia.org/wiki/fédération_anarchiste' in index 
'Identifier.identifier' previously assigned to the record #109:13
DB name="kb"INDEX=Identifier.identifier RID=#109:13
    at com.orientechnologies.orient.core.index.OIndexUnique.put(OIndexUnique.java:64)
    at com.orientechnologies.orient.core.index.OIndexUnique.put(OIndexUnique.java:34)
    at com.orientechnologies.orient.core.index.OIndexAbstract.putInSnapshot(OIndexAbstract.java:911)
    at com.orientechnologies.orient.core.index.OIndexAbstract.applyIndexTxEntry(OIndexAbstract.java:756)
    at com.orientechnologies.orient.core.index.OIndexAbstract.addTxOperation(OIndexAbstract.java:729)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commitIndexes(OAbstractPaginatedStorage.java:1387)
    at com.orientechnologies.orient.core.storage.impl.local.OAbstractPaginatedStorage.commit(OAbstractPaginatedStorage.java:1348)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.doCommit(OTransactionOptimistic.java:555)
    at com.orientechnologies.orient.core.tx.OTransactionOptimistic.commit(OTransactionOptimistic.java:109)
    at com.orientechnologies.orient.core.db.document.ODatabaseDocumentTx.commit(ODatabaseDocumentTx.java:2665)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.executeOutsideTx(OrientBaseGraph.java:1824)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1481)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1424)
    at com.tinkerpop.blueprints.impls.orient.OrientBaseGraph.createEdgeType(OrientBaseGraph.java:1395)
    at com.tinkerpop.blueprints.impls.orient.OrientGraph.addEdgeInternal(OrientGraph.java:318)
    at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:717)
    at com.tinkerpop.blueprints.impls.orient.OrientVertex.addEdge(OrientVertex.java:656)

Here are some experiments that I have conducted and their results:

1) If I remove the "collation=ci" parameter from the index, the exception 
does not occur. Of course, the identifiers then become case-sensitive and I 
end up with multiple vertices for different casings of the same URI.

2) Although my example above shows a URI with accented characters, that 
appears to be coincidental. It is just the first URI in my data set that 
triggers the problem. I have written unit tests around URIs containing 
accented characters (technically IRIs) and they all pass.

3) The results are exactly the same if I try this with the SB-Tree index 
(index type=UNIQUE instead of UNIQUE_HASH_INDEX, collation=ci).

I would strongly prefer not to simply case-fold the URIs, because 
https://en.wikipedia.org/wiki/OrientDB is the correct, canonical English 
Wikipedia URI for OrientDB, whereas https://en.wikipedia.org/wiki/orientdb 
is not.
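
The semantics I am after can be sketched in plain Java as a get-or-create 
that matches keys case-insensitively but stores the original spelling. This 
is only an in-memory illustration under my own assumptions: the 
CanonicalRegistry class and its methods are hypothetical, and it says nothing 
about how OrientDB implements its indexes.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory analogue of INSERT-IF-NOT-EXISTS on a
// case-insensitive unique key. Not OrientDB code.
final class CanonicalRegistry {
    // Keys are compared ignoring case; values keep the spelling stored first.
    private final Map<String, String> spellingByUri =
        new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

    // Returns the stored spelling for this URI, inserting it if absent.
    String getOrCreate(String uri) {
        return spellingByUri.computeIfAbsent(uri, u -> u);
    }

    int size() {
        return spellingByUri.size();
    }

    public static void main(String[] args) {
        CanonicalRegistry registry = new CanonicalRegistry();
        System.out.println(registry.getOrCreate("https://en.wikipedia.org/wiki/OrientDB"));
        // The second call matches the existing entry and returns its spelling.
        System.out.println(registry.getOrCreate("https://en.wikipedia.org/wiki/orientdb"));
        System.out.println(registry.size());
    }
}
```

Note that this keeps whichever spelling arrives first, which is the canonical 
one only if the canonical form happens to be loaded first.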

Any suggestions on this? Am I doing something wrong? Is it a bug? Is there 
a work-around?

-- John
