JENA-2333 reports a problem with sorting where the JDK sort (Arrays.sort) throws an exception. The problem is that sorting requires a total order, in which the transitivity condition "A < B, B < C => A < C" must hold.

There are two comparison types: "by value" and "by term". "By term" comparison is done by comparing lexical forms, datatypes and language tags.

However, when terms to be sorted are of different kinds, a mixture of "by value" and "by term" can arise and the transitive relation of a total order is broken.

Sometimes, Arrays.sort notices this and throws an exception. Whether the exception occurs depends on the initial order of the array to be sorted.

This is old code that hasn't changed in quite some time. Presumably, it happens infrequently, given the lack of reports until now. It took quite a lot of randomized testing to narrow down to a consistent test case.
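
To illustrate how mixing the two comparison styles breaks transitivity, here is a minimal sketch. It is not Jena's comparator: `MixedCompareDemo`, `MIXED`, `parse` and `less` are hypothetical names, and plain strings stand in for RDF terms. Two strings are compared "by value" when both parse as integers and "by term" (lexical form) otherwise.

```java
import java.util.Comparator;

class MixedCompareDemo {
    // Hypothetical stand-in, not Jena's comparator: compare by value when both
    // strings parse as integers, otherwise fall back to the lexical form.
    static final Comparator<String> MIXED = (a, b) -> {
        Integer x = parse(a);
        Integer y = parse(b);
        if (x != null && y != null) {
            return Integer.compare(x, y);   // "by value"
        }
        return a.compareTo(b);              // "by term"
    };

    // Returns the integer value, or null if the string is not an integer.
    static Integer parse(String s) {
        try {
            return Integer.valueOf(s);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    static boolean less(String a, String b) {
        return MIXED.compare(a, b) < 0;
    }

    public static void main(String[] args) {
        System.out.println(less("9", "10"));   // by value: 9 < 10
        System.out.println(less("10", "1a"));  // by term: "10" < "1a"
        System.out.println(less("1a", "9"));   // by term: "1a" < "9", closing the cycle
    }
}
```

All three lines print true, so "9" < "10" < "1a" < "9" and no consistent order exists. When a JDK sort detects such an inconsistency during a merge it throws IllegalArgumentException ("Comparison method violates its general contract!"), but detection is not guaranteed, which fits the observation that the failure depends on the input.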

Any fix is application-visible.

Currently, by lexical terms:

"a"@en < "b"@de < "c"@en

but when two @en are sorted together (value sorting extended to lang strings) we can get:

"b"@de < "a"@en < "c"@en

Currently, either order can occur, or an exception; it depends on the size and the initial order of the array.

The proposed change is PR#1406. It changes the "by term" sorting to be lang tag sensitive for RDF terms so the sorting is the same for "by value" and "by term". Various tests change. The example above then sorts as:

"b"@de, "a"@en, "c"@en

because "de" < "en" as strings.

Another proposal is PR#1404, which changes lang tag sorting to be "by lexical form" always. While this means fewer test changes, "same language" terms are no longer grouped together:

"a"@en , "b"@de , "c"@en ,....

Behaviour of other triple stores isn't uniform.
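
The two proposals can be contrasted with a small sketch. This is an illustration only, not the code in either PR: `Lit`, `BY_LANG_THEN_LEX` and `BY_LEX_THEN_LANG` are hypothetical names, and using the language tag as a tie-breaker in the second comparator is my assumption.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LangSortDemo {
    // Hypothetical minimal model of a language-tagged literal.
    static final class Lit {
        final String lex;
        final String lang;
        Lit(String lex, String lang) { this.lex = lex; this.lang = lang; }
        @Override public String toString() { return "\"" + lex + "\"@" + lang; }
    }

    // PR#1406 style: language tag first, then lexical form.
    static final Comparator<Lit> BY_LANG_THEN_LEX =
        Comparator.comparing((Lit l) -> l.lang).thenComparing(l -> l.lex);

    // PR#1404 style: lexical form first; the tag only breaks ties (assumption).
    static final Comparator<Lit> BY_LEX_THEN_LANG =
        Comparator.comparing((Lit l) -> l.lex).thenComparing(l -> l.lang);

    // Sorts the three example literals with the given order.
    static String sorted(Comparator<Lit> order) {
        List<Lit> lits = new ArrayList<>(List.of(
            new Lit("c", "en"), new Lit("a", "en"), new Lit("b", "de")));
        lits.sort(order);
        return lits.toString();
    }

    public static void main(String[] args) {
        System.out.println(sorted(BY_LANG_THEN_LEX)); // ["b"@de, "a"@en, "c"@en]
        System.out.println(sorted(BY_LEX_THEN_LANG)); // ["a"@en, "b"@de, "c"@en]
    }
}
```

Both comparators are total orders, so either removes the exception; they differ only in whether terms with the same language end up adjacent.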


Proposal: PR#1406 - like language tags are grouped together. Sorting for literals is:

  strings (xsd:string) < lang string (rdf:langString) < other datatypes.

which is compatible with both RDF 1.0 and RDF 1.1 and, I would say, least surprising.

PR#1406 does not change the sorting of other datatypes.
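
As a rough sketch of the literal ordering above, assuming a toy textual form for literals: `LiteralKindDemo`, `Kind` and `kindOf` are hypothetical and not a Jena API. Only the rank between kinds is modelled; ties within a kind keep their input order here, since the ordering inside "other datatypes" is left unchanged by the PR.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LiteralKindDemo {
    // Declaration order gives the rank: STRING < LANG_STRING < OTHER.
    enum Kind { STRING, LANG_STRING, OTHER }

    // Hypothetical classifier over a toy textual form, not a Jena API.
    static Kind kindOf(String literal) {
        if (literal.contains("^^")) return Kind.OTHER;        // typed literal
        if (literal.contains("\"@")) return Kind.LANG_STRING; // language-tagged
        return Kind.STRING;                                    // plain string
    }

    static final Comparator<String> BY_KIND =
        Comparator.comparing(LiteralKindDemo::kindOf);

    static List<String> sorted(List<String> input) {
        List<String> out = new ArrayList<>(input);
        out.sort(BY_KIND); // List.sort is stable: ties keep their input order
        return out;
    }

    public static void main(String[] args) {
        // Prints ["z", "a"@en, "1"^^xsd:integer]
        System.out.println(sorted(List.of("\"1\"^^xsd:integer", "\"a\"@en", "\"z\"")));
    }
}
```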

    Andy

FYI: comparing "other datatypes" by datatype URI then by lexical form does not work (it nearly does, but not quite, for XSD datatypes).

It is beginning to look like there is not much choice given the partial conditions in the spec for the case of other datatypes.
