[
https://issues.apache.org/jira/browse/JENA-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15994082#comment-15994082
]
ASF GitHub Bot commented on JENA-1313:
--------------------------------------
Github user kinow commented on the issue:
https://github.com/apache/jena/pull/237
@osma,
> Did you by any chance test this with the performance test case that I
> wrote up earlier? I'd like to know how it compares to a plain ORDER BY in terms
> of performance. I can test that myself too when I have a suitable slot of time,
> but that might take a while since many deadlines are coming up in the next few
> days...
Good reminder. I updated my sandbox to include [a JMH
test](https://github.com/kinow/jena-arq-filter/blob/master/src/test/java/br/eti/kinoshita/jena/ArqOrderByTest.java#L24).
The initial version measured average time. Here are the results:
```
Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByCollation":
  3058822481.830 ±(99.9%) 51737778.554 ns/op [Average]
  (min, avg, max) = (2669383311.000, 3058822481.830, 3841515554.000), stdev = 219060994.044
  CI (99.9%): [3007084703.276, 3110560260.384] (assumes normal distribution)

Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByLang":
  3017545546.455 ±(99.9%) 47500397.951 ns/op [Average]
  (min, avg, max) = (2646169688.000, 3017545546.455, 3499258012.000), stdev = 201119659.239
  CI (99.9%): [2970045148.504, 3065045944.406] (assumes normal distribution)

# Run complete. Total time: 00:41:33

Benchmark                            Mode  Cnt           Score          Error  Units
ArqOrderByTest.testOrderByCollation  avgt  200  3058822481.830 ± 51737778.554  ns/op
ArqOrderByTest.testOrderByLang       avgt  200  3017545546.455 ± 47500397.951  ns/op
```
I then changed the benchmark to measure throughput instead:
```
Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByCollation":
  ≈ 10⁻⁹ ops/ns

Result "br.eti.kinoshita.jena.ArqOrderByTest.testOrderByLang":
  ≈ 10⁻⁹ ops/ns

Benchmark                            Mode  Cnt   Score  Error  Units
ArqOrderByTest.testOrderByCollation  thrpt  200  ≈ 10⁻⁹         ops/ns
ArqOrderByTest.testOrderByLang       thrpt  200  ≈ 10⁻⁹         ops/ns
```
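Throughput shows no difference at this resolution: both tests round to
≈ 10⁻⁹ ops/ns. In the average-time run, the minimum was about the same
(2.67 s vs. 2.65 s), while the mean (3.06 s vs. 3.02 s, roughly 1.4%) and the
maximum (3.84 s vs. 3.50 s) were slightly higher with collation. The 99.9%
confidence intervals of the two means overlap, and I think the overhead won't
really be noticeable for end users.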
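The throughput figures are hard to compare because `≈ 10⁻⁹ ops/ns` only gives an order of magnitude. As a rough sketch, the average times reported above can be converted to operations per second with plain arithmetic. The class and method names below are made up for illustration and are not part of the benchmark project.

```java
// Hypothetical helper, not part of the JMH test: converts the avgt scores
// from the report above into a more readable throughput unit.
final class ThroughputFromAvgTime {

    /** Converts an average time in ns/op into throughput in ops/s. */
    static double opsPerSecond(double nanosPerOp) {
        return 1_000_000_000.0 / nanosPerOp;
    }

    /** Relative slowdown of 'slower' against 'faster', in percent. */
    static double relativeSlowdownPercent(double slowerNanos, double fasterNanos) {
        return (slowerNanos / fasterNanos - 1.0) * 100.0;
    }

    public static void main(String[] args) {
        double collation = 3058822481.830; // ns/op, testOrderByCollation
        double lang = 3017545546.455;      // ns/op, testOrderByLang

        System.out.printf("collation: %.4f ops/s%n", opsPerSecond(collation));
        System.out.printf("lang:      %.4f ops/s%n", opsPerSecond(lang));
        System.out.printf("slowdown:  %.2f%%%n",
                relativeSlowdownPercent(collation, lang));
    }
}
```

Both tests come out at roughly 0.33 ops/s, with the collation variant about 1.4% slower on average.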
> Language-specific collation in ARQ
> ----------------------------------
>
> Key: JENA-1313
> URL: https://issues.apache.org/jira/browse/JENA-1313
> Project: Apache Jena
> Issue Type: Improvement
> Components: ARQ
> Affects Versions: Jena 3.2.0
> Reporter: Osma Suominen
>
> As [discussed|http://markmail.org/message/v2bvsnsza5ksl2cv] on the users
> mailing list in October 2016, I would like to change ARQ collation of literal
> values to be language-aware and respect language-specific collation rules.
> This would probably involve changing at least the
> [NodeUtils.compareLiteralsBySyntax|https://github.com/apache/jena/blob/master/jena-arq/src/main/java/org/apache/jena/sparql/util/NodeUtils.java#L199]
> method.
> It currently sorts by lexical value first, then by language tag. Since the
> collation order needs to be stable across all possible literal values, I
> think the safest way would be to sort by language tag first, then by lexical
> value according to the collation rules for that language.
> But what about subtags like {{@en-US}} or {{@pt-BR}}? Can they have different
> collation rules than the main language? It would be a bit strange if all
> {{@en-US}} literals sorted after {{@en}} literals...
> It would be good to check how Dydra does this and possibly take the same
> approach. See the message linked above for further background.
> I've been talking with [~kinow] about this and he may be interested in
> implementing it.
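The ordering proposed in the issue description (language tag first, then lexical value under that language's collation rules) can be sketched with the JDK's `java.text.Collator`. This is only an illustration under stated assumptions: `LangLiteral` is a hypothetical stand-in rather than a Jena class, the code is not the implementation in the pull request, and which locale-specific rules actually apply depends on the locale data shipped with the JDK.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

final class LangCollationSketch {

    /** Hypothetical stand-in for a language-tagged literal; not a Jena class. */
    static final class LangLiteral {
        final String lexical;
        final String lang;

        LangLiteral(String lexical, String lang) {
            this.lexical = lexical;
            this.lang = lang;
        }

        @Override
        public String toString() {
            return "\"" + lexical + "\"@" + lang;
        }
    }

    /** Orders by language tag first, then by the collation rules of that language. */
    static final Comparator<LangLiteral> BY_LANG_THEN_COLLATION = (a, b) -> {
        int byTag = a.lang.compareToIgnoreCase(b.lang);
        if (byTag != 0) {
            return byTag;
        }
        // Tags match (ignoring case), so one collator serves both literals.
        Collator collator = Collator.getInstance(Locale.forLanguageTag(a.lang));
        int byRules = collator.compare(a.lexical, b.lexical);
        // Tie-break on plain string order so the result stays deterministic.
        return byRules != 0 ? byRules : a.lexical.compareTo(b.lexical);
    };

    /** Returns a sorted copy, leaving the input list untouched. */
    static List<LangLiteral> sorted(List<LangLiteral> input) {
        List<LangLiteral> copy = new ArrayList<>(input);
        copy.sort(BY_LANG_THEN_COLLATION);
        return copy;
    }

    public static void main(String[] args) {
        List<LangLiteral> literals = List.of(
                new LangLiteral("zebra", "en"),
                new LangLiteral("Banana", "en"),
                new LangLiteral("aamu", "fi"),
                new LangLiteral("apple", "en"),
                new LangLiteral("color", "en-US"));
        sorted(literals).forEach(System.out::println);
    }
}
```

With this comparator, "apple" sorts before "Banana" within `@en`, which plain code unit order would not do. The `@en-US` literal lands after every `@en` literal, which is exactly the subtag behaviour the description questions.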