Hi Ivan,

I think this is an interesting addition to explore for the platform. Sorting this way is pretty subtle, and the need for it comes up frequently, so it seems valuable. There are some issues that warrant further discussion, though. Briefly:

1. Should this be in the JDK?
2. What do other platforms do?
3. Does it have the right semantics?

Discussion follows.


--


1. Should this be in the JDK?

I think a case for it can be made. It does appear in other platforms (see below) and there are also several third party implementations available in a variety of environments. So people do have a need for this feature. It's also complicated enough to have generated lots of discussions and articles on the topic. The questions are whether this can be specified sufficiently clearly, and whether it provides value for the use cases for which it's intended. It's not obvious whether this is true, but I believe a case can and should be made.


2. What do other platforms do?

It was a bit difficult to find information about this, since it doesn't seem to have a well established name. Words like "natural", "logical", "alphanum", and "mixed" tend to be used. I eventually found these:

Windows XP StrCmpLogicalW [1]:

    Compares two Unicode strings. Digits in the strings are considered as
    numerical content rather than text. This test is not case-sensitive.

Windows 7 CompareStringEx SORT_DIGITSASNUMBERS [2]:

    Treat digits as numbers during sorting, for example, sort "2" before "10".

    (Note: this API takes a locale parameter.)

Mac NSString localizedStandardCompare [3]:

    This method should be used whenever file names or other strings are
    presented in lists and tables where Finder-like sorting is appropriate.
    The exact sorting behavior of this method is different under different
    locales and may be changed in future releases. This method uses the
    current locale.

    (Note: I observe that the Mac Finder sorting is case insensitive.)

Swift String.localizedStandardCompare [4]:

    Compares strings as sorted by the Finder.

There are also third party, open source implementations available for a variety of platforms. These aren't too hard to find; this Coding Horror article [5] has a discussion of the issues and links to several implementations. Of particular note is the short Python implementation embedded in the article.
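
To make the behavior under discussion concrete, here is a minimal Java sketch of the general idea those implementations share: runs of digits are compared by numeric value, everything else character by character. This is my own illustration, not the code from the webrev or from any of the linked libraries. The class name NaturalOrderSketch, the restriction to ASCII digits, and the tie-breaking rule for leading zeros are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: digit runs compare by value, other chars by code unit.
final class NaturalOrderSketch {

    static final Comparator<String> NATURAL = NaturalOrderSketch::compare;

    static int compare(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() && j < b.length()) {
            char ca = a.charAt(i), cb = b.charAt(j);
            if (isDigit(ca) && isDigit(cb)) {
                // Both sides start a digit run: compare the runs as numbers.
                int iEnd = digitRunEnd(a, i), jEnd = digitRunEnd(b, j);
                int c = compareDigitRuns(a, i, iEnd, b, j, jEnd);
                if (c != 0) return c;
                i = iEnd;
                j = jEnd;
            } else {
                if (ca != cb) return Character.compare(ca, cb);
                i++;
                j++;
            }
        }
        // A string that is a prefix of the other sorts first.
        int c = Integer.compare(a.length() - i, b.length() - j);
        // Assumption: numerically equal strings ("01" vs "1") fall back to
        // plain String order so that compare() is 0 only for equal strings.
        return c != 0 ? c : a.compareTo(b);
    }

    // Assumption: only ASCII digits count; Unicode digits are treated as text.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    private static int digitRunEnd(String s, int start) {
        int k = start;
        while (k < s.length() && isDigit(s.charAt(k))) k++;
        return k;
    }

    // Compares two digit runs by value without parsing, so length is unbounded.
    private static int compareDigitRuns(String a, int aStart, int aEnd,
                                        String b, int bStart, int bEnd) {
        while (aStart < aEnd - 1 && a.charAt(aStart) == '0') aStart++;
        while (bStart < bEnd - 1 && b.charAt(bStart) == '0') bStart++;
        int lenDiff = (aEnd - aStart) - (bEnd - bStart);
        if (lenDiff != 0) return lenDiff;
        for (int k = 0; k < aEnd - aStart; k++) {
            int c = Character.compare(a.charAt(aStart + k), b.charAt(bStart + k));
            if (c != 0) return c;
        }
        return 0;
    }

    public static void main(String[] args) {
        List<String> names =
            new ArrayList<>(Arrays.asList("file10.txt", "file2.txt", "file1.txt"));
        names.sort(Comparator.naturalOrder());
        System.out.println(names); // [file1.txt, file10.txt, file2.txt]
        names.sort(NATURAL);
        System.out.println(names); // [file1.txt, file2.txt, file10.txt]
    }
}
```

Note that this sketch is case-sensitive and locale-blind, which is exactly the limitation discussed in section 3.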

There is also the Node package javascript-natural-sort [6] which is one of several (of course) similar packages on NPM. This one seems popular, with more than 200,000 downloads in the past month.

Finally, there is mention of "numericOrdering" in this Unicode TR [7] but it seems fairly non-specific, and I don't know how it applies. The point here is that the Unicode community is aware of this kind of ordering, and various libraries that implement Unicode collation, such as ICU [8], might have implementations that can provide guidance.


3. Does it have the right semantics?

I think you can see from the above survey that there is no standard: different implementations are all over the map, and most if not all are ill-specified. What is useful about the survey is that it shows what people are actually using, and that many of the implementations have things in common. Two items jump out at me:

 - case-insensitive comparison (sometimes optional)
 - locale-specific collation

The obvious (but simplistic) thing to do is to provide variations of this API that can use String.CASE_INSENSITIVE_ORDER. Note however that its doc specifically states that it provides "unsatisfactory ordering for certain locales" and directs the reader to the Collator class, which does take locale into account.
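
As a small illustration of that difference, the sketch below sorts the same words by code unit order, with String.CASE_INSENSITIVE_ORDER, and with a Collator. The class and method names are made up for the example, and the choice of Locale.ENGLISH and PRIMARY strength (which ignores accent differences as well as case) is just one plausible configuration, not a recommendation.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch comparing three existing JDK orderings.
final class CaseAndLocaleSketch {

    /** Returns a sorted copy of the input, leaving the input untouched. */
    static List<String> sorted(List<String> input, Comparator<? super String> order) {
        List<String> copy = new ArrayList<>(input);
        copy.sort(order);
        return copy;
    }

    /** A collator for the locale that ignores case and accent differences. */
    static Collator caseInsensitiveCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    public static void main(String[] args) {
        // \u00e9 is LATIN SMALL LETTER E WITH ACUTE
        List<String> words = Arrays.asList("zebra", "\u00e9clair", "Banana", "apple");

        // Code unit order: uppercase first, accented letter last.
        System.out.println(sorted(words, Comparator.<String>naturalOrder()));

        // Case folded per char, but the accented letter still sorts after 'z'.
        System.out.println(sorted(words, String.CASE_INSENSITIVE_ORDER));

        // Collator (implements Comparator<Object>) applies locale rules.
        System.out.println(sorted(words, caseInsensitiveCollator(Locale.ENGLISH)));
    }
}
```

With CASE_INSENSITIVE_ORDER the accented word lands after "zebra", which is the kind of result the doc warns about; the Collator is expected to place it next to the other "e" words.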

Now, I'm sensitive about making this more complicated than necessary. But the point of a "logical" comparator is to provide something that makes sense to humans looking at the result, which implies that locale-specific collation needs to be applied, as well as case insensitivity (which itself is locale-specific). So I think consideration of those is indeed necessary.

I don't know what the API should look like. The java.text.Collator class implements Comparator. This suggests the possibility of an API that allows a "downstream" comparator to be specified, to which ordering of certain subsequences can be delegated.
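
To give a rough feel for what such a "downstream" design could look like, here is a sketch. It is only one possible shape, not a proposal for the actual API: the factory name numericAware, the rule that digit runs sort before text runs, and the use of BigInteger for the numeric parts are all assumptions I made to keep the example short.

```java
import java.math.BigInteger;
import java.text.Collator;
import java.util.Comparator;
import java.util.Locale;

// Hypothetical sketch: digit runs compare numerically, text runs are
// delegated to a caller-supplied "downstream" comparator.
final class DownstreamSketch {

    static Comparator<String> numericAware(Comparator<? super String> textOrder) {
        return (a, b) -> {
            int i = 0, j = 0;
            while (i < a.length() && j < b.length()) {
                boolean da = isDigit(a.charAt(i)), db = isDigit(b.charAt(j));
                int iEnd = runEnd(a, i, da), jEnd = runEnd(b, j, db);
                String ra = a.substring(i, iEnd), rb = b.substring(j, jEnd);
                int c;
                if (da && db) {
                    c = new BigInteger(ra).compareTo(new BigInteger(rb));
                } else if (da != db) {
                    c = da ? -1 : 1; // assumption: numbers sort before text
                } else {
                    c = textOrder.compare(ra, rb); // delegate the text run
                }
                if (c != 0) return c;
                i = iEnd;
                j = jEnd;
            }
            // A string that ran out of runs first sorts first.
            return Integer.compare(a.length() - i, b.length() - j);
        };
    }

    // Assumption: only ASCII digits are treated as numeric content.
    private static boolean isDigit(char c) {
        return c >= '0' && c <= '9';
    }

    // End index of the run starting at 'start' whose chars all have the
    // same digit/non-digit classification as the first one.
    private static int runEnd(String s, int start, boolean digit) {
        int k = start;
        while (k < s.length() && isDigit(s.charAt(k)) == digit) k++;
        return k;
    }

    public static void main(String[] args) {
        Comparator<String> exact = numericAware(Comparator.<String>naturalOrder());
        Comparator<String> ignoreCase = numericAware(String.CASE_INSENSITIVE_ORDER);

        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        Comparator<String> localized = numericAware(collator);

        System.out.println(exact.compare("file2", "File10") > 0);      // true
        System.out.println(ignoreCase.compare("file2", "File10") < 0); // true
        System.out.println(localized.compare("file2", "File10") < 0);
    }
}
```

One consequence of delegating is that the result can be 0 for strings that are not equal (for example "File2" and "file2" with a case-insensitive downstream), so whether and how to break such ties would need to be part of the specification.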

s'marks



[1] https://msdn.microsoft.com/en-us/library/windows/desktop/bb759947(v=vs.85).aspx

[2] https://msdn.microsoft.com/en-us/library/windows/desktop/dd317761(v=vs.85).aspx

[3] https://developer.apple.com/documentation/foundation/nsstring/1409742-localizedstandardcompare?language=objc

[4] https://developer.apple.com/documentation/swift/string/1408384-localizedstandardcompare

[5] https://blog.codinghorror.com/sorting-for-humans-natural-sort-order/

[6] https://www.npmjs.com/package/javascript-natural-sort

[7] http://unicode.org/reports/tr35/tr35-collation.html#Setting_Options

[8] http://userguide.icu-project.org/collation



On 7/19/17 1:41 AM, Ivan Gerasimov wrote:
Hello!

It is a proposal to provide a String comparator, which will pay attention to the
numbers embedded into the strings (should they present).

This proposal was initially discussed back in 2014 and seemed to bring some
interest from the community:
http://mail.openjdk.java.net/pipermail/core-libs-dev/2014-December/030343.html

In the latest webrev two methods are added to the public API:
j.u.Comparator.comparingNumerically() and
j.u.Comparator.comparingNumericallyLeadingZerosAhead().

The regression test is extended to exercise this new comparator.

BUGURL: https://bugs.openjdk.java.net/browse/JDK-8134512
WEBREV: http://cr.openjdk.java.net/~igerasim/8134512/01/webrev/

Comments, suggestions are very welcome!
