Script 'mail_helper' called by obssrc
Hello community,
here is the log from the commit of package ghc-unicode-collation for
openSUSE:Factory checked in at 2026-06-10 16:09:08
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/ghc-unicode-collation (Old)
and /work/SRC/openSUSE:Factory/.ghc-unicode-collation.new.2375 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Package is "ghc-unicode-collation"
Wed Jun 10 16:09:08 2026 rev:10 rq:1358467 version:0.1.3.7
Changes:
--------
--- /work/SRC/openSUSE:Factory/ghc-unicode-collation/ghc-unicode-collation.changes 2025-01-28 16:41:23.587582668 +0100
+++ /work/SRC/openSUSE:Factory/.ghc-unicode-collation.new.2375/ghc-unicode-collation.changes 2026-06-10 16:13:57.668455023 +0200
@@ -1,0 +2,54 @@
+Sat Jun 6 10:48:22 UTC 2026 - Peter Simons <[email protected]>
+
+- Update unicode-collation to version 0.1.3.7.
+ ## 0.1.3.7
+
+ * Docs: recommend `sortOn (sortKey c)` over `sortBy (collate c)`.
+ `collate` recomputes the sort key of each argument on every
+ comparison, whereas `sortOn` computes each key only once. Update
+ the headline example and guidance accordingly.
+
+ * Replace IntSet lookups with range checks in implicit weight calculation.
+ `calculateImplicitWeight` previously tested CJK/ideograph
+ membership using large IntSets defined in its where-clause.
+ Since the data is essentially contiguous ranges, replace the
+ IntSets with simple range-comparison predicates. This
+ is ~25% faster on CJK-heavy sorting at -O1. Also drop a
+ duplicate range, remove the now-unused Data.IntSet import,
+ and correct a misleading (harmless, due to Word16 truncation)
+ 0x7FFFF mask to 0x7FFF in the fallback case.
+
+ * Test: guard implicit-weight ideograph ranges against data drift.
+ The CJK/Tangut/Nushu/Khitan ranges in Text.Collate.Collation are
+ hand-coded and decoupled from the shipped Unicode data. Add a test
+ that enumerates every code point the data marks as an ideograph
+ (from the @implicitweights directives in allkeys.txt and the CJK
+ Ideograph First/Last markers in UnicodeData.txt) and asserts each
+ receives an ideographic implicit primary weight (< 0xFBC0) rather
+ than the generic unassigned bucket. A future data bump that does
+ not update the ranges will now fail, naming the uncovered points.
+
+ * Benchmarks: force full sorts with `nf` instead of `whnf`.
+ `whnf` only evaluated the result list to its first cons cell, so the
+ benchmarks measured finding the minimum rather than a full sort. Using
+ `nf` forces the entire sorted list, giving a representative measurement.
+
+ * Add `bench-sortkey` benchmark comparing `collate` with `sortKey`.
+ This self-contained benchmark (no text-icu/QuickCheck deps) measures
+ `sortBy (collate c)` against `sortOn (sortKey c)` across several input
+ shapes, demonstrating the speedup from computing each sort key once.
+
+ * Add an `icu-benchmark` flag (default off) guarding the existing
+ text-icu benchmark, so the new benchmark can be built without the ICU
+ C library.
+
+ * Add a CJK-focused benchmark exercising the implicit weight path.
+
+ * Add flake.nix.
+
+ * CI: Support GHC 9.10 (David Binder), 9.12 (Austin Ziegler), and
+ 9.14 (Li-yao Xia).
+
+ * Add cabal.project.
+
+-------------------------------------------------------------------
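
The `sortOn (sortKey c)` recommendation in the entry above can be illustrated without the library. What follows is a minimal sketch, assuming a hypothetical `toyKey` (plain case folding) as a stand-in for `sortKey c`; it is not the library's sort key, and the function names are invented for illustration. It shows the comparator style, the keyed style, and the decorate-sort-undecorate pattern that `sortOn` is documented to use.

```haskell
import Data.Char (toLower)
import Data.List (sortBy, sortOn)
import Data.Ord (comparing)

-- Hypothetical stand-in for an expensive key function such as 'sortKey coll'.
-- It is NOT the library's sort key: it only case-folds the string.
toyKey :: String -> String
toyKey = map toLower

-- Comparator style: 'toyKey' is re-run on both arguments at every
-- comparison, so the key work grows with the number of comparisons.
sortRecomputing :: [String] -> [String]
sortRecomputing = sortBy (comparing toyKey)

-- Keyed style: 'sortOn' evaluates 'toyKey' once per element.
sortKeyed :: [String] -> [String]
sortKeyed = sortOn toyKey

-- The same idea spelled out: pair each element with its key, sort on the
-- key, then drop the key again (decorate, sort, undecorate).
sortDecorated :: [String] -> [String]
sortDecorated xs =
  map snd (sortBy (comparing fst) [ (toyKey x, x) | x <- xs ])
```

Loaded in GHCi, all three functions return the same ordering for the same input, since each is a stable sort on the same key; only the number of key computations differs.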
Old:
----
unicode-collation-0.1.3.6.tar.gz
unicode-collation.cabal
New:
----
unicode-collation-0.1.3.7.tar.gz
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Other differences:
------------------
++++++ ghc-unicode-collation.spec ++++++
--- /var/tmp/diff_new_pack.vtylYA/_old 2026-06-10 16:14:02.100638694 +0200
+++ /var/tmp/diff_new_pack.vtylYA/_new 2026-06-10 16:14:02.124639688 +0200
@@ -1,7 +1,7 @@
#
# spec file for package ghc-unicode-collation
#
-# Copyright (c) 2025 SUSE LLC
+# Copyright (c) 2026 SUSE LLC
#
# All modifications and additions to the file contributed by third parties
# remain the property of their copyright owners, unless otherwise agreed
@@ -20,13 +20,12 @@
%global pkgver %{pkg_name}-%{version}
%bcond_with tests
Name: ghc-%{pkg_name}
-Version: 0.1.3.6
+Version: 0.1.3.7
Release: 0
Summary: Haskell implementation of the Unicode Collation Algorithm
License: BSD-2-Clause
URL: https://hackage.haskell.org/package/%{pkg_name}
Source0: https://hackage.haskell.org/package/%{pkg_name}-%{version}/%{pkg_name}-%{version}.tar.gz
-Source1: https://hackage.haskell.org/package/%{pkg_name}-%{version}/revision/2.cabal#/%{pkg_name}.cabal
BuildRequires: ghc-Cabal-devel
BuildRequires: ghc-base-devel
BuildRequires: ghc-base-prof
@@ -92,7 +91,6 @@
%prep
%autosetup -n %{pkg_name}-%{version}
-cp -p %{SOURCE1} %{pkg_name}.cabal
%build
%ghc_lib_build
++++++ unicode-collation-0.1.3.6.tar.gz -> unicode-collation-0.1.3.7.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/CHANGELOG.md new/unicode-collation-0.1.3.7/CHANGELOG.md
--- old/unicode-collation-0.1.3.6/CHANGELOG.md 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/CHANGELOG.md 2001-09-09 03:46:40.000000000 +0200
@@ -2,6 +2,56 @@
`unicode-collation` uses [PVP Versioning](https://pvp.haskell.org).
+## 0.1.3.7
+
+ * Docs: recommend `sortOn (sortKey c)` over `sortBy (collate c)`.
+ `collate` recomputes the sort key of each argument on every
+ comparison, whereas `sortOn` computes each key only once. Update
+ the headline example and guidance accordingly.
+
+ * Replace IntSet lookups with range checks in implicit weight calculation.
+ `calculateImplicitWeight` previously tested CJK/ideograph
+ membership using large IntSets defined in its where-clause.
+ Since the data is essentially contiguous ranges, replace the
+ IntSets with simple range-comparison predicates. This
+ is ~25% faster on CJK-heavy sorting at -O1. Also drop a
+ duplicate range, remove the now-unused Data.IntSet import,
+ and correct a misleading (harmless, due to Word16 truncation)
+ 0x7FFFF mask to 0x7FFF in the fallback case.
+
+ * Test: guard implicit-weight ideograph ranges against data drift.
+ The CJK/Tangut/Nushu/Khitan ranges in Text.Collate.Collation are
+ hand-coded and decoupled from the shipped Unicode data. Add a test
+ that enumerates every code point the data marks as an ideograph
+ (from the @implicitweights directives in allkeys.txt and the CJK
+ Ideograph First/Last markers in UnicodeData.txt) and asserts each
+ receives an ideographic implicit primary weight (< 0xFBC0) rather
+ than the generic unassigned bucket. A future data bump that does
+ not update the ranges will now fail, naming the uncovered points.
+
+ * Benchmarks: force full sorts with `nf` instead of `whnf`.
+ `whnf` only evaluated the result list to its first cons cell, so the
+ benchmarks measured finding the minimum rather than a full sort. Using
+ `nf` forces the entire sorted list, giving a representative measurement.
+
+ * Add `bench-sortkey` benchmark comparing `collate` with `sortKey`.
+ This self-contained benchmark (no text-icu/QuickCheck deps) measures
+ `sortBy (collate c)` against `sortOn (sortKey c)` across several input
+ shapes, demonstrating the speedup from computing each sort key once.
+
+ * Add an `icu-benchmark` flag (default off) guarding the existing
+ text-icu benchmark, so the new benchmark can be built without the ICU
+ C library.
+
+ * Add a CJK-focused benchmark exercising the implicit weight path.
+
+ * Add flake.nix.
+
+ * CI: Support GHC 9.10 (David Binder), 9.12 (Austin Ziegler), and
+ 9.14 (Li-yao Xia).
+
+ * Add cabal.project.
+
## 0.1.3.6
* Update to build with GHC 9.8 (Laurent P. René de Cotret).
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/benchmark/Main.hs new/unicode-collation-0.1.3.7/benchmark/Main.hs
--- old/unicode-collation-0.1.3.6/benchmark/Main.hs 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/benchmark/Main.hs 2001-09-09 03:46:40.000000000 +0200
@@ -19,10 +19,13 @@
(randomTexts :: [Text]) <- generate (infiniteListOf arbitrary)
(randomLatinStrings :: [String]) <-
generate (infiniteListOf (listOf (elements latinChars)))
+ (randomCJKStrings :: [String]) <-
+ generate (infiniteListOf (listOf (elements cjkChars)))
(randomAsciiTexts :: [Text]) <-
generate (infiniteListOf (arbitrary `suchThat` T.all isAscii))
let tenThousand = take 10000 randomTexts
let tenThousandLatin = map T.pack $ take 10000 randomLatinStrings
+ let tenThousandCJK = map T.pack $ take 10000 randomCJKStrings
let tenThousandLatinNFD = map (T.pack . map chr . toNFD . map ord . T.unpack)
tenThousandLatin
let tenThousandString = map T.unpack tenThousand
@@ -33,28 +36,43 @@
let collateString = collateWithUnpacker (collatorFor "en") id
defaultMain
[ bench "sort a list of 10000 random Texts (en)"
- (whnf (sortBy (collate (collatorFor "en"))) tenThousand)
+ (nf (sortBy (collate (collatorFor "en"))) tenThousand)
, bench "sort same list with text-icu (en)"
- (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousand)
+ (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousand)
, bench "sort a list of 10000 Texts (composed latin) (en)"
- (whnf (sortBy (collate (collatorFor "en"))) tenThousandLatin)
+ (nf (sortBy (collate (collatorFor "en"))) tenThousandLatin)
, bench "sort same list with text-icu (en)"
- (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLatin)
+ (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLatin)
, bench "sort same list but pre-normalized (en-u-kk-false)"
- (whnf (sortBy (collate (collatorFor "en-u-kk-false"))) tenThousandLatinNFD)
+ (nf (sortBy (collate (collatorFor "en-u-kk-false"))) tenThousandLatinNFD)
+ , bench "sort a list of 10000 CJK Texts (en, implicit weights)"
+ (nf (sortBy (collate (collatorFor "en"))) tenThousandCJK)
+ , bench "sort same CJK list with text-icu (en)"
+ (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandCJK)
, bench "sort a list of 10000 ASCII Texts (en)"
- (whnf (sortBy (collate (collatorFor "en"))) tenThousandAscii)
+ (nf (sortBy (collate (collatorFor "en"))) tenThousandAscii)
, bench "sort same list with text-icu (en)"
- (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandAscii)
+ (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandAscii)
, bench "sort a list of 10000 random Texts that agree in first 32 chars"
- (whnf (sortBy (collate (collatorFor "en"))) tenThousandLong)
+ (nf (sortBy (collate (collatorFor "en"))) tenThousandLong)
, bench "sort same list with text-icu (en)"
- (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLong)
+ (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLong)
, bench "sort a list of 10000 identical Texts (en)"
- (whnf (sortBy collateString) (replicate 10000 "ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔ"))
+ (nf (sortBy collateString) (replicate 10000 "ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔ"))
, bench "sort a list of 10000 random Strings (en)"
- (whnf (sortBy collateString) tenThousandString)
+ (nf (sortBy collateString) tenThousandString)
]
+-- A mix of CJK ideographs, all of which are assigned implicit weights
+-- (they are not listed individually in the DUCET), so this exercises
+-- 'calculateImplicitWeight'. Includes the "Core Han" path (CJK Unified
+-- Ideographs, BMP) as well as the "All Other Han Unified" path (Extension A
+-- and a sample of supplementary Extension B).
+cjkChars :: [Char]
+cjkChars = map chr $
+ [0x4E00..0x9FFF] -- CJK Unified Ideographs (BMP, Core Han)
+ ++ [0x3400..0x4DBF] -- CJK Extension A
+ ++ [0x20000..0x20FFF] -- sample of CJK Extension B (supplementary)
+
latinChars :: [Char]
latinChars =
"ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿ‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…"
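
The switch from `whnf` to `nf` in the benchmark above rests on how far each one evaluates a lazy list. The following is a minimal sketch using only `base`; `forceList`, `partial` and `succeeds` are hypothetical helpers invented for illustration, and `forceList` is only a rough stand-in for what `nf` demands (it is adequate for flat element types such as `Int`).

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- Walk the whole spine and force every element to WHNF. For flat element
-- types such as Int this amounts to full evaluation of the list.
forceList :: [a] -> ()
forceList = foldr seq ()

-- A list whose first cons cell exists but whose tail fails when demanded.
partial :: [Int]
partial = 1 : error "tail demanded"

-- Evaluate a value to WHNF and report whether that succeeded.
succeeds :: a -> IO Bool
succeeds x = do
  r <- try (evaluate x)
  pure $ case r of
    Left e  -> const False (e :: ErrorCall)
    Right _ -> True
```

In GHCi, `succeeds partial` reports success because only the first cons cell is inspected, while `succeeds (forceList partial)` reports failure because the tail is demanded. A benchmark that stops at the first cons cell of a sorted list therefore never pays for building the rest of it.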
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/benchmark/SortKey.hs new/unicode-collation-0.1.3.7/benchmark/SortKey.hs
--- old/unicode-collation-0.1.3.6/benchmark/SortKey.hs 1970-01-01 01:00:00.000000000 +0100
+++ new/unicode-collation-0.1.3.7/benchmark/SortKey.hs 2001-09-09 03:46:40.000000000 +0200
@@ -0,0 +1,107 @@
+{-# LANGUAGE OverloadedStrings #-}
+{-# LANGUAGE ScopedTypeVariables #-}
+
+-- | Benchmark comparing two ways of sorting with a 'Collator':
+--
+-- * naive: @sortBy (collate coll)@ -- recomputes both sort keys
+-- on every comparison
+-- * keyed: @sortOn (sortKey coll)@ -- computes each sort key once
+--
+-- Inputs are generated with a self-contained xorshift PRNG so this target
+-- depends only on the library, text, deepseq and tasty-bench (no text-icu /
+-- QuickCheck). Results are forced with 'nf' so the whole sort is measured.
+module Main (main) where
+
+import Test.Tasty.Bench
+import Control.DeepSeq (force)
+import Control.Exception (evaluate)
+import Data.Bits (xor, shiftL, shiftR)
+import Data.Char (chr)
+import Data.List (sortBy, sortOn)
+import Data.Word (Word64)
+import Data.Text (Text)
+import qualified Data.Text as T
+import Text.Collate
+
+-- ---------------------------------------------------------------------------
+-- A tiny deterministic PRNG (xorshift64), so benchmarks are reproducible
+-- and we avoid a QuickCheck dependency.
+
+xorshift :: Word64 -> Word64
+xorshift s0 =
+ let s1 = s0 `xor` (s0 `shiftL` 13)
+ s2 = s1 `xor` (s1 `shiftR` 7)
+ in s2 `xor` (s2 `shiftL` 17)
+
+randoms :: Word64 -> [Word64]
+randoms = drop 1 . iterate xorshift
+
+-- Map a Word64 to a Char in [lo,hi] (inclusive code point range).
+type CharGen = Word64 -> Char
+
+rangeGen :: Int -> Int -> CharGen
+rangeGen lo hi w = chr (lo + fromIntegral (w `mod` fromIntegral (hi - lo + 1)))
+
+-- Build n Texts. Each text has a length drawn from [minLen,maxLen] and
+-- characters produced by the given CharGen.
+buildTexts :: Int -> (Int, Int) -> CharGen -> [Word64] -> [Text]
+buildTexts 0 _ _ _ = []
+buildTexts n (lo, hi) g ws0 =
+ case ws0 of
+ [] -> []
+ (lenW : ws) ->
+ let len = lo + fromIntegral (lenW `mod` fromIntegral (hi - lo + 1))
+ (chunk, ws') = splitAt len ws
+ txt = T.pack (map g chunk)
+ in txt : buildTexts (n - 1) (lo, hi) g ws'
+
+-- A text whose first 25 chars are a shared prefix (worst case for the naive
+-- sort: every comparison must scan past the identical prefix before it can
+-- decide, forcing nearly the whole sort key each time).
+withSharedPrefix :: Text -> Text
+withSharedPrefix = ("The quick brown fox jumps" <>)
+
+-- ---------------------------------------------------------------------------
+-- The two sorting strategies.
+
+naiveSort :: Collator -> [Text] -> [Text]
+naiveSort coll = sortBy (collate coll)
+
+keyedSort :: Collator -> [Text] -> [Text]
+keyedSort coll = sortOn (sortKey coll)
+
+-- ---------------------------------------------------------------------------
+
+main :: IO ()
+main = do
+ let coll = collatorFor "en"
+ n = 10000
+
+ -- Four data shapes, each a forced list of 10000 Texts.
+ ascii <- evaluate $ force $
+ buildTexts n (1, 20) (rangeGen 0x61 0x7A) (randoms 0x1234)
+ latin <- evaluate $ force $ -- Latin Extended Additional: need NFD
+ buildTexts n (1, 20) (rangeGen 0x1E00 0x1EFF) (randoms 0x5678)
+ cjk <- evaluate $ force $ -- implicit weights path
+ buildTexts n (1, 20) (rangeGen 0x4E00 0x9FFF) (randoms 0x9ABC)
+ long <- evaluate $ force $ map withSharedPrefix $
+ buildTexts n (1, 20) (rangeGen 0x61 0x7A) (randoms 0xDEF0)
+
+ defaultMain
+ [ bgroup "ascii"
+ [ bench "naive (collate)" $ nf (naiveSort coll) ascii
+ , bench "keyed (sortKey)" $ nf (keyedSort coll) ascii
+ ]
+ , bgroup "latin (needs NFD)"
+ [ bench "naive (collate)" $ nf (naiveSort coll) latin
+ , bench "keyed (sortKey)" $ nf (keyedSort coll) latin
+ ]
+ , bgroup "cjk (implicit weights)"
+ [ bench "naive (collate)" $ nf (naiveSort coll) cjk
+ , bench "keyed (sortKey)" $ nf (keyedSort coll) cjk
+ ]
+ , bgroup "shared-prefix (worst case)"
+ [ bench "naive (collate)" $ nf (naiveSort coll) long
+ , bench "keyed (sortKey)" $ nf (keyedSort coll) long
+ ]
+ ]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/src/Text/Collate/Collation.hs new/unicode-collation-0.1.3.7/src/Text/Collate/Collation.hs
--- old/unicode-collation-0.1.3.6/src/Text/Collate/Collation.hs 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/src/Text/Collate/Collation.hs 2001-09-09 03:46:40.000000000 +0200
@@ -19,7 +19,6 @@
)
where
-import qualified Data.IntSet as IntSet
import qualified Data.Text as T
import qualified Data.Text.Read as TR
import Data.Text (Text)
@@ -219,34 +218,37 @@
-- @implicitweights 18D00..18D8F; FB00 # Tangut Supplement
-- @implicitweights 1B170..1B2FF; FB01 # Nushu
-- @implicitweights 18B00..18CFF; FB02 # Khitan Small Script
+-- from PropList.txt in unicode data:
+isUnifiedIdeograph :: Int -> Bool
+isUnifiedIdeograph cp =
+ (cp >= 0x3400 && cp <= 0x4DBF)
+ || (cp >= 0x4E00 && cp <= 0x9FFC)
+ || (cp >= 0xFA0E && cp <= 0xFA0F)
+ || cp == 0xFA11
+ || (cp >= 0xFA13 && cp <= 0xFA14)
+ || cp == 0xFA1F
+ || cp == 0xFA21
+ || (cp >= 0xFA23 && cp <= 0xFA24)
+ || (cp >= 0xFA27 && cp <= 0xFA29)
+ || (cp >= 0x20000 && cp <= 0x2A6DD)
+ || (cp >= 0x2A700 && cp <= 0x2B734)
+ || (cp >= 0x2B740 && cp <= 0x2B81D)
+ || (cp >= 0x2B820 && cp <= 0x2CEA1)
+ || (cp >= 0x2CEB0 && cp <= 0x2EBE0)
+ || (cp >= 0x30000 && cp <= 0x3134A)
+
+-- from Blocks.txt in unicode data:
+isCjkCompatibilityIdeograph :: Int -> Bool
+isCjkCompatibilityIdeograph cp = cp >= 0xF900 && cp <= 0xFAFF
+
+isCjkUnifiedIdeograph :: Int -> Bool
+isCjkUnifiedIdeograph cp = cp >= 0x4E00 && cp <= 0x9FFF
+
calculateImplicitWeight :: Int -> [CollationElement]
calculateImplicitWeight cp =
[CollationElement False (fromIntegral aaaa) 0x0020 0x0002 0xFFFF,
CollationElement False (fromIntegral bbbb) 0 0 0xFFFF]
where
- range x y = IntSet.fromList [x..y]
- singleton = IntSet.singleton
- union = IntSet.union
- -- from PropList.txt in unicode data:
- unifiedIdeographs = range 0x3400 0x4DBF `union`
- range 0x4E00 0x9FFC `union`
- range 0xFA0E 0xFA0F `union`
- singleton 0xFA11 `union`
- range 0xFA13 0xFA14 `union`
- singleton 0xFA1F `union`
- singleton 0xFA21 `union`
- range 0xFA23 0xFA24 `union`
- range 0xFA27 0xFA29 `union`
- range 0x20000 0x2A6DD `union`
- range 0x2A700 0x2B734 `union`
- range 0x2B740 0x2B81D `union`
- range 0x2B820 0x2CEA1 `union`
- range 0x2CEB0 0x2EBE0 `union`
- range 0x2CEB0 0x2EBE0 `union`
- range 0x30000 0x3134A
- -- from Blocks.txt in unicode data:
- cjkCompatibilityIdeographs = range 0xF900 0xFAFF
- cjkUnifiedIdeographs = range 0x4E00 0x9FFF
(aaaa, bbbb) =
case cp of
_ | cp >= 0x17000 , cp <= 0x18AFF -- Tangut and Tangut Components
@@ -257,14 +259,14 @@
-> (0xFB01, (cp - 0x1B170) .|. 0x8000)
| cp >= 0x18B00 , cp <= 0x18CFF -- Khitan Small Script
-> (0xFB02, (cp - 0x18B00) .|. 0x8000)
- | cp `IntSet.member` unifiedIdeographs &&
- (cp `IntSet.member` cjkUnifiedIdeographs ||
- cp `IntSet.member` cjkCompatibilityIdeographs) -- Core Han
+ | isUnifiedIdeograph cp &&
+ (isCjkUnifiedIdeograph cp ||
+ isCjkCompatibilityIdeograph cp) -- Core Han
-> (0xFB40 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
- | cp `IntSet.member` unifiedIdeographs -- All Other Han Unified ?
+ | isUnifiedIdeograph cp -- All Other Han Unified ?
-> (0xFB80 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
| otherwise
- -> (0xFBC0 + (cp `shiftR` 15), (cp .&. 0x7FFFF) .|. 0x8000)
+ -> (0xFBC0 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
-- | Parse a 'Collation' from a Text in the format of @allkeys.txt@.
parseCollation :: Text -> Collation
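
The changelog calls the old `0x7FFFF` mask harmless because of Word16 truncation. A minimal sketch of that argument follows, assuming (as the changelog states) that the weight ends up in a `Word16`; the names `bbbbOld`, `bbbbNew`, `aaaaFallback`, `masksAgree` and `isCjkUnifiedBlock` are hypothetical and exist only for this illustration. It reuses the fallback formulas and one range predicate from the hunk above.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word16)

-- A range predicate in the style of the new code (one block only).
isCjkUnifiedBlock :: Int -> Bool
isCjkUnifiedBlock cp = cp >= 0x4E00 && cp <= 0x9FFF

-- Second implicit weight with the old mask and with the corrected mask,
-- each truncated to Word16 by 'fromIntegral'.
bbbbOld, bbbbNew :: Int -> Word16
bbbbOld cp = fromIntegral ((cp .&. 0x7FFFF) .|. 0x8000)
bbbbNew cp = fromIntegral ((cp .&. 0x7FFF) .|. 0x8000)

-- First implicit weight in the fallback (unassigned) case.
aaaaFallback :: Int -> Word16
aaaaFallback cp = fromIntegral (0xFBC0 + (cp `shiftR` 15))

-- The two masks differ only in bits 15..18. Bit 15 is set by the OR with
-- 0x8000 either way, and bits 16..18 are cut off by the Word16 truncation,
-- so the results agree for every code point.
masksAgree :: Bool
masksAgree = all (\cp -> bbbbOld cp == bbbbNew cp) [0 .. 0x10FFFF]
```

Evaluating `masksAgree` in GHCi checks the whole code point range, which is why the mask change is a readability fix rather than a behavior change under this assumption.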
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/src/Text/Collate.hs new/unicode-collation-0.1.3.7/src/Text/Collate.hs
--- old/unicode-collation-0.1.3.6/src/Text/Collate.hs 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/src/Text/Collate.hs 2001-09-09 03:46:40.000000000 +0200
@@ -11,9 +11,9 @@
instance of 'Collator' (together with the @OverloadedStrings@
extension):
->>> import Data.List (sortBy)
+>>> import Data.List (sortOn)
>>> import qualified Data.Text.IO as T
->>> mapM_ T.putStrLn $ sortBy (collate "en-US") ["𝒶bc","abC","𝕒bc","Abc","abç","äbc"]
+>>> mapM_ T.putStrLn $ sortOn (sortKey "en-US") ["𝒶bc","abC","𝕒bc","Abc","abç","äbc"]
abC
𝒶bc
𝕒bc
@@ -34,8 +34,13 @@
𝕒bc
A 'Collator' provides a function 'collate' that compares two texts,
-and a function 'sortKey' that returns the sort key. Most users will
-just need 'collate'.
+and a function 'sortKey' that returns the sort key. Use 'collate'
+when you only need to compare two values. To sort a list, prefer
+@sortOn (sortKey collator)@ (as above) over @sortBy (collate collator)@:
+'collate' recomputes the sort key of each argument on every
+comparison, whereas 'sortOn' (from "Data.List") computes each key only
+once. This is substantially faster when sorting large lists (in
+benchmarks, roughly 3-5x).
>>> let de = collatorFor "de"
>>> let se = collatorFor "se"
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/test/unit.hs new/unicode-collation-0.1.3.7/test/unit.hs
--- old/unicode-collation-0.1.3.6/test/unit.hs 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/test/unit.hs 2001-09-09 03:46:40.000000000 +0200
@@ -21,11 +21,13 @@
main :: IO ()
main = do
conformanceTree <- conformanceTests
- defaultMain (tests conformanceTree)
+ coverageTree <- implicitWeightCoverageTests
+ defaultMain (tests conformanceTree coverageTree)
-tests :: TestTree -> TestTree
-tests conformanceTree = testGroup "Tests"
+tests :: TestTree -> TestTree -> TestTree
+tests conformanceTree coverageTree = testGroup "Tests"
[ conformanceTree
+ , coverageTree
, testCase "Sorting test 1" $
sortBy (collate ourCollator) ["hi", "hit", "hít", "hat", "hot",
"naïve", "nag", "name"] @?=
@@ -173,6 +175,87 @@
$ map (conformanceTestWith coll)
(zip3 (map fst xs) (map snd xs) (drop 1 (map snd xs)))
+-- | Guard against the hand-coded ideograph ranges in
+-- 'Text.Collate.Collation' (used by 'calculateImplicitWeight') drifting
+-- out of sync with the shipped Unicode data. Every code point that the
+-- data marks as an ideograph must receive an /ideographic/ implicit
+-- primary weight; anything not covered by the hand-coded ranges falls to
+-- the generic unassigned bucket at @0xFBC0@ and beyond. We enumerate the
+-- full set (not just range endpoints) so that adding a new ideograph block
+-- to the data without updating the ranges is caught.
+--
+-- Sources of truth, both in the shipped data:
+--
+-- * @\@implicitweights@ directives in @data\/allkeys.txt@
+-- (Tangut, Nushu, Khitan -> 0xFB00..0xFB02)
+-- * @\<... Ideograph ..., First\/Last\>@ markers in
+-- @data\/UnicodeData.txt@ (CJK Unified + extensions -> 0xFB40\/0xFB80)
+implicitWeightCoverageTests :: IO TestTree
+implicitWeightCoverageTests = do
+ putStrLn "Loading implicit-weight coverage data..."
+ allkeys <- B8.readFile "data/allkeys.txt"
+ ucd <- B8.readFile "data/UnicodeData.txt"
+ let ranges = implicitWeightDirectives allkeys ++ ideographRanges ucd
+ cps = concatMap (\(lo, hi) -> [lo .. hi]) ranges
+ return $ testCase "Implicit weights cover all ideographs in shipped data" $
+ case [ (cp, w) | cp <- cps, let w = implicitPrimary cp, w >= 0xFBC0 ] of
+ [] -> return ()
+ bad -> assertFailure $
+ "Code points marked as ideographs in the data receive the "
+ ++ "generic\n(>= 0xFBC0) implicit weight; the hand-coded ranges "
+ ++ "in\nText.Collate.Collation are stale (" ++ show (length bad)
+ ++ " affected):\n"
+ ++ unlines [ printf " U+%05X -> primary %04X" cp w
+ | (cp, w) <- take 40 bad ]
+
+-- | Primary weight assigned to a single code point (head of its sort key).
+implicitPrimary :: Int -> Int
+implicitPrimary cp =
+ case sortKey rootCollator (T.singleton (chr cp)) of
+ SortKey (w : _) -> fromIntegral w
+ SortKey [] -> 0
+
+-- | Parse the @\@implicitweights LO..HI; WEIGHT@ directives from
+-- @allkeys.txt@ into inclusive code-point ranges.
+implicitWeightDirectives :: B8.ByteString -> [(Int, Int)]
+implicitWeightDirectives = mapMaybe parseLine . B8.lines
+ where
+ parseLine ln =
+ let t = TE.decodeLatin1 ln
+ in if "@implicitweights" `T.isPrefixOf` t
+ then dotRange (T.takeWhile (/= ';') (T.drop 16 t))
+ else Nothing
+ dotRange t =
+ case T.splitOn ".." (T.strip t) of
+ [a, b] -> (,) <$> hexInt a <*> hexInt b
+ _ -> Nothing
+
+-- | Parse the @\<... Ideograph ..., First\/Last\>@ range markers from
+-- @UnicodeData.txt@ into inclusive code-point ranges.
+ideographRanges :: B8.ByteString -> [(Int, Int)]
+ideographRanges = pairUp . mapMaybe marker . B8.lines
+ where
+ marker ln =
+ case T.splitOn ";" (TE.decodeLatin1 ln) of
+ (cpF : nameF : _)
+ | "Ideograph" `T.isInfixOf` nameF
+ , Just cp <- hexInt cpF
+ -> if ", First>" `T.isSuffixOf` nameF then Just (cp, False)
+ else if ", Last>" `T.isSuffixOf` nameF then Just (cp, True)
+ else Nothing
+ _ -> Nothing
+ pairUp ((lo, False) : (hi, True) : rest) = (lo, hi) : pairUp rest
+ pairUp (_ : rest) = pairUp rest
+ pairUp [] = []
+
+-- | Parse a (possibly space-padded) hexadecimal integer, requiring that it
+-- consume the whole input.
+hexInt :: Text -> Maybe Int
+hexInt t =
+ case TR.hexadecimal (T.strip t) of
+ Right (n, rest) | T.null (T.strip rest) -> Just n
+ _ -> Nothing
+
conformanceTestWith :: Collator -> (Int, Text, Text) -> Either String ()
conformanceTestWith coll (lineNo, txt1, txt2) =
let showHexes = unwords . map ((\c -> if c > 0xFFFF
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/unicode-collation.cabal new/unicode-collation-0.1.3.7/unicode-collation.cabal
--- old/unicode-collation-0.1.3.6/unicode-collation.cabal 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/unicode-collation.cabal 2001-09-09 03:46:40.000000000 +0200
@@ -1,6 +1,6 @@
cabal-version: 2.2
name: unicode-collation
-version: 0.1.3.6
+version: 0.1.3.7
synopsis: Haskell implementation of the Unicode Collation Algorithm
description: This library provides a pure Haskell implementation of
the Unicode Collation Algorithm described at
@@ -35,6 +35,7 @@
GHC == 9.4.2
GHC == 9.6.3
GHC == 9.8.1
+ GHC == 9.10.1
source-repository head
type: git
@@ -49,8 +50,12 @@
Description: Build the unicode-collate executable.
Default: False
+flag icu-benchmark
+ Description: Build the text-icu comparison benchmark (needs ICU C lib).
+ Default: False
+
common common-options
- build-depends: base >= 4.11 && < 4.20
+ build-depends: base >= 4.11 && < 4.23
ghc-options: -Wall
-Wcompat
@@ -150,3 +155,18 @@
, quickcheck-instances
, QuickCheck
ghc-options: -rtsopts -with-rtsopts=-A8m
+ if flag(icu-benchmark)
+ buildable: True
+ else
+ buildable: False
+
+benchmark bench-sortkey
+ import: common-options
+ type: exitcode-stdio-1.0
+ hs-source-dirs: benchmark
+ main-is: SortKey.hs
+ build-depends: tasty-bench
+ , unicode-collation
+ , text
+ , deepseq
+ ghc-options: -rtsopts -with-rtsopts=-A8m