commit ghc-unicode-collation for openSUSE:Factory

Source-Sync Wed, 10 Jun 2026 07:14:18 -0700

Script 'mail_helper' called by obssrc
Hello community,

here is the log from the commit of package ghc-unicode-collation for 
openSUSE:Factory checked in at 2026-06-10 16:09:08
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Comparing /work/SRC/openSUSE:Factory/ghc-unicode-collation (Old)
 and      /work/SRC/openSUSE:Factory/.ghc-unicode-collation.new.2375 (New)
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++


Package is "ghc-unicode-collation"

Wed Jun 10 16:09:08 2026 rev:10 rq:1358467 version:0.1.3.7

Changes:
--------
--- /work/SRC/openSUSE:Factory/ghc-unicode-collation/ghc-unicode-collation.changes  2025-01-28 16:41:23.587582668 +0100
+++ /work/SRC/openSUSE:Factory/.ghc-unicode-collation.new.2375/ghc-unicode-collation.changes  2026-06-10 16:13:57.668455023 +0200
@@ -1,0 +2,54 @@
+Sat Jun  6 10:48:22 UTC 2026 - Peter Simons <[email protected]>
+
+- Update unicode-collation to version 0.1.3.7.
+  ## 0.1.3.7
+
+    * Docs: recommend `sortOn (sortKey c)` over `sortBy (collate c)`.
+      `collate` recomputes the sort key of each argument on every
+      comparison, whereas `sortOn` computes each key only once.  Update
+      the headline example and guidance accordingly.
+
+    * Replace IntSet lookups with range checks in implicit weight calculation.
+      `calculateImplicitWeight` previously tested CJK/ideograph
+      membership using large IntSets defined in its where-clause.
+      Since the data is essentially contiguous ranges, replace the
+      IntSets with simple range-comparison predicates. This change
+      is ~25% faster on CJK-heavy sorting at -O1. Also drop a
+      duplicate range, remove the now-unused Data.IntSet import,
+      and correct a misleading (harmless, due to Word16 truncation)
+      0x7FFFF mask to 0x7FFF in the fallback case.
+
+    * Test: guard implicit-weight ideograph ranges against data drift.
+      The CJK/Tangut/Nushu/Khitan ranges in Text.Collate.Collation are
+      hand-coded and decoupled from the shipped Unicode data.  Add a test
+      that enumerates every code point the data marks as an ideograph
+      (from the @implicitweights directives in allkeys.txt and the CJK
+      Ideograph First/Last markers in UnicodeData.txt) and asserts each
+      receives an ideographic implicit primary weight (< 0xFBC0) rather
+      than the generic unassigned bucket.  A future data bump that does
+      not update the ranges will now fail, naming the uncovered points.
+
+    * Benchmarks: force full sorts with `nf` instead of `whnf`.
+      `whnf` only evaluated the result list to its first cons cell, so the
+      benchmarks measured finding the minimum rather than a full sort. Using
+      `nf` forces the entire sorted list, giving a representative measurement.
+
+    * Add `bench-sortkey` benchmark comparing `collate` with `sortKey`.
+      This self-contained benchmark (no text-icu/QuickCheck deps) measures
+      `sortBy (collate c)` against `sortOn (sortKey c)` across several input
+      shapes, demonstrating the speedup from computing each sort key once.
+
+    * Add an `icu-benchmark` flag (default off) guarding the existing
+      text-icu benchmark, so the new benchmark can be built without the ICU
+      C library.
+
+    * Add a CJK-focused benchmark exercising the implicit weight path.
+
+    * Add flake.nix.
+
+    * CI: Support GHC 9.10 (David Binder), 9.12 (Austin Ziegler), and
+      9.14 (Li-yao Xia).
+
+    * Add cabal.project.
+
+-------------------------------------------------------------------

Old:
----
  unicode-collation-0.1.3.6.tar.gz
  unicode-collation.cabal

New:
----
  unicode-collation-0.1.3.7.tar.gz

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Other differences:
------------------
++++++ ghc-unicode-collation.spec ++++++
--- /var/tmp/diff_new_pack.vtylYA/_old  2026-06-10 16:14:02.100638694 +0200
+++ /var/tmp/diff_new_pack.vtylYA/_new  2026-06-10 16:14:02.124639688 +0200
@@ -1,7 +1,7 @@
 #
 # spec file for package ghc-unicode-collation
 #
-# Copyright (c) 2025 SUSE LLC
+# Copyright (c) 2026 SUSE LLC
 #
 # All modifications and additions to the file contributed by third parties
 # remain the property of their copyright owners, unless otherwise agreed
@@ -20,13 +20,12 @@
 %global pkgver %{pkg_name}-%{version}
 %bcond_with tests
 Name:           ghc-%{pkg_name}
-Version:        0.1.3.6
+Version:        0.1.3.7
 Release:        0
 Summary:        Haskell implementation of the Unicode Collation Algorithm
 License:        BSD-2-Clause
 URL:            https://hackage.haskell.org/package/%{pkg_name}
 Source0:        https://hackage.haskell.org/package/%{pkg_name}-%{version}/%{pkg_name}-%{version}.tar.gz
-Source1:        https://hackage.haskell.org/package/%{pkg_name}-%{version}/revision/2.cabal#/%{pkg_name}.cabal
 BuildRequires:  ghc-Cabal-devel
 BuildRequires:  ghc-base-devel
 BuildRequires:  ghc-base-prof
@@ -92,7 +91,6 @@
 
 %prep
 %autosetup -n %{pkg_name}-%{version}
-cp -p %{SOURCE1} %{pkg_name}.cabal
 
 %build
 %ghc_lib_build

++++++ unicode-collation-0.1.3.6.tar.gz -> unicode-collation-0.1.3.7.tar.gz ++++++
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/CHANGELOG.md new/unicode-collation-0.1.3.7/CHANGELOG.md
--- old/unicode-collation-0.1.3.6/CHANGELOG.md  2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/CHANGELOG.md  2001-09-09 03:46:40.000000000 +0200
@@ -2,6 +2,56 @@
 
 `unicode-collation` uses [PVP Versioning](https://pvp.haskell.org).
 
+## 0.1.3.7
+
+  * Docs: recommend `sortOn (sortKey c)` over `sortBy (collate c)`.
+    `collate` recomputes the sort key of each argument on every
+    comparison, whereas `sortOn` computes each key only once.  Update
+    the headline example and guidance accordingly.
+
+  * Replace IntSet lookups with range checks in implicit weight calculation.
+    `calculateImplicitWeight` previously tested CJK/ideograph
+    membership using large IntSets defined in its where-clause.
+    Since the data is essentially contiguous ranges, replace the
+    IntSets with simple range-comparison predicates. This change
+    is ~25% faster on CJK-heavy sorting at -O1. Also drop a
+    duplicate range, remove the now-unused Data.IntSet import,
+    and correct a misleading (harmless, due to Word16 truncation)
+    0x7FFFF mask to 0x7FFF in the fallback case.
+
+  * Test: guard implicit-weight ideograph ranges against data drift.
+    The CJK/Tangut/Nushu/Khitan ranges in Text.Collate.Collation are
+    hand-coded and decoupled from the shipped Unicode data.  Add a test
+    that enumerates every code point the data marks as an ideograph
+    (from the @implicitweights directives in allkeys.txt and the CJK
+    Ideograph First/Last markers in UnicodeData.txt) and asserts each
+    receives an ideographic implicit primary weight (< 0xFBC0) rather
+    than the generic unassigned bucket.  A future data bump that does
+    not update the ranges will now fail, naming the uncovered points.
+
+  * Benchmarks: force full sorts with `nf` instead of `whnf`.
+    `whnf` only evaluated the result list to its first cons cell, so the
+    benchmarks measured finding the minimum rather than a full sort. Using
+    `nf` forces the entire sorted list, giving a representative measurement.
+
+  * Add `bench-sortkey` benchmark comparing `collate` with `sortKey`.
+    This self-contained benchmark (no text-icu/QuickCheck deps) measures
+    `sortBy (collate c)` against `sortOn (sortKey c)` across several input
+    shapes, demonstrating the speedup from computing each sort key once.
+
+  * Add an `icu-benchmark` flag (default off) guarding the existing
+    text-icu benchmark, so the new benchmark can be built without the ICU
+    C library.
+
+  * Add a CJK-focused benchmark exercising the implicit weight path.
+
+  * Add flake.nix.
+
+  * CI: Support GHC 9.10 (David Binder), 9.12 (Austin Ziegler), and
+    9.14 (Li-yao Xia).
+
+  * Add cabal.project.
+
 ## 0.1.3.6
 
   * Update to build with GHC 9.8 (Laurent P. René de Cotret).
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/benchmark/Main.hs new/unicode-collation-0.1.3.7/benchmark/Main.hs
--- old/unicode-collation-0.1.3.6/benchmark/Main.hs     2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/benchmark/Main.hs     2001-09-09 03:46:40.000000000 +0200
@@ -19,10 +19,13 @@
   (randomTexts :: [Text]) <- generate (infiniteListOf arbitrary)
   (randomLatinStrings :: [String]) <-
       generate (infiniteListOf (listOf (elements latinChars)))
+  (randomCJKStrings :: [String]) <-
+      generate (infiniteListOf (listOf (elements cjkChars)))
   (randomAsciiTexts :: [Text]) <-
     generate (infiniteListOf (arbitrary `suchThat` T.all isAscii))
   let tenThousand = take 10000 randomTexts
   let tenThousandLatin = map T.pack $ take 10000 randomLatinStrings
+  let tenThousandCJK = map T.pack $ take 10000 randomCJKStrings
   let tenThousandLatinNFD = map (T.pack . map chr . toNFD . map ord . T.unpack)
                               tenThousandLatin
   let tenThousandString = map T.unpack tenThousand
@@ -33,28 +36,43 @@
   let collateString = collateWithUnpacker (collatorFor "en") id
   defaultMain
     [ bench "sort a list of 10000 random Texts (en)"
-        (whnf (sortBy (collate (collatorFor "en"))) tenThousand)
+        (nf (sortBy (collate (collatorFor "en"))) tenThousand)
     , bench "sort same list with text-icu (en)"
-        (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousand)
+        (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousand)
     , bench "sort a list of 10000 Texts (composed latin) (en)"
-        (whnf (sortBy (collate (collatorFor "en"))) tenThousandLatin)
+        (nf (sortBy (collate (collatorFor "en"))) tenThousandLatin)
     , bench "sort same list with text-icu (en)"
-        (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLatin)
+        (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLatin)
     , bench "sort same list but pre-normalized (en-u-kk-false)"
-        (whnf (sortBy (collate (collatorFor "en-u-kk-false"))) tenThousandLatinNFD)
+        (nf (sortBy (collate (collatorFor "en-u-kk-false"))) tenThousandLatinNFD)
+    , bench "sort a list of 10000 CJK Texts (en, implicit weights)"
+        (nf (sortBy (collate (collatorFor "en"))) tenThousandCJK)
+    , bench "sort same CJK list with text-icu (en)"
+        (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandCJK)
     , bench "sort a list of 10000 ASCII Texts (en)"
-        (whnf (sortBy (collate (collatorFor "en"))) tenThousandAscii)
+        (nf (sortBy (collate (collatorFor "en"))) tenThousandAscii)
     , bench "sort same list with text-icu (en)"
-        (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandAscii)
+        (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandAscii)
     , bench "sort a list of 10000 random Texts that agree in first 32 chars"
-        (whnf (sortBy (collate (collatorFor "en"))) tenThousandLong)
+        (nf (sortBy (collate (collatorFor "en"))) tenThousandLong)
     , bench "sort same list with text-icu (en)"
-        (whnf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLong)
+        (nf (sortBy (ICU.collate (icuCollator "en"))) tenThousandLong)
     , bench "sort a list of 10000 identical Texts (en)"
-        (whnf (sortBy collateString) (replicate 10000 "ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔ"))
+        (nf (sortBy collateString) (replicate 10000 "ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔ"))
     , bench "sort a list of 10000 random Strings (en)"
-        (whnf (sortBy collateString) tenThousandString)
+        (nf (sortBy collateString) tenThousandString)
     ]
 
+-- A mix of CJK ideographs, all of which are assigned implicit weights
+-- (they are not listed individually in the DUCET), so this exercises
+-- 'calculateImplicitWeight'.  Includes the "Core Han" path (CJK Unified
+-- Ideographs, BMP) as well as the "All Other Han Unified" path (Extension A
+-- and a sample of supplementary Extension B).
+cjkChars :: [Char]
+cjkChars = map chr $
+     [0x4E00..0x9FFF]    -- CJK Unified Ideographs (BMP, Core Han)
+  ++ [0x3400..0x4DBF]    -- CJK Extension A
+  ++ [0x20000..0x20FFF]  -- sample of CJK Extension B (supplementary)
+
 latinChars :: [Char]
 latinChars = 
"ḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿ‐‑‒–—―‖‗‘’‚‛“”„‟†‡•‣․‥…"
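The switch from `whnf` to `nf` in the hunk above can be illustrated without any benchmark library. The following is a minimal sketch using only `base`; `forceWhnf`, `forceAll`, and `partial` are hypothetical names chosen for illustration and are not part of tasty-bench or unicode-collation. It shows that evaluating a list to weak head normal form touches only the first cons cell, while forcing the whole list walks every element.

```haskell
import Data.List (sort)

-- Evaluate only the outermost constructor of the list, which is
-- roughly what a `whnf`-style benchmark does to a list result.
forceWhnf :: [a] -> ()
forceWhnf xs = xs `seq` ()

-- Walk the whole spine and evaluate each element to WHNF. For flat
-- element types such as Int this amounts to full evaluation, which
-- is what an `nf`-style benchmark demands (via NFData).
forceAll :: [a] -> ()
forceAll = foldr seq ()

-- A list with an undefined tail: safe to take to WHNF, but forcing
-- it completely would hit `undefined`.
partial :: [Int]
partial = 1 : undefined

-- Forcing a sorted list completely requires the entire sort to run.
fullySorted :: ()
fullySorted = forceAll (sort [3, 1, 2 :: Int])
```

Because `forceWhnf partial` succeeds even though the tail is undefined, a benchmark that stops at WHNF cannot be measuring work done on the rest of the list.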
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/benchmark/SortKey.hs new/unicode-collation-0.1.3.7/benchmark/SortKey.hs
--- old/unicode-collation-0.1.3.6/benchmark/SortKey.hs  1970-01-01 01:00:00.000000000 +0100
+++ new/unicode-collation-0.1.3.7/benchmark/SortKey.hs  2001-09-09 03:46:40.000000000 +0200
@@ -0,0 +1,107 @@
+{-# LANGUAGE OverloadedStrings   #-}
+{-# LANGUAGE ScopedTypeVariables #-}
+
+-- | Benchmark comparing two ways of sorting with a 'Collator':
+--
+--   * naive:   @sortBy (collate coll)@         -- recomputes both sort keys
+--                                                 on every comparison
+--   * keyed:   @sortOn (sortKey coll)@         -- computes each sort key once
+--
+-- Inputs are generated with a self-contained xorshift PRNG so this target
+-- depends only on the library, text, deepseq and tasty-bench (no text-icu /
+-- QuickCheck).  Results are forced with 'nf' so the whole sort is measured.
+module Main (main) where
+
+import Test.Tasty.Bench
+import Control.DeepSeq (force)
+import Control.Exception (evaluate)
+import Data.Bits (xor, shiftL, shiftR)
+import Data.Char (chr)
+import Data.List (sortBy, sortOn)
+import Data.Word (Word64)
+import Data.Text (Text)
+import qualified Data.Text as T
+import Text.Collate
+
+-- ---------------------------------------------------------------------------
+-- A tiny deterministic PRNG (xorshift64), so benchmarks are reproducible
+-- and we avoid a QuickCheck dependency.
+
+xorshift :: Word64 -> Word64
+xorshift s0 =
+  let s1 = s0 `xor` (s0 `shiftL` 13)
+      s2 = s1 `xor` (s1 `shiftR` 7)
+  in  s2 `xor` (s2 `shiftL` 17)
+
+randoms :: Word64 -> [Word64]
+randoms = drop 1 . iterate xorshift
+
+-- Map a Word64 to a Char in [lo,hi] (inclusive code point range).
+type CharGen = Word64 -> Char
+
+rangeGen :: Int -> Int -> CharGen
+rangeGen lo hi w = chr (lo + fromIntegral (w `mod` fromIntegral (hi - lo + 1)))
+
+-- Build n Texts. Each text has a length drawn from [minLen,maxLen] and
+-- characters produced by the given CharGen.
+buildTexts :: Int -> (Int, Int) -> CharGen -> [Word64] -> [Text]
+buildTexts 0 _ _ _ = []
+buildTexts n (lo, hi) g ws0 =
+  case ws0 of
+    [] -> []
+    (lenW : ws) ->
+      let len          = lo + fromIntegral (lenW `mod` fromIntegral (hi - lo + 1))
+          (chunk, ws') = splitAt len ws
+          txt          = T.pack (map g chunk)
+      in txt : buildTexts (n - 1) (lo, hi) g ws'
+
+-- A text whose first 25 chars are a shared prefix (worst case for the naive
+-- sort: every comparison must scan past the identical prefix before it can
+-- decide, forcing nearly the whole sort key each time).
+withSharedPrefix :: Text -> Text
+withSharedPrefix = ("The quick brown fox jumps" <>)
+
+-- ---------------------------------------------------------------------------
+-- The two sorting strategies.
+
+naiveSort :: Collator -> [Text] -> [Text]
+naiveSort coll = sortBy (collate coll)
+
+keyedSort :: Collator -> [Text] -> [Text]
+keyedSort coll = sortOn (sortKey coll)
+
+-- ---------------------------------------------------------------------------
+
+main :: IO ()
+main = do
+  let coll = collatorFor "en"
+      n    = 10000
+
+  -- Four data shapes, each a forced list of 10000 Texts.
+  ascii <- evaluate $ force $
+             buildTexts n (1, 20) (rangeGen 0x61 0x7A) (randoms 0x1234)
+  latin <- evaluate $ force $   -- Latin Extended Additional: need NFD
+             buildTexts n (1, 20) (rangeGen 0x1E00 0x1EFF) (randoms 0x5678)
+  cjk   <- evaluate $ force $   -- implicit weights path
+             buildTexts n (1, 20) (rangeGen 0x4E00 0x9FFF) (randoms 0x9ABC)
+  long  <- evaluate $ force $ map withSharedPrefix $
+             buildTexts n (1, 20) (rangeGen 0x61 0x7A) (randoms 0xDEF0)
+
+  defaultMain
+    [ bgroup "ascii"
+        [ bench "naive  (collate)" $ nf (naiveSort coll) ascii
+        , bench "keyed  (sortKey)" $ nf (keyedSort coll) ascii
+        ]
+    , bgroup "latin (needs NFD)"
+        [ bench "naive  (collate)" $ nf (naiveSort coll) latin
+        , bench "keyed  (sortKey)" $ nf (keyedSort coll) latin
+        ]
+    , bgroup "cjk (implicit weights)"
+        [ bench "naive  (collate)" $ nf (naiveSort coll) cjk
+        , bench "keyed  (sortKey)" $ nf (keyedSort coll) cjk
+        ]
+    , bgroup "shared-prefix (worst case)"
+        [ bench "naive  (collate)" $ nf (naiveSort coll) long
+        , bench "keyed  (sortKey)" $ nf (keyedSort coll) long
+        ]
+    ]
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/src/Text/Collate/Collation.hs new/unicode-collation-0.1.3.7/src/Text/Collate/Collation.hs
--- old/unicode-collation-0.1.3.6/src/Text/Collate/Collation.hs 2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/src/Text/Collate/Collation.hs 2001-09-09 03:46:40.000000000 +0200
@@ -19,7 +19,6 @@
  )
 where
 
-import qualified Data.IntSet as IntSet
 import qualified Data.Text as T
 import qualified Data.Text.Read as TR
 import Data.Text (Text)
@@ -219,34 +218,37 @@
 -- @implicitweights 18D00..18D8F; FB00 # Tangut Supplement
 -- @implicitweights 1B170..1B2FF; FB01 # Nushu
 -- @implicitweights 18B00..18CFF; FB02 # Khitan Small Script
+-- from PropList.txt in unicode data:
+isUnifiedIdeograph :: Int -> Bool
+isUnifiedIdeograph cp =
+     (cp >= 0x3400  && cp <= 0x4DBF)
+  || (cp >= 0x4E00  && cp <= 0x9FFC)
+  || (cp >= 0xFA0E  && cp <= 0xFA0F)
+  || cp == 0xFA11
+  || (cp >= 0xFA13  && cp <= 0xFA14)
+  || cp == 0xFA1F
+  || cp == 0xFA21
+  || (cp >= 0xFA23  && cp <= 0xFA24)
+  || (cp >= 0xFA27  && cp <= 0xFA29)
+  || (cp >= 0x20000 && cp <= 0x2A6DD)
+  || (cp >= 0x2A700 && cp <= 0x2B734)
+  || (cp >= 0x2B740 && cp <= 0x2B81D)
+  || (cp >= 0x2B820 && cp <= 0x2CEA1)
+  || (cp >= 0x2CEB0 && cp <= 0x2EBE0)
+  || (cp >= 0x30000 && cp <= 0x3134A)
+
+-- from Blocks.txt in unicode data:
+isCjkCompatibilityIdeograph :: Int -> Bool
+isCjkCompatibilityIdeograph cp = cp >= 0xF900 && cp <= 0xFAFF
+
+isCjkUnifiedIdeograph :: Int -> Bool
+isCjkUnifiedIdeograph cp = cp >= 0x4E00 && cp <= 0x9FFF
+
 calculateImplicitWeight :: Int -> [CollationElement]
 calculateImplicitWeight cp =
   [CollationElement False (fromIntegral aaaa) 0x0020 0x0002 0xFFFF,
    CollationElement False (fromIntegral bbbb) 0 0 0xFFFF]
  where
-  range x y = IntSet.fromList [x..y]
-  singleton = IntSet.singleton
-  union = IntSet.union
-  -- from PropList.txt in unicode data:
-  unifiedIdeographs =    range 0x3400 0x4DBF `union`
-                         range 0x4E00 0x9FFC `union`
-                         range 0xFA0E 0xFA0F `union`
-                         singleton 0xFA11 `union`
-                         range 0xFA13 0xFA14 `union`
-                         singleton 0xFA1F `union`
-                         singleton 0xFA21 `union`
-                         range 0xFA23 0xFA24 `union`
-                         range 0xFA27 0xFA29 `union`
-                         range 0x20000 0x2A6DD `union`
-                         range 0x2A700 0x2B734 `union`
-                         range 0x2B740 0x2B81D `union`
-                         range 0x2B820 0x2CEA1 `union`
-                         range 0x2CEB0 0x2EBE0 `union`
-                         range 0x2CEB0 0x2EBE0 `union`
-                         range 0x30000 0x3134A
-  -- from Blocks.txt in unicode data:
-  cjkCompatibilityIdeographs = range 0xF900 0xFAFF
-  cjkUnifiedIdeographs = range 0x4E00 0x9FFF
   (aaaa, bbbb) =
     case cp of
     _ | cp >= 0x17000 , cp <= 0x18AFF -- Tangut and Tangut Components
@@ -257,14 +259,14 @@
         -> (0xFB01, (cp - 0x1B170) .|. 0x8000)
       | cp >= 0x18B00 , cp <= 0x18CFF -- Khitan Small Script
         -> (0xFB02, (cp - 0x18B00) .|. 0x8000)
-      | cp `IntSet.member` unifiedIdeographs &&
-        (cp `IntSet.member` cjkUnifiedIdeographs ||
-         cp `IntSet.member` cjkCompatibilityIdeographs)  -- Core Han
+      | isUnifiedIdeograph cp &&
+        (isCjkUnifiedIdeograph cp ||
+         isCjkCompatibilityIdeograph cp)  -- Core Han
         -> (0xFB40 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
-      | cp `IntSet.member` unifiedIdeographs -- All Other Han Unified ?
+      | isUnifiedIdeograph cp -- All Other Han Unified ?
         -> (0xFB80 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
       | otherwise
-        -> (0xFBC0 + (cp `shiftR` 15), (cp .&. 0x7FFFF) .|. 0x8000)
+        -> (0xFBC0 + (cp `shiftR` 15), (cp .&. 0x7FFF) .|. 0x8000)
 
 -- | Parse a 'Collation' from a Text in the format of @allkeys.txt@.
 parseCollation :: Text -> Collation
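The changelog describes the `0x7FFFF` to `0x7FFF` mask correction as harmless because of Word16 truncation. The sketch below, using only `base`, checks that claim over the whole code point range. The helpers `firstWeight`, `secondWeight`, and `masksAgree` are hypothetical names for illustration; they mirror the arithmetic in the hunk above but are not the library's API, and the truncation to `Word16` is an assumption taken from the changelog wording.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Word (Word16)

-- First implicit weight for a given base value
-- (0xFB40, 0xFB80 or 0xFBC0 in the hunk above).
firstWeight :: Int -> Int -> Word16
firstWeight base cp = fromIntegral (base + (cp `shiftR` 15))

-- Second implicit weight with a configurable mask, truncated to
-- 16 bits by `fromIntegral`.
secondWeight :: Int -> Int -> Word16
secondWeight mask cp = fromIntegral ((cp .&. mask) .|. 0x8000)

-- After truncation, the old mask (0x7FFFF) and the corrected mask
-- (0x7FFF) agree for every code point: bit 15 is always set by the
-- OR, and bits 16 and above are discarded.
masksAgree :: Bool
masksAgree =
  all (\cp -> secondWeight 0x7FFFF cp == secondWeight 0x7FFF cp)
      [0 .. 0x10FFFF]
```

For example, U+4E00 on the Core Han path yields the pair `(0xFB40, 0xCE00)`, and U+20000 on the other Han path yields `(0xFB84, 0x8000)`.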
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/src/Text/Collate.hs new/unicode-collation-0.1.3.7/src/Text/Collate.hs
--- old/unicode-collation-0.1.3.6/src/Text/Collate.hs   2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/src/Text/Collate.hs   2001-09-09 03:46:40.000000000 +0200
@@ -11,9 +11,9 @@
 instance of 'Collator' (together with the @OverloadedStrings@
 extension):
 
->>> import Data.List (sortBy)
+>>> import Data.List (sortOn)
 >>> import qualified Data.Text.IO as T
->>> mapM_ T.putStrLn $ sortBy (collate "en-US") ["𝒶bc","abC","𝕒bc","Abc","abç","äbc"]
+>>> mapM_ T.putStrLn $ sortOn (sortKey "en-US") ["𝒶bc","abC","𝕒bc","Abc","abç","äbc"]
 abC
 𝒶bc
 𝕒bc
@@ -34,8 +34,13 @@
 𝕒bc
 
 A 'Collator' provides a function 'collate' that compares two texts,
-and a function 'sortKey' that returns the sort key.  Most users will
-just need 'collate'.
+and a function 'sortKey' that returns the sort key.  Use 'collate'
+when you only need to compare two values.  To sort a list, prefer
+@sortOn (sortKey collator)@ (as above) over @sortBy (collate collator)@:
+'collate' recomputes the sort key of each argument on every
+comparison, whereas 'sortOn' (from "Data.List") computes each key only
+once.  This is substantially faster when sorting large lists (in
+benchmarks, roughly 3-5x).
 
 >>> let de = collatorFor "de"
 >>> let se = collatorFor "se"
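The guidance in the hunk above (prefer `sortOn` with a key over `sortBy` with a comparison) can be sketched with `base` alone. In this minimal example the function `key` is a cheap, hypothetical stand-in for an expensive key function such as `sortKey collator`; it is an assumption for illustration and does not perform Unicode collation.

```haskell
import Data.Char (toLower)
import Data.List (sortBy, sortOn)
import Data.Ord (comparing)

-- Hypothetical stand-in for an expensive key function such as
-- `sortKey collator`; real collation keys cost far more to compute.
key :: String -> String
key = map toLower

-- Recomputes `key` for both arguments on every comparison.
naiveSort :: [String] -> [String]
naiveSort = sortBy (comparing key)

-- Computes `key` once per element, sorts on the cached keys, then
-- discards them (decorate, sort, undecorate).
keyedSort :: [String] -> [String]
keyedSort = sortOn key
```

Both functions return the same ordering; they differ only in how often `key` is evaluated, which is where the speedup reported in the documentation change comes from.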
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/test/unit.hs new/unicode-collation-0.1.3.7/test/unit.hs
--- old/unicode-collation-0.1.3.6/test/unit.hs  2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/test/unit.hs  2001-09-09 03:46:40.000000000 +0200
@@ -21,11 +21,13 @@
 main :: IO ()
 main = do
   conformanceTree <- conformanceTests
-  defaultMain (tests conformanceTree)
+  coverageTree <- implicitWeightCoverageTests
+  defaultMain (tests conformanceTree coverageTree)
 
-tests :: TestTree -> TestTree
-tests conformanceTree = testGroup "Tests"
+tests :: TestTree -> TestTree -> TestTree
+tests conformanceTree coverageTree = testGroup "Tests"
   [ conformanceTree
+  , coverageTree
   , testCase "Sorting test 1" $
     sortBy (collate ourCollator) ["hi", "hit", "hít", "hat", "hot",
                        "naïve", "nag", "name"] @?=
@@ -173,6 +175,87 @@
          $ map (conformanceTestWith coll)
               (zip3 (map fst xs) (map snd xs) (drop 1 (map snd xs)))
 
+-- | Guard against the hand-coded ideograph ranges in
+-- 'Text.Collate.Collation' (used by 'calculateImplicitWeight') drifting
+-- out of sync with the shipped Unicode data.  Every code point that the
+-- data marks as an ideograph must receive an /ideographic/ implicit
+-- primary weight; anything not covered by the hand-coded ranges falls to
+-- the generic unassigned bucket at @0xFBC0@ and beyond.  We enumerate the
+-- full set (not just range endpoints) so that adding a new ideograph block
+-- to the data without updating the ranges is caught.
+--
+-- Sources of truth, both in the shipped data:
+--
+--   * @\@implicitweights@ directives in @data\/allkeys.txt@
+--     (Tangut, Nushu, Khitan -> 0xFB00..0xFB02)
+--   * @\<... Ideograph ..., First\/Last\>@ markers in
+--     @data\/UnicodeData.txt@ (CJK Unified + extensions -> 0xFB40\/0xFB80)
+implicitWeightCoverageTests :: IO TestTree
+implicitWeightCoverageTests = do
+  putStrLn "Loading implicit-weight coverage data..."
+  allkeys <- B8.readFile "data/allkeys.txt"
+  ucd     <- B8.readFile "data/UnicodeData.txt"
+  let ranges = implicitWeightDirectives allkeys ++ ideographRanges ucd
+      cps    = concatMap (\(lo, hi) -> [lo .. hi]) ranges
+  return $ testCase "Implicit weights cover all ideographs in shipped data" $
+    case [ (cp, w) | cp <- cps, let w = implicitPrimary cp, w >= 0xFBC0 ] of
+      []  -> return ()
+      bad -> assertFailure $
+               "Code points marked as ideographs in the data receive the "
+            ++ "generic\n(>= 0xFBC0) implicit weight; the hand-coded ranges "
+            ++ "in\nText.Collate.Collation are stale (" ++ show (length bad)
+            ++ " affected):\n"
+            ++ unlines [ printf "  U+%05X -> primary %04X" cp w
+                       | (cp, w) <- take 40 bad ]
+
+-- | Primary weight assigned to a single code point (head of its sort key).
+implicitPrimary :: Int -> Int
+implicitPrimary cp =
+  case sortKey rootCollator (T.singleton (chr cp)) of
+    SortKey (w : _) -> fromIntegral w
+    SortKey []      -> 0
+
+-- | Parse the @\@implicitweights LO..HI; WEIGHT@ directives from
+-- @allkeys.txt@ into inclusive code-point ranges.
+implicitWeightDirectives :: B8.ByteString -> [(Int, Int)]
+implicitWeightDirectives = mapMaybe parseLine . B8.lines
+ where
+  parseLine ln =
+    let t = TE.decodeLatin1 ln
+     in if "@implicitweights" `T.isPrefixOf` t
+           then dotRange (T.takeWhile (/= ';') (T.drop 16 t))
+           else Nothing
+  dotRange t =
+    case T.splitOn ".." (T.strip t) of
+      [a, b] -> (,) <$> hexInt a <*> hexInt b
+      _      -> Nothing
+
+-- | Parse the @\<... Ideograph ..., First\/Last\>@ range markers from
+-- @UnicodeData.txt@ into inclusive code-point ranges.
+ideographRanges :: B8.ByteString -> [(Int, Int)]
+ideographRanges = pairUp . mapMaybe marker . B8.lines
+ where
+  marker ln =
+    case T.splitOn ";" (TE.decodeLatin1 ln) of
+      (cpF : nameF : _)
+        | "Ideograph" `T.isInfixOf` nameF
+        , Just cp <- hexInt cpF
+        -> if      ", First>" `T.isSuffixOf` nameF then Just (cp, False)
+           else if ", Last>"  `T.isSuffixOf` nameF then Just (cp, True)
+           else Nothing
+      _ -> Nothing
+  pairUp ((lo, False) : (hi, True) : rest) = (lo, hi) : pairUp rest
+  pairUp (_ : rest)                        = pairUp rest
+  pairUp []                                = []
+
+-- | Parse a (possibly space-padded) hexadecimal integer, requiring that it
+-- consume the whole input.
+hexInt :: Text -> Maybe Int
+hexInt t =
+  case TR.hexadecimal (T.strip t) of
+    Right (n, rest) | T.null (T.strip rest) -> Just n
+    _                                       -> Nothing
+
 conformanceTestWith :: Collator -> (Int, Text, Text) -> Either String ()
 conformanceTestWith coll (lineNo, txt1, txt2) =
   let showHexes = unwords . map ((\c -> if c > 0xFFFF
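The new test parses `@implicitweights LO..HI; WEIGHT` directives into inclusive code point ranges. As a rough illustration of that parsing step, here is a sketch on plain `String` using only `base`. It is not the test's implementation (which works on `Text` and `ByteString`); `parseHex` and `parseDirective` are hypothetical names, and the sketch assumes the range token contains no internal spaces.

```haskell
import Data.Char (isSpace)
import Numeric (readHex)

-- Parse a hexadecimal integer, requiring that the whole trimmed
-- input is consumed.
parseHex :: String -> Maybe Int
parseHex s =
  case readHex (trim s) of
    [(n, "")] -> Just n
    _         -> Nothing
  where
    trim = dropWhile isSpace . reverse . dropWhile isSpace . reverse

-- Turn a line such as "@implicitweights 17000..18AFF; FB00 # Tangut"
-- into the inclusive range (0x17000, 0x18AFF). Any other line
-- yields Nothing.
parseDirective :: String -> Maybe (Int, Int)
parseDirective ln =
  case words ln of
    ("@implicitweights" : rangeTok : _) ->
      let r         = takeWhile (/= ';') rangeTok
          (a, rest) = break (== '.') r
      in case rest of
           '.' : '.' : b -> (,) <$> parseHex a <*> parseHex b
           _             -> Nothing
    _ -> Nothing
```

Expanding each resulting pair with `[lo .. hi]` gives the full set of code points to check, which is how the test catches a new block that the hand-coded ranges do not cover.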
diff -urN '--exclude=CVS' '--exclude=.cvsignore' '--exclude=.svn' '--exclude=.svnignore' old/unicode-collation-0.1.3.6/unicode-collation.cabal new/unicode-collation-0.1.3.7/unicode-collation.cabal
--- old/unicode-collation-0.1.3.6/unicode-collation.cabal       2001-09-09 03:46:40.000000000 +0200
+++ new/unicode-collation-0.1.3.7/unicode-collation.cabal       2001-09-09 03:46:40.000000000 +0200
@@ -1,6 +1,6 @@
 cabal-version:       2.2
 name:                unicode-collation
-version:             0.1.3.6
+version:             0.1.3.7
 synopsis:            Haskell implementation of the Unicode Collation Algorithm
 description:         This library provides a pure Haskell implementation of
                      the Unicode Collation Algorithm described at
@@ -35,6 +35,7 @@
                      GHC == 9.4.2
                      GHC == 9.6.3
                      GHC == 9.8.1
+                     GHC == 9.10.1
 
 source-repository head
   type:                git
@@ -49,8 +50,12 @@
   Description:         Build the unicode-collate executable.
   Default:             False
 
+flag icu-benchmark
+  Description:         Build the text-icu comparison benchmark (needs ICU C lib).
+  Default:             False
+
 common common-options
-  build-depends:       base >= 4.11 && < 4.20
+  build-depends:       base >= 4.11 && < 4.23
 
   ghc-options:         -Wall
                        -Wcompat
@@ -150,3 +155,18 @@
                      , quickcheck-instances
                      , QuickCheck
   ghc-options:         -rtsopts -with-rtsopts=-A8m
+  if flag(icu-benchmark)
+     buildable:        True
+  else
+     buildable:        False
+
+benchmark bench-sortkey
+  import:              common-options
+  type:                exitcode-stdio-1.0
+  hs-source-dirs:      benchmark
+  main-is:             SortKey.hs
+  build-depends:       tasty-bench
+                     , unicode-collation
+                     , text
+                     , deepseq
+  ghc-options:         -rtsopts -with-rtsopts=-A8m
