From f6d38345d4aab160096c46689a9e79d5487e7438 Mon Sep 17 00:00:00 2001
From: Reuben Thomas <rrt@sc3d.org>
Date: Mon, 8 Mar 2010 15:41:28 +0000
Subject: [PATCH] Merge TODO with TODO section of
 http://www.gnu.org/software/grep/devel.html and remove done items.
 Some small bits of tidying also.

---
 TODO |  372 ++++++++++++++++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 306 insertions(+), 66 deletions(-)

diff --git a/TODO b/TODO
index f9936c1..62e302e 100644
--- a/TODO
+++ b/TODO
@@ -6,16 +6,12 @@
 
 Get sane performance with UTF-8 locales.
 
-1) rewrite the configure.in script, perhaps also Makefile.am
-2) set up for gnulib-tool --import
-3) improve the test infrastructure
-4) check in the patches for the sync of dfa.c with GNU awk
-5) other small patches which wait for a test case
-6) process the Fedora/Red Hat patches
-7) some _minimal_ cleanup of the grep(), grepdir(), recursion
-   (the "main loop") and fix --directories=read
+Improve the test infrastructure.
 
-##
+Other small patches which wait for a test case.
+
+Some _minimal_ cleanup of the grep(), grepdir(), recursion (the "main
+loop") and fix --directories=read
 
 Write better Texinfo documentation for grep.  The manual page would be a
 good place to start, but Info documents are also supposed to contain a
@@ -28,77 +24,321 @@ Improve the performance of the regex backtracking matcher.  This matcher
 is agonizingly slow, and is responsible for grep sometimes being slower
 than Unix grep when backreferences are used.
 
-Provide support for the Posix [= =] and [. .] constructs.  This is
-difficult because it requires locale-dependent details of the character
-set and collating sequence, but Posix does not standardize any method
-for accessing this information!
+Some test in tests/spencer2.tests should have failed!
+Need to filter out some bugs in dfa.[ch]/regex.[ch].
 
-##
+Threads for grep?
 
-Some test in tests/spencer2.tests should have failed !!!
-Need to filter out some bugs in dfa.[ch]/regex.[ch].
+GNU grep does 32-bit arithmetic, it needs to move to 64-bit.
 
-Threads for grep ?
+Clean up, too many #ifdefs!
 
-GNU grep does 32 bits arithmetic, it needs to move to 64.
+Check some new algorithms for matching; talk to Karl Berry and Nelson.
+Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
+claim that his algorithm is faster than Boyer-More. Worth checking.
 
-Clean up, too many #ifdef's !!
+Lazy dynamic linking of libpcre, libz, and libbz2?
 
-Check some new Algorithms for matching, talk to Karl Berry and Nelson.
-Sunday's "Quick Search" Algorithm (CACM 33, 8 August 1990 pp. 132-142)
-claim that his algo. is faster then Boyer-More ????
-Worth Checking.
+Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
+binary. Is there a possibility of doing even better by automatically
+checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
+0x9D for compress, and 0x42 0x5A 0x68 for bzip2)?
+
+##
 
 Check <http://tony.abou-assaleh.net/greps.html>.
-Take a look at:
-   -- cgrep (Context grep) seems like nice work;
-   -- sgrep (Struct grep);
-   -- agrep (Approximate grep), from glimpse;
-   -- nr-grep (Nondeterministic reverse grep);
-   -- ggrep (Grouse grep);
-   -- grep.py (Python grep);
-   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep);
-   -- freegrep <http://www.vocito.com/downloads/software/grep/>;
-   -- ja-grep's mlb2 patch (Japanese grep);
-   -- pcregrep (from Perl-Compatible Regular Expressions library).
-Can we merge ?
-
-Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one binary.
-
-POSIX Compliance see p10003.x
+Take a look at these and consider opportunities
+for merging or cloning:
+
+   -- ja-grep's mlb2 patch (Japanese grep)
+      <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
+   -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
+      <http://www.ff.iij4u.or.jp/~nrt/lv/>;
+   -- pcregrep (from Perl-Compatible Regular Expressions library)
+      <http://www.pcre.org/>;
+   -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
+      seems like nice work;
+   -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
+   -- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
+      from glimpse;
+   -- nr-grep (Nondeterministic reverse grep)
+      <http://www.dcc.uchile.cl/~gnavarro/software/>;
+   -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
+   -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
+   -- freegrep (a BSD-licensed grep for those who can't stand the GNU GPL)
+      <http://www.vocito.com/downloads/software/grep/>;
+
+##
+
+POSIX Compliance: see p10003.x
+
+In general, interesting things to check in POSIX/OpenGroup include:
+
+Provide support for the POSIX [= =] and [. .] constructs. This is
+difficult because it requires locale-dependent details of the
+character set and collating sequence, but POSIX does not standardize
+any method for accessing this information!
 
 Moving away from GNU regex API for POSIX regex API.
 
-Better and faster !!
+##
+
+POSIX and --ignore-case
+
+For this issue, interesting things to check in POSIX include the
+Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in
+particular Section “Regular Expression General Requirements” and its
+paragraph about caseless matching (note that this may not have been
+fully thought through and that this text may be self-contradicting
+[specifically: “of either data or patterns” versus all the rest]).
+
+In particular, consider the following with POSIX's approach to case
+folding in mind. Assume a non-Turkic locale with a character
+repertoire reduced to the following various forms of “LATIN LETTER I”:
+
+0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
+0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
+0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
+0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
+
+First note the differing UTF-8 octet lengths of U+0049 (0x49) and
+U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
+implies that whole UTF-8 strings cannot be case-converted in place,
+using the same memory buffer, and that the needed octet-size of the
+new buffer cannot merely be guessed (although there's a simple upper
+bound of six times the size of the input, as the longest UTF-8
+encoding of any character is six bytes).
+
+We have
+
+lc(I) = i, uc(I) = I
+lc(i) = i, uc(i) = I
+lc(İ) = i, uc(İ) = İ
+lc(ı) = ı, uc(ı) = I
+
+where lc() and uc() denote lower-case and upper-case conversions.
+
+There are several candidate --ignore-case logics (including the one
+mandated by POSIX):
+
+Using the
+
+if (lc(input_wchar) == lc(pattern_wchar))
+
+logic leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  Y  n
+"i" |  Y  Y  Y  n
+"İ" |  Y  Y  Y  n
+"ı" |  n  n  n  Y
+
+There is a lack of symmetry between CAPITAL and SMALL LETTERs with
+this.
+
+Using the
+
+if (uc(input_wchar) == uc(pattern_wchar))
+
+logic leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  n  Y
+"i" |  Y  Y  n  Y
+"İ" |  n  n  Y  n
+"ı" |  Y  Y  n  Y
+
+There is a lack of symmetry between CAPITAL and SMALL LETTERs with
+this.
+
+Using the
+
+if (   lc(input_wchar) == lc(pattern_wchar)
+|| uc(input_wchar) == uc(pattern_wchar))
+
+logic leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  Y  Y
+"i" |  Y  Y  Y  Y
+"İ" |  Y  Y  Y  n
+"ı" |  Y  Y  n  Y
+
+There is some elegance and symmetry with this. But there are
+potentially two conversions to be made per input character. If the
+pattern is pre-converted, two copies of it need to be kept and used in
+a mutually coherent fashion.
+
+Using the
+
+if (      input_wchar  == pattern_wchar
+|| lc(input_wchar) == pattern_wchar
+|| uc(input_wchar) == pattern_wchar)
+
+logic (as mandated by POSIX) leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  n  Y
+"i" |  Y  Y  Y  n
+"İ" |  n  n  Y  n
+"ı" |  n  n  n  Y
+
+There is a different CAPITAL/SMALL symmetry with this. But there's
+also a loss of pattern/input symmetry that's unique to it. Also there
+are potentially two conversions to be made per input character.
+
+Using the
+
+if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
+
+
+logic leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  Y  Y
+"i" |  Y  Y  Y  Y
+"İ" |  Y  Y  Y  Y
+"ı" |  Y  Y  Y  Y
+
+This shows total symmetry and transitivity
+(at least in this example analysis).
+There are two conversions to be made per input character,
+but support could be added for having
+a single straight mapping performing
+a composition of the two conversions.
+
+Any optimization in the implementation of each logic
+must not change its basic semantic.
 
 ##
 
-Check POSIX:
-   -- Volume "Base Definitions (XBD)",
-      Chapter "Regular Expressions"
-      and in particular
-      Section "Regular Expression General Requirements"
-      and its paragraph about caseless matching.
-
-Check the Unicode Standard:
-   -- Chapter 3 ("Conformance"),
-      Section 3.13 ("Default Case Operations")
-      and the toCasefold() case conversion operation;
-   -- Chapter 4 ("Character Properties"),
-      Section 4.2 ("Case -- Normative")
-      and the SpecialCasing.txt and CaseFolding.txt
-      files from the Unicode Database;
-   -- Chapter 5 ("Implementation Guidelines"),
-      Section 5.18 ("Case Mappings"),
-      Subsection "Caseless Matching".
-
-Check Unicode Technical Standard #18 ("Unicode Regular Expressions").
-Check Unicode Standard Annex #15 ("Unicode Normalization Forms").
+In general, interesting things to check in Unicode include:
+
+The <http://www.unicode.org/standard/standard.html> Unicode Standard.
+
+Unicode Technical Standard #18 (<http://www.unicode.org/reports/tr18/>
+“Unicode Regular Expressions”).
+
+Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
+“Unicode Normalization Forms”).
 
 ##
 
-Before every release:
-   -- drop dfa.[ch] into a copy of gawk and run "make check";
-   -- send pot file to the Translation Project to get fresh po files;
-   -- get up-to-date version of ABOUT-NLS;
-   -- update NEWS.
+Unicode and --ignore-case
+
+For this issue, interesting things to check in Unicode include:
+
+  -- The Unicode Standard, Chapter 3
+     (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
+     “Conformance”), Section 3.13 (“Default Case Operations”) and the
+     toCasefold() case conversion operation.
+
+  -- The Unicode Standard, Chapter 4
+     (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
+     “Character Properties”), Section 4.2 (“Case—Normative”) and
+     the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
+     SpecialCasing.txt and
+     <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
+     CaseFolding.txt files from the
+     <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
+     Character Database .
+
+The <http://www.unicode.org/standard/standard.html> Unicode Standard,
+Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
+Implementation Guidelines ”), Section 5.18 (“Case Mappings”),
+Subsection “Caseless Matching”.
+
+The Unicode <http://www.unicode.org/charts/case/> case charts.
+
+Unicode uses the
+
+if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
+
+logic for caseless matching. Let's consider the “LATIN LETTER I”
+example mentioned above. In a non-Turkic locale, simple case folding
+yields
+
+toCasefold_simple(U+0049) = U+0069
+toCasefold_simple(U+0069) = U+0069
+toCasefold_simple(U+0130) = U+0130
+toCasefold_simple(U+0131) = U+0131
+
+which leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  n  n
+"i" |  Y  Y  n  n
+"İ" |  n  n  Y  n
+"ı" |  n  n  n  Y
+
+This is different from anything so far!
+
+In a non-Turkic locale, full case folding yields
+
+toCasefold_full(U+0049) = U+0069
+toCasefold_full(U+0069) = U+0069
+toCasefold_full(U+0130) = <U+0069, U+0307>
+toCasefold_full(U+0131) = U+0131
+
+with
+
+0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
+
+which leads to the following matches:
+
+  \in  I  i  İ  ı
+pat\   ----------
+"I" |  Y  Y  *  n
+"i" |  Y  Y  *  n
+"İ" |  n  n  Y  n
+"ı" |  n  n  n  Y
+
+This is just sad!
+
+Note that having toCasefold(U+0131), simple or full, map to itself
+instead of U+0069 is in contradiction with the rules of Section 5.18
+of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
+thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
+U+0069. The justification for the weird toCasefold_full(U+0130)
+mapping is unknown; it doesn't even make sense to add a dot (U+0307)
+to a letter that already has one (U+0069). It would have been so
+simple to put them all in the same equivalence class!
+
+Otherwise, also consider the following problem with Unicode's approach
+on case folding in mind. Assume that we want to perform
+
+echo 'AßBC | grep -i 'Sb'
+
+which corresponds to
+
+input:    U+0041 U+00DF U+0042 U+0043 U+000A
+pattern:  U+0053 U+0062
+
+Following “CaseFolding-4.1.0.txt”, applying the toCasefold()
+transformation to these yields
+
+input:    U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
+pattern:                U+0073 U+0062
+
+so, according to this approach, the input should match the pattern. As
+long as the original input line is to be reported to the user as a
+whole, there is no problem (from the user's point-of-view;
+implementation is complicated by this).
+
+However, consider both these GNU extensions:
+
+echo 'AßBC' | grep -i --only-matching 'Sb'
+echo 'AßBC' | grep -i --color=always  'Sb'
+
+What is to be reported in these cases, since the match begins in the
+*middle* of the original input character 'ß'?
+
+Note that Unicode's toCasefold() cannot be implemented in terms of
+POSIX' towctrans() since that can only return a single wint_t value
+per input wint_t value.
-- 
1.6.3.3

