Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns

2015-07-07 Thread Plamen Totev

On 07.07.2015 at 02:02, Duy Nguyen <pclo...@gmail.com> wrote:
> On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe <l@web.de> wrote:
>> On 06.07.2015 at 14:42, Nguyễn Thái Ngọc Duy wrote:
>>
>> So the optimization before this patch was that if a string was searched for
>> without -F then it would be treated as a fixed string anyway unless it
>> contained regex special characters. Searching for fixed strings using the
>> kwset functions is faster than using regcomp and regexec, which makes the
>> exercise worthwhile.
>>
>> Your patch disables the optimization if non-ASCII characters are searched
>> for because kwset handles case transformations only for ASCII chars.
>>
>> Another consequence of this limitation is that -Fi (explicit
>> case-insensitive fixed-string search) doesn't work properly with non-ASCII
>> chars either. How can we handle this one? Fall back to regcomp by
>> escaping all special characters? Or at least warn?
>
> Hehe.. I noticed it too shortly after sending the patch. I was torn
> between simply documenting the limitation and waiting for the next
> person to come and fix it, or quoting the regex then passing it to
> regcomp. GNU grep does the quoting in this case, but that code is
> GPLv3 so we can't simply copy it over. It could be a problem if we need
> to quote a regex in a multibyte charset where ASCII is not a subset.
> But I guess we can just go with UTF-8..

I played a little bit with the code and came up with this function to escape
regular expressions in UTF-8. Hope it helps.

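static void escape_regexp(const char *pattern, size_t len,
                char **new_pattern, size_t *new_len)
{
        const char *p = pattern;
        /* Worst case: every byte is special and needs a backslash. */
        char *np = *new_pattern = xmalloc(2 * len);
        int chrlen;
        *new_len = len;

        while (len) {
                /* Advances p past one character and shrinks len to match. */
                chrlen = mbs_chrlen(&p, &len, "utf-8");
                /* Only single-byte (ASCII) characters can be regex specials. */
                if (chrlen == 1 && is_regex_special(*pattern))
                        *np++ = '\\';

                memcpy(np, pattern, chrlen);
                np += chrlen;
                pattern = p;
        }

        *new_len = np - *new_pattern;
}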
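
For experimenting outside the git tree, here is a minimal standalone sketch of
the same idea: quote the pattern, then hand it to regcomp() with REG_ICASE.
It is not the function above. The names quote_fixed_pattern(),
fixed_icase_match() and is_ere_special() are made up for illustration, and it
uses plain malloc() instead of xmalloc(). Two assumptions are baked in: the
pattern is UTF-8 (every byte of a multi-byte sequence has the high bit set, so
escaping byte by byte is safe), and the regex is compiled with REG_EXTENDED,
because in a basic regex a backslash before "(" or "{" would switch special
meaning on rather than off.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>
#include <string.h>

/* Characters with special meaning in a POSIX extended regular expression. */
static int is_ere_special(char c)
{
	return c != '\0' && strchr(".[\\()*+?{|^$", c) != NULL;
}

/*
 * Return a newly allocated, NUL-terminated copy of pattern in which every
 * ERE special character is preceded by a backslash. Bytes >= 0x80 are
 * copied unchanged (assumes UTF-8 input). The caller frees the result.
 */
static char *quote_fixed_pattern(const char *pattern, size_t len)
{
	char *out, *np;
	size_t i;

	assert(pattern != NULL);
	out = np = malloc(2 * len + 1);	/* worst case: every byte escaped */
	if (!out)
		return NULL;
	for (i = 0; i < len; i++) {
		if (is_ere_special(pattern[i]))
			*np++ = '\\';
		*np++ = pattern[i];
	}
	*np = '\0';
	return out;
}

/*
 * Case-insensitive fixed-string search built on regcomp()/regexec().
 * Returns 1 on match, 0 on no match, -1 on error.
 */
static int fixed_icase_match(const char *needle, const char *haystack)
{
	regex_t re;
	char *quoted = quote_fixed_pattern(needle, strlen(needle));
	int ret;

	if (!quoted)
		return -1;
	if (regcomp(&re, quoted, REG_EXTENDED | REG_ICASE | REG_NOSUB)) {
		free(quoted);
		return -1;
	}
	ret = regexec(&re, haystack, 0, NULL, 0) == 0;
	regfree(&re);
	free(quoted);
	return ret;
}
```

With this, fixed_icase_match("a.b", "xxA.Bxx") matches while
fixed_icase_match("a.b", "xxAxBxx") does not, since the dot is taken
literally. Whether non-ASCII letters fold correctly still depends on the
system regcomp() and the active locale.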

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns

2015-07-06 Thread René Scharfe

On 06.07.2015 at 14:42, Nguyễn Thái Ngọc Duy wrote:
> Noticed-by: Plamen Totev <plamen.to...@abv.bg>
> Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>
> ---
>  grep.c | 14 +++---
>  1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/grep.c b/grep.c
> index b58c7c6..48db15a 100644
> --- a/grep.c
> +++ b/grep.c
> @@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
>  }
>  #endif /* !USE_LIBPCRE */
>
> -static int is_fixed(const char *s, size_t len)
> +static int is_fixed(const char *s, size_t len, int ignore_icase)
>  {
>  	size_t i;
>
> @@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
>  	for (i = 0; i < len; i++) {
>  		if (is_regex_special(s[i]))
>  			return 0;
> +		/*
> +		 * The builtin substring search can only deal with case
> +		 * insensitivity in ascii range. If there is something outside
> +		 * of that range, fall back to regcomp.
> +		 */
> +		if (ignore_icase && (unsigned char)s[i] >= 128)
> +			return 0;


How about isascii(s[i])?


>  	}
>
>  	return 1;
> @@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
>
>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>  {
> +	int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
>  	int err;
>
>  	p->word_regexp = opt->word_regexp;
>  	p->ignore_case = opt->ignore_case;

Using p->ignore_case before this line, as in initialization of the new
variable ignore_icase above, changes the meaning.

> -	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
> +	if (opt->fixed || is_fixed(p->pattern, p->patternlen, ignore_icase))
>  		p->fixed = 1;
>  	else
>  		p->fixed = 0;
>
>  	if (p->fixed) {
> -		if (opt->regflags & REG_ICASE || p->ignore_case)
> +		if (ignore_case)


ignore_icase instead?  ignore_case is for the config variable 
core.ignorecase.  Tricky.



>  			p->kws = kwsalloc(tolower_trans_tbl);
>  		else
>  			p->kws = kwsalloc(NULL);



So the optimization before this patch was that if a string was searched 
for without -F then it would be treated as a fixed string anyway unless 
it contained regex special characters.  Searching for fixed strings 
using the kwset functions is faster than using regcomp and regexec, 
which makes the exercise worthwhile.


Your patch disables the optimization if non-ASCII characters are 
searched for because kwset handles case transformations only for ASCII 
chars.
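
To make the limitation concrete, here is a small sketch of what a byte-wise,
ASCII-only case fold does to a UTF-8 pattern. It is a naive stand-in for
illustration, not the kwset code; ascii_fold() and ascii_icase_contains() are
made-up names. The point is only that "É" (bytes C3 89) and "é" (bytes C3 A9)
differ in a byte that an ASCII translation table leaves untouched.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Fold one byte the way an ASCII-only translation table would. */
static unsigned char ascii_fold(unsigned char c)
{
	return (c >= 'A' && c <= 'Z') ? c + ('a' - 'A') : c;
}

/*
 * Naive case-insensitive substring search that folds bytes in the ASCII
 * range only. Returns 1 if needle occurs in haystack, 0 otherwise.
 */
static int ascii_icase_contains(const char *haystack, const char *needle)
{
	size_t hlen, nlen, i, j;

	assert(haystack != NULL && needle != NULL);
	hlen = strlen(haystack);
	nlen = strlen(needle);
	if (nlen > hlen)
		return 0;
	for (i = 0; i + nlen <= hlen; i++) {
		for (j = 0; j < nlen; j++) {
			unsigned char h = (unsigned char)haystack[i + j];
			unsigned char n = (unsigned char)needle[j];
			if (ascii_fold(h) != ascii_fold(n))
				break;
		}
		if (j == nlen)
			return 1;
	}
	return 0;
}
```

Searching "Hello World" for "WORLD" succeeds, but searching "Été" for "été"
fails, which is the behaviour the patch avoids by leaving such patterns to
regcomp.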


Another consequence of this limitation is that -Fi (explicit
case-insensitive fixed-string search) doesn't work properly with
non-ASCII chars either.  How can we handle this one?  Fall back to
regcomp by escaping all special characters?  Or at least warn?


Tests would be nice. :)

René



Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns

2015-07-06 Thread Duy Nguyen
On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe <l@web.de> wrote:
> On 06.07.2015 at 14:42, Nguyễn Thái Ngọc Duy wrote:
>> Noticed-by: Plamen Totev <plamen.to...@abv.bg>
>> Signed-off-by: Nguyễn Thái Ngọc Duy <pclo...@gmail.com>
>> ---
>>  grep.c | 14 +++---
>>  1 file changed, 11 insertions(+), 3 deletions(-)
>>
>> diff --git a/grep.c b/grep.c
>> index b58c7c6..48db15a 100644
>> --- a/grep.c
>> +++ b/grep.c
>> @@ -378,7 +378,7 @@ static void free_pcre_regexp(struct grep_pat *p)
>>  }
>>  #endif /* !USE_LIBPCRE */
>>
>> -static int is_fixed(const char *s, size_t len)
>> +static int is_fixed(const char *s, size_t len, int ignore_icase)
>>  {
>>  	size_t i;
>>
>> @@ -391,6 +391,13 @@ static int is_fixed(const char *s, size_t len)
>>  	for (i = 0; i < len; i++) {
>>  		if (is_regex_special(s[i]))
>>  			return 0;
>> +		/*
>> +		 * The builtin substring search can only deal with case
>> +		 * insensitivity in ascii range. If there is something outside
>> +		 * of that range, fall back to regcomp.
>> +		 */
>> +		if (ignore_icase && (unsigned char)s[i] >= 128)
>> +			return 0;
>
> How about isascii(s[i])?

Yes, better.


>>  	}
>>
>>  	return 1;
>> @@ -398,18 +405,19 @@ static int is_fixed(const char *s, size_t len)
>>
>>  static void compile_regexp(struct grep_pat *p, struct grep_opt *opt)
>>  {
>> +	int ignore_icase = opt->regflags & REG_ICASE || p->ignore_case;
>>  	int err;
>>
>>  	p->word_regexp = opt->word_regexp;
>>  	p->ignore_case = opt->ignore_case;
>
> Using p->ignore_case before this line, as in initialization of the new
> variable ignore_icase above, changes the meaning.

Oops.

>> -	if (opt->fixed || is_fixed(p->pattern, p->patternlen))
>> +	if (opt->fixed || is_fixed(p->pattern, p->patternlen, ignore_icase))
>>  		p->fixed = 1;
>>  	else
>>  		p->fixed = 0;
>>
>>  	if (p->fixed) {
>> -		if (opt->regflags & REG_ICASE || p->ignore_case)
>> +		if (ignore_case)
>
> ignore_icase instead?  ignore_case is for the config variable
> core.ignorecase.  Tricky.

Maybe we can test isascii separately and save the result in
has_non_ascii, then we can avoid ignore_(i)case
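
A rough sketch of that split, under stated assumptions: has_non_ascii() is the
name floated above, while has_regex_special(), choose_matcher() and the
matcher enum are hypothetical stand-ins for the decision compile_regexp()
makes, not the actual grep.c code. The byte comparison against 0x80 plays the
role of isascii().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum matcher { MATCH_KWSET, MATCH_REGCOMP };

/* Return 1 if any byte of s lies outside the ASCII range. */
static int has_non_ascii(const char *s, size_t len)
{
	size_t i;

	assert(s != NULL);
	for (i = 0; i < len; i++)
		if ((unsigned char)s[i] >= 0x80)
			return 1;
	return 0;
}

/* Hypothetical stand-in for is_regex_special() applied to a whole string. */
static int has_regex_special(const char *s, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (s[i] != '\0' && strchr(".[\\()*+?{|^$", s[i]) != NULL)
			return 1;
	return 0;
}

/*
 * Pick the search backend. fixed_opt mirrors -F, icase mirrors -i.
 * Note: with fixed_opt set, falling back to regcomp would also require
 * quoting the pattern first; the thread leaves that part open.
 */
static enum matcher choose_matcher(const char *pattern, size_t len,
				   int fixed_opt, int icase)
{
	/* ASCII-only case folding cannot handle this pattern. */
	if (icase && has_non_ascii(pattern, len))
		return MATCH_REGCOMP;
	if (fixed_opt || !has_regex_special(pattern, len))
		return MATCH_KWSET;
	return MATCH_REGCOMP;
}
```

Keeping the non-ASCII test out of is_fixed() means is_fixed() needs no extra
parameter, and the ignore_case / ignore_icase confusion never arises.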


>>  			p->kws = kwsalloc(tolower_trans_tbl);
>>  		else
>>  			p->kws = kwsalloc(NULL);
>
> So the optimization before this patch was that if a string was searched for
> without -F then it would be treated as a fixed string anyway unless it
> contained regex special characters.  Searching for fixed strings using the
> kwset functions is faster than using regcomp and regexec, which makes the
> exercise worthwhile.
>
> Your patch disables the optimization if non-ASCII characters are searched
> for because kwset handles case transformations only for ASCII chars.
>
> Another consequence of this limitation is that -Fi (explicit
> case-insensitive fixed-string search) doesn't work properly with non-ASCII
> chars either.  How can we handle this one?  Fall back to regcomp by
> escaping all special characters?  Or at least warn?

Hehe.. I noticed it too shortly after sending the patch. I was torn
between simply documenting the limitation and waiting for the next
person to come and fix it, or quoting the regex then passing it to
regcomp. GNU grep does the quoting in this case, but that code is
GPLv3 so we can't simply copy it over. It could be a problem if we need
to quote a regex in a multibyte charset where ASCII is not a subset.
But I guess we can just go with UTF-8..

> Tests would be nice. :)

Yeah.. but we now rely on system regcomp which may behave differently
across platforms. Then we need some locale to be always there. Some
platforms (like Gentoo) even allow building glibc without i18n.. So
I'm not sure how we know when to test or skip.
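
One possible way to decide "test or skip" is a tiny runtime probe, sketched
below as an assumption rather than anything in the git test suite (which is
shell based). select_utf8_locale() and regcomp_folds_non_ascii() are made-up
names, and the candidate locale names are guesses that will not exist on
every system.

```c
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/*
 * Try a few common UTF-8 locale names (assumed, not exhaustive).
 * Returns 1 if one could be selected for LC_CTYPE, 0 otherwise.
 */
static int select_utf8_locale(void)
{
	static const char *const candidates[] = {
		"C.UTF-8", "en_US.UTF-8", "en_US.utf8"
	};
	size_t i;

	for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++)
		if (setlocale(LC_CTYPE, candidates[i]) != NULL)
			return 1;
	return 0;
}

/*
 * Returns 1 if the system regcomp() with REG_ICASE matches "É" against
 * the pattern "é" in the current locale, 0 otherwise.
 */
static int regcomp_folds_non_ascii(void)
{
	regex_t re;
	int ok;

	if (regcomp(&re, "\xc3\xa9", REG_ICASE | REG_NOSUB))
		return 0;
	ok = regexec(&re, "\xc3\x89", 0, NULL, 0) == 0;
	regfree(&re);
	return ok;
}
```

A test could then be skipped unless both select_utf8_locale() and
regcomp_folds_non_ascii() return 1, so platforms without i18n support
would not report spurious failures.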
-- 
Duy