Re: [PATCH v2 7/9] grep/pcre: support utf-8

2015-07-11 Thread Plamen Totev
Nguyễn Thái Ngọc Duy pclo...@gmail.com writes:
 In the previous change in this function, we add locale support for 
 single-byte encodings only. It looks like pcre only supports utf-* as 
 multibyte encodings, the others are left in the cold (which is 
 fine). We need to enable PCRE_UTF8 so pcre can parse the string 
 correctly before folding case. 

 if (opt-ignore_case) { 
 p-pcre_tables = pcre_maketables(); 
 + if (is_utf8_locale()) 
 + options |= PCRE_UTF8; 
 options |= PCRE_CASELESS; 
 } 

We need to set the PCRE_UTF8 flag in all cases when the locale is UTF-8
not only when the search is case insensitive.
Otherwise pcre threats the encoding as single byte and if the regex contains
quantifiers it will not work as expected. The quantifier will try to match the
second byte of the multi-byte symbol instead of the whole symbol.

For example lets have file that contains the string

TILRAUN: HALLÓÓÓ HEIMUR!

the following command

git grep -P HALLÓ{3}

will not match the file while 

git grep -P HAL{2}ÓÓÓ

will. That's because the L symbol is a single byte.

Regards,
Plamen Totev

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Git grep does not support multi-byte characters (like UTF-8)

2015-07-07 Thread Plamen Totev
Nguyen, thanks for the help and the patch. Also the escaping suggested by 
Scharfe seems as good choice. But i dig some more into the problem and I found 
some other thing. That's why I replied on the main thread not on the patch. I 
hope you'll excuse me if this is a bad practice.

git grep -i -P also does not works because the PCRE_UTF8 is not set and pcre 
library does not treat the string as UTF-8.

pickaxe search also uses kwsearch so the case insensitive search with it does 
not work (e.g. git log -i -S).  Maybe this is a less of a problem here as one 
is expected to search for exact string (hence knows the case)

There is a interesting corner case. is_fixed treats all patterns containing 
nulls as fixed. So what about if the string contains non-ASCII symbols as well 
as nulls and the search is case insensitive :) I have to admin that my 
knowledge in UTF-8 is not enough to answer the question if this could occur 
during normal usage. For example the second byte in multi-byte symbol is NULL. 
I would guess that's not true as it would break a lot of programs that depend 
on NULL delimited string but it's good if somebody could confirm.

GNU grep indeed uses escaped regular expressions when the string is using 
multi-byte encoding and the search is case insensitive. If the encoding is 
UTF-8 then this strategy could be used in git too. Especially that git already 
have support and helper functions to work with UTF-8. As for the other 
multi-byte encodings - I think the things would become more complicated. As far 
I know in UTF-8 the '{' character for example is two bytes not one. Maybe 
really a support could be added only for the UTF-8 and if the string is not 
UTF-8 to issue a warning.

So maybe the following makes sense when a grep search is performed:
* check if the multi-byte encoding is used. If it's and the search is case 
insensitive and the encoding is not UTF-8 give a warning;
* if pcre is used and the string is UTF-8 encoded set the PCRE_UTF8 flag;
* if the search is case insensitive, the string is fixed and the encoding  used 
is UTF-8 use regcomp instead of kwsearch and escape any regex special 
characters in the pattern;

And the question with the behavior of pickaxe search remains open. Using kwset 
does not work with case insensitive non-ASCII searches. Instead of fixing 
grep.c maybe it's better if new function is introduced that performs keyword 
searches so it could be used by both grep, diffcore-pickaxe and any other code 
in the future that may require such functionality. Or maybe diffcore-pickaxe 
should use grep instead of directly kwset/regcomp

Regards,
Plamen Totev



 Оригинално писмо  
От: Duy Nguyen pclo...@gmail.com 
Относно: Re: Git grep does not support multi-byte characters (like UTF-8) 
До: Plamen Totev plamen.to...@abv.bg 
Изпратено на: 06.07.2015 15:23 

 I think we over-optimized a bit. If you your system provides regex 
 with locale support (e.g. Linux) and you don't explicitly use fallback 
 regex implementation, it should work. I suppose your test patterns 
 look fixed (i.e. no regex special characters)? Can you try just add 
 . and see if case insensitive matching works? 

This is indeed the problem. When I added the . the matching works just fine.
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Git grep does not support multi-byte characters (like UTF-8)

2015-07-07 Thread Plamen Totev
Junio C Hamano gits...@pobox.com writes:

 Plamen Totev plamen.to...@abv.bg writes: 
 
  pickaxe search also uses kwsearch so the case insensitive search with 
  it does not work (e.g. git log -i -S). Maybe this is a less of a 
  problem here as one is expected to search for exact string (hence 
  knows the case) 
 
 You reasoned correctly, I think. Pickaxe, as one of the building 
 blocks to implement Linus's ultimate change tracking tool [*1*], 
 should never pay attention to -i. It is a step to finding the 
 commit that touches the exact code block given (i.e. how do you 
 drill down? part of $gmane/217 message). 
 
 Thanks. 
 
 [Footnote] 
 *1* http://article.gmane.org/gmane.comp.version-control.git/217

Now that I read the link again and gave the matter a thought I'm not so sure.
In some contexts the case of the words does not matter. In SQL for example.
Let's consider a SQL script file that contains the following line:

select name, address from customers;

At some point we decide to change the coding style to:

SELECT name, address FROM customers;

What should pickaxe search return - the first commit where the line is 
introduced
or the commit with the refactoring? From this point of view I think the -i 
switch makes sense.
The SQL is not the only case insensitive language - BASIC and Pascal come into 
my mind 
(those two I was using while I was in the high school :)).

Also I think it makes sense (maybe even more?) for natural languages.
For example after editing a text a sentence could be split into two.
Then the first word of the second sentence may change its case.
Of course the natural languages always  complicate the things a bit.
An ultimate tracking tools should be able to handle typo fixes, punctuation 
changes, etc.

But I'm getting a bit off-topic. What I wanted to say is that in some contexts 
it makes sense
(at least to me) to have case insensitive pickaxe search.
Or I'm missing something and there is a better tools to use is such cases?

Regards,
Plamen Totev
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns

2015-07-07 Thread Plamen Totev

On 07.07. 2015 at 02:02, Duy Nguyen pclo...@gmail.com wrote: 
 On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe l@web.de wrote: 
  Am 06.07.2015 um 14:42 schrieb Nguyễn Thái Ngọc Duy: 

  So the optimization before this patch was that if a string was searched for 
  without -F then it would be treated as a fixed string anyway unless it 
  contained regex special characters. Searching for fixed strings using the 
  kwset functions is faster than using regcomp and regexec, which makes the 
  exercise worthwhile. 
  
  Your patch disables the optimization if non-ASCII characters are searched 
  for because kwset handles case transformations only for ASCII chars. 
  
  Another consequence of this limitation is that -Fi (explicit 
  case-insensitive fixed-string search) doesn't work properly with non-ASCII 
  chars neither. How can we handle this one? Fall back to regcomp by 
  escaping all special characters? Or at least warn? 
 
 Hehe.. I noticed it too shortly after sending the patch. I was torn 
 between simply documenting the limitation and waiting for the next 
 person to come and fix it, or quoting the regex then passing to 
 regcomp. GNU grep does the quoting in this case, but that code is 
 GPLv3 so we can't simply copy over. It could be a problem if we need 
 to quote a regex in a multibyte charset where ascii is not a subset. 
 But i guess we can just go with utf-8.. 

I played a little bit with the code and I came up with this function to escape
regular expressions in  utf-8. Hope it helps.

static void escape_regexp(const char *pattern, size_t len,
                char **new_pattern, size_t *new_len)
{
        const char *p = pattern;
        char *np = *new_pattern = xmalloc(2 * len);
        int chrlen;
        *new_len = len;

        while (len) {
                chrlen = mbs_chrlen(p, len, utf-8);
                if (chrlen == 1  is_regex_special(*pattern))
                        *np++ = '\\';

                memcpy(np, pattern, chrlen);
                np += chrlen;
                pattern = p;
        }

        *new_len = np - *new_pattern;
}

--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Git grep does not support multi-byte characters (like UTF-8)

2015-07-06 Thread Plamen Totev
Hello, 

It looks like the git grep command does not support multi-byte character sets 
like UTF-8. As a result some of the grep functionality is not working. For 
example if you search for non Latin words the ignore case flag does not have 
effect(the search is case sensitive). I suspect there are some regular 
expressions that will not work as expected too. 

When I'm using the git from the shell I could use the GNU grep utility instead. 
But some tools like gitweb use git grep so they are not working properly with 
UTF-8 files with non Latin characters. 

Much of the grep code seems to be copied from the GNU grep utility but the 
multi-byte support is left out. I just wondered what could be the reason. 

Regards, 
Plamen Totev
--
To unsubscribe from this list: send the line unsubscribe git in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html