Re: [PATCH v2 7/9] grep/pcre: support utf-8
Nguyễn Thái Ngọc Duy pclo...@gmail.com writes: In the previous change in this function, we add locale support for single-byte encodings only. It looks like pcre only supports utf-* as multibyte encodings, the others are left in the cold (which is fine). We need to enable PCRE_UTF8 so pcre can parse the string correctly before folding case. if (opt-ignore_case) { p-pcre_tables = pcre_maketables(); + if (is_utf8_locale()) + options |= PCRE_UTF8; options |= PCRE_CASELESS; } We need to set the PCRE_UTF8 flag in all cases when the locale is UTF-8 not only when the search is case insensitive. Otherwise pcre threats the encoding as single byte and if the regex contains quantifiers it will not work as expected. The quantifier will try to match the second byte of the multi-byte symbol instead of the whole symbol. For example lets have file that contains the string TILRAUN: HALLÓÓÓ HEIMUR! the following command git grep -P HALLÓ{3} will not match the file while git grep -P HAL{2}ÓÓÓ will. That's because the L symbol is a single byte. Regards, Plamen Totev -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Git grep does not support multi-byte characters (like UTF-8)
Nguyen, thanks for the help and the patch. Also the escaping suggested by Scharfe seems as good choice. But i dig some more into the problem and I found some other thing. That's why I replied on the main thread not on the patch. I hope you'll excuse me if this is a bad practice. git grep -i -P also does not works because the PCRE_UTF8 is not set and pcre library does not treat the string as UTF-8. pickaxe search also uses kwsearch so the case insensitive search with it does not work (e.g. git log -i -S). Maybe this is a less of a problem here as one is expected to search for exact string (hence knows the case) There is a interesting corner case. is_fixed treats all patterns containing nulls as fixed. So what about if the string contains non-ASCII symbols as well as nulls and the search is case insensitive :) I have to admin that my knowledge in UTF-8 is not enough to answer the question if this could occur during normal usage. For example the second byte in multi-byte symbol is NULL. I would guess that's not true as it would break a lot of programs that depend on NULL delimited string but it's good if somebody could confirm. GNU grep indeed uses escaped regular expressions when the string is using multi-byte encoding and the search is case insensitive. If the encoding is UTF-8 then this strategy could be used in git too. Especially that git already have support and helper functions to work with UTF-8. As for the other multi-byte encodings - I think the things would become more complicated. As far I know in UTF-8 the '{' character for example is two bytes not one. Maybe really a support could be added only for the UTF-8 and if the string is not UTF-8 to issue a warning. So maybe the following makes sense when a grep search is performed: * check if the multi-byte encoding is used. If it's and the search is case insensitive and the encoding is not UTF-8 give a warning; * if pcre is used and the string is UTF-8 encoded set the PCRE_UTF8 flag; * if the search is case insensitive, the string is fixed and the encoding used is UTF-8 use regcomp instead of kwsearch and escape any regex special characters in the pattern; And the question with the behavior of pickaxe search remains open. Using kwset does not work with case insensitive non-ASCII searches. Instead of fixing grep.c maybe it's better if new function is introduced that performs keyword searches so it could be used by both grep, diffcore-pickaxe and any other code in the future that may require such functionality. Or maybe diffcore-pickaxe should use grep instead of directly kwset/regcomp Regards, Plamen Totev Оригинално писмо От: Duy Nguyen pclo...@gmail.com Относно: Re: Git grep does not support multi-byte characters (like UTF-8) До: Plamen Totev plamen.to...@abv.bg Изпратено на: 06.07.2015 15:23 I think we over-optimized a bit. If you your system provides regex with locale support (e.g. Linux) and you don't explicitly use fallback regex implementation, it should work. I suppose your test patterns look fixed (i.e. no regex special characters)? Can you try just add . and see if case insensitive matching works? This is indeed the problem. When I added the . the matching works just fine. -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Git grep does not support multi-byte characters (like UTF-8)
Junio C Hamano gits...@pobox.com writes: Plamen Totev plamen.to...@abv.bg writes: pickaxe search also uses kwsearch so the case insensitive search with it does not work (e.g. git log -i -S). Maybe this is a less of a problem here as one is expected to search for exact string (hence knows the case) You reasoned correctly, I think. Pickaxe, as one of the building blocks to implement Linus's ultimate change tracking tool [*1*], should never pay attention to -i. It is a step to finding the commit that touches the exact code block given (i.e. how do you drill down? part of $gmane/217 message). Thanks. [Footnote] *1* http://article.gmane.org/gmane.comp.version-control.git/217 Now that I read the link again and gave the matter a thought I'm not so sure. In some contexts the case of the words does not matter. In SQL for example. Let's consider a SQL script file that contains the following line: select name, address from customers; At some point we decide to change the coding style to: SELECT name, address FROM customers; What should pickaxe search return - the first commit where the line is introduced or the commit with the refactoring? From this point of view I think the -i switch makes sense. The SQL is not the only case insensitive language - BASIC and Pascal come into my mind (those two I was using while I was in the high school :)). Also I think it makes sense (maybe even more?) for natural languages. For example after editing a text a sentence could be split into two. Then the first word of the second sentence may change its case. Of course the natural languages always complicate the things a bit. An ultimate tracking tools should be able to handle typo fixes, punctuation changes, etc. But I'm getting a bit off-topic. What I wanted to say is that in some contexts it makes sense (at least to me) to have case insensitive pickaxe search. Or I'm missing something and there is a better tools to use is such cases? Regards, Plamen Totev -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] grep: use regcomp() for icase search with non-ascii patterns
On 07.07. 2015 at 02:02, Duy Nguyen pclo...@gmail.com wrote: On Tue, Jul 7, 2015 at 3:10 AM, René Scharfe l@web.de wrote: Am 06.07.2015 um 14:42 schrieb Nguyễn Thái Ngọc Duy: So the optimization before this patch was that if a string was searched for without -F then it would be treated as a fixed string anyway unless it contained regex special characters. Searching for fixed strings using the kwset functions is faster than using regcomp and regexec, which makes the exercise worthwhile. Your patch disables the optimization if non-ASCII characters are searched for because kwset handles case transformations only for ASCII chars. Another consequence of this limitation is that -Fi (explicit case-insensitive fixed-string search) doesn't work properly with non-ASCII chars neither. How can we handle this one? Fall back to regcomp by escaping all special characters? Or at least warn? Hehe.. I noticed it too shortly after sending the patch. I was torn between simply documenting the limitation and waiting for the next person to come and fix it, or quoting the regex then passing to regcomp. GNU grep does the quoting in this case, but that code is GPLv3 so we can't simply copy over. It could be a problem if we need to quote a regex in a multibyte charset where ascii is not a subset. But i guess we can just go with utf-8.. I played a little bit with the code and I came up with this function to escape regular expressions in utf-8. Hope it helps. static void escape_regexp(const char *pattern, size_t len, char **new_pattern, size_t *new_len) { const char *p = pattern; char *np = *new_pattern = xmalloc(2 * len); int chrlen; *new_len = len; while (len) { chrlen = mbs_chrlen(p, len, utf-8); if (chrlen == 1 is_regex_special(*pattern)) *np++ = '\\'; memcpy(np, pattern, chrlen); np += chrlen; pattern = p; } *new_len = np - *new_pattern; } -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Git grep does not support multi-byte characters (like UTF-8)
Hello, It looks like the git grep command does not support multi-byte character sets like UTF-8. As a result some of the grep functionality is not working. For example if you search for non Latin words the ignore case flag does not have effect(the search is case sensitive). I suspect there are some regular expressions that will not work as expected too. When I'm using the git from the shell I could use the GNU grep utility instead. But some tools like gitweb use git grep so they are not working properly with UTF-8 files with non Latin characters. Much of the grep code seems to be copied from the GNU grep utility but the multi-byte support is left out. I just wondered what could be the reason. Regards, Plamen Totev -- To unsubscribe from this list: send the line unsubscribe git in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html