Re: [PHP] case and accent - insensitive regular expression?

Andrew Ballard Tue, 15 Jul 2008 06:48:01 -0700

On Tue, Jul 15, 2008 at 5:38 AM, Yeti <[EMAIL PROTECTED]> wrote:
> I don't think using all these regular expressions is a very efficient way to
> do so. As I previously pointed out, there are many users who had a similar
> problem, which can be viewed at:
>
> http://it.php.net/manual/en/function.strtr.php
>
> One of my favourites is what derernst at gmx dot ch used.
>
> derernst at gmx dot ch
> wrote on 20-Sep-2005 07:29
> This works for me to remove accents for some characters of Latin-1, Latin-2
> and Turkish in a UTF-8 environment, where the htmlentities-based solutions
> fail:
>
> <?php
>
> function remove_accents($string, $german=false) {
>
>   // Single letters
>
>   $single_fr = explode(" ", "À Á Â Ã Ä Å Ą Ă Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î
> Ï İ Ł Ľ Ĺ Ñ Ń Ň Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Š Ś Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ž Ź Ż à á â ã ä å
> ą ă ç ć č ď đ è é ê ë ę ě ğ ì í î ï ı ł ľ ĺ ñ ń ň ð ò ó ô õ ö ø ő ŕ ř ś š ş
> ť ţ ù ú û ü ů ű ý ÿ ž ź ż");
>
>   $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I I
> I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a a
> a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s s
> t t u u u u u u y y z z z");
>
>   $single = array();
>
>   for ($i=0; $i<count($single_fr); $i++) {
>
>   $single[$single_fr[$i]] = $single_to[$i];
>
>   }
>
>   // Ligatures
>
>   $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe", "ß"=>"ss");
>
>   // German umlauts
>
>   $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue",
> "ü"=>"ue");
>
>   // Replace
>
>   $replacements = array_merge($single, $ligatures);
>
>   if ($german) $replacements = array_merge($replacements, $umlauts);
>
>   $string = strtr($string, $replacements);
>
>   return $string;
>
> }
>
> ?>
>
> I would change this function a bit ...
>
> <?php
> //echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes
> // NOTE: use UTF-8 as this document's encoding
> function remove_accents($string) {
>  $string = rawurlencode($string);
>  $replacements = array(
>  '%C3%A1' => 'a',
>  '%C3%A0' => 'a',
>  '%C3%A9' => 'e',
>  '%C3%A8' => 'e',
>  '%C3%AD' => 'i',
>  '%C3%AC' => 'i',
>  '%C3%B3' => 'o',
>  '%C3%B2' => 'o',
>  '%C3%BA' => 'u',
>  '%C3%B9' => 'u',
>  '%C3%81' => 'A',
>  '%C3%80' => 'A',
>  '%C3%89' => 'E',
>  '%C3%88' => 'E',
>  '%C3%8D' => 'I',
>  '%C3%8C' => 'I',
>  '%C3%93' => 'O',
>  '%C3%92' => 'O',
>  '%C3%9A' => 'U',
>  '%C3%99' => 'U'
>  );
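>  // Decode again so that characters not listed above (spaces, other
>  // non-ASCII letters) are not left percent-encoded in the result.
>  return rawurldecode(strtr($string, $replacements));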
> }
> //echo remove_accents("CÀfé"); // I know it's not spelled right
> echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ");
> // OUTPUT (again: I used UTF-8 for the document): aaeeiioouuAAEEIIOOUU
> ?>
>
> Ciao
>
> Yeti
>
> On Mon, Jul 14, 2008 at 8:20 PM, Andrew Ballard <[EMAIL PROTECTED]> wrote:
>>
>> On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
>> <[EMAIL PROTECTED]> wrote:
>> >>
>> >
>> > Brilliant !!!
>> >
>> > so you replace every occurrence of every accent variation with all the
>> > accent variations...
>> >
>> > OK, that's it!
>> >
>> > only some more doubts ( regex are still a headache for me... )
>> >
>> > preg_replace('/[iìíîïĩīĭįı]/iu',...  -- what's the meaning of iu after
>> > the match string?
>>
>> This page explains them both.
>> http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
>>
>> > preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for? -- every
>> > occurrence of aàáâãäåǻāăą NOT followed by e?
>>
>> Yes. It matches any character based on the Latin 'a' that is not
>> followed by an 'e'. It keeps the pattern from matching the 'a' when it
>> immediately precedes an 'e', i.e. when the two letters spell out the
>> ligature 'æ', in words like these:
>>
>> http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
>> (However, that may cause problems with words that have other variants
>> of 'ae' in them. I'll leave that to you to resolve.)
>> http://us.php.net/manual/en/regexp.reference.php
>>
>>
>>
>> > Many thanks again for your effort,
>> >
>> > I'm definitely on the good way
>> >
>> >      Giulio
>> >
>> >
>> >>
>> >> I was intrigued by your example, so I played around with it some more
>> >> this morning. My own quick web search yielded a lot of results for
>> >> highlighting search terms, but none that I found did what you're
>> >> after. (I admit I didn't look very deep.) I was up to something like
>> >> this before your reply came in. It's still by no means complete. It
>> >> even handles simple English plurals (words ending in 's' or 'es'), but
>> >> not variations that require changing the word base (like 'daisy' to
>> >> 'daisies').
>> >>
>> >> <?php
>> >> function highlight_search_terms($phrase, $string) {
>> >>   $non_letter_chars = '/[^\pL]/iu';
>> >>   $words = preg_split($non_letter_chars, $phrase);
>> >>
>> >>   $search_words = array();
>> >>   foreach ($words as $word) {
>> >>       if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
>> >>           $search_words[] = $word;
>> >>       }
>> >>   }
>> >>
>> >>   $search_words = array_unique($search_words);
>> >>
>> >>   foreach ($search_words as $word) {
>> >>       $search = preg_quote($word);
>> >>
>> >>       /* repeat for each possible accented character */
>> >>       $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
>> >>       $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
>> >>       $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu',
>> >> '[aàáâãäåǻāăą]', $search);
>> >>       $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
>> >>       $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
>> >>       $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu',
>> >> '[eèéêëēĕėęě]', $search);
>> >>       $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
>> >>       $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
>> >>       $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]',
>> >> $search);
>> >>       $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
>> >>       $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
>> >>       $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
>> >>       $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
>> >>       $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu',
>> >> '[oòóôõöōŏőǿơ]', $search);
>> >>       $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
>> >>       $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
>> >>       $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
>> >>       $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu',
>> >> '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
>> >>       $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
>> >>       $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
>> >>       $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
>> >>
>> >>
>> >>       $string = preg_replace('/\b' . $search . '(e?s)?\b/iu',
>> >>           '<span class="keysearch">$0</span>', $string);
>> >>   }
>> >>
>> >>   return $string;
>> >>
>> >> }
>> >> ?>
>> >>
>> >> I still can't help feeling there must be some better way, though.
>> >>
>> >>>
>> >>> well, i think I'm on the good way now, unfortunately I have some other
>> >>> urgent work and can't try it immediately, but I'll let you know    :)
>> >>>
>> >>> thank you!
>> >>>
>> >>>   Giulio
>> >>
>> >>
>> >> Andrew
>> >>
>> >>
>> >
>> >
>
>


I agree it doesn't seem very efficient, but I haven't come up with
anything better. The problem with what you posted is that the OP
wants to preserve the accented characters, NOT replace them. All
he wants to do is wrap some tags around the search terms so that they
are highlighted. I guess he could use your function to replace all the
accented characters with plain ones in a copy of the original
string, search that copy with strpos() or similar to find the index
of each occurrence, and then wrap the text at the same index in the
original string. The indexes only line up if they are counted in
characters rather than bytes (an accented letter takes more than one
byte in UTF-8) and every replacement is one character for one
character. To me, this seems even less efficient than the regular
expressions.
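
A rough sketch of that copy-and-scan idea, for illustration only. It
assumes the mbstring extension, UTF-8 input, and a folding map that
replaces exactly one character with one character (so no 'æ' to 'ae').
fold_for_search() and highlight_by_offset() are names I made up, and
the map is deliberately tiny:

```php
<?php
// Sketch only: needs the mbstring extension; all names here are hypothetical.

// Fold to lowercase ASCII one character at a time, so that character
// offsets in the folded copy line up with the original string.
function fold_for_search($text) {
    // Deliberately tiny map; extend it, but keep every entry one-to-one.
    $map = array(
        'à' => 'a', 'á' => 'a', 'â' => 'a', 'ä' => 'a',
        'À' => 'a', 'Á' => 'a', 'Â' => 'a', 'Ä' => 'a',
        'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
        'È' => 'e', 'É' => 'e', 'Ê' => 'e', 'Ë' => 'e',
    );
    $text = strtr($text, $map);
    // Byte-wise ASCII lowercasing; multibyte UTF-8 sequences are untouched.
    return strtr($text, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ',
                        'abcdefghijklmnopqrstuvwxyz');
}

function highlight_by_offset($term, $text) {
    $needle   = fold_for_search($term);
    $haystack = fold_for_search($text);
    $length   = mb_strlen($needle, 'UTF-8');
    $total    = mb_strlen($text, 'UTF-8');
    if ($length === 0) {
        return $text;
    }

    $result = '';
    $copied = 0; // number of characters of $text already copied to $result
    $pos = mb_strpos($haystack, $needle, 0, 'UTF-8');
    while ($pos !== false) {
        // Unmatched stretch first, then the match taken from the ORIGINAL text.
        $result .= mb_substr($text, $copied, $pos - $copied, 'UTF-8');
        $result .= '<span class="keysearch">'
                 . mb_substr($text, $pos, $length, 'UTF-8') . '</span>';
        $copied = $pos + $length;
        $pos = mb_strpos($haystack, $needle, $copied, 'UTF-8');
    }
    // Whatever is left after the last match.
    return $result . mb_substr($text, $copied, $total, 'UTF-8');
}

echo highlight_by_offset('cafe', 'Un café, un CAFÉ'), "\n";
?>
```

Every match costs several mb_* calls on top of building the folded
copy, and unlike the regular expressions this version does no word
boundary or plural handling.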

Andrew
