On Tue, Jul 15, 2008 at 9:46 AM, Andrew Ballard <[EMAIL PROTECTED]> wrote:
> On Tue, Jul 15, 2008 at 5:38 AM, Yeti <[EMAIL PROTECTED]> wrote:
>> I don't think using all these regular expressions is a very efficient way
>> to do this. As I previously pointed out, many users have had a similar
>> problem, which can be viewed at:
>>
>> http://it.php.net/manual/en/function.strtr.php
>>
>> One of my favourites is what derernst at gmx dot ch used.
>>
>> derernst at gmx dot ch
>> wrote on 20-Sep-2005 07:29
>> This works for me to remove accents for some characters of Latin-1, Latin-2
>> and Turkish in a UTF-8 environment, where the htmlentities-based solutions
>> fail:
>>
>> <?php
>> function remove_accents($string, $german=false) {
>>     // Single letters
>>     $single_fr = explode(" ", "À Á Â Ã Ä Å Ą Ă Ç Ć Č Ď Đ Ð È É Ê Ë Ę Ě Ğ Ì Í Î Ï İ Ł Ľ Ĺ Ñ Ń Ň Ò Ó Ô Õ Ö Ø Ő Ŕ Ř Š Ś Ş Ť Ţ Ù Ú Û Ü Ů Ű Ý Ž Ź Ż à á â ã ä å ą ă ç ć č ď đ è é ê ë ę ě ğ ì í î ï ı ł ľ ĺ ñ ń ň ð ò ó ô õ ö ø ő ŕ ř ś š ş ť ţ ù ú û ü ů ű ý ÿ ž ź ż");
>>     $single_to = explode(" ", "A A A A A A A A C C C D D D E E E E E E G I I I I I L L L N N N O O O O O O O R R S S S T T U U U U U U Y Z Z Z a a a a a a a a c c c d d e e e e e e g i i i i i l l l n n n o o o o o o o o r r s s s t t u u u u u u y y z z z");
>>     $single = array();
>>     for ($i=0; $i<count($single_fr); $i++) {
>>         $single[$single_fr[$i]] = $single_to[$i];
>>     }
>>     // Ligatures
>>     $ligatures = array("Æ"=>"Ae", "æ"=>"ae", "Œ"=>"Oe", "œ"=>"oe", "ß"=>"ss");
>>     // German umlauts
>>     $umlauts = array("Ä"=>"Ae", "ä"=>"ae", "Ö"=>"Oe", "ö"=>"oe", "Ü"=>"Ue", "ü"=>"ue");
>>     // Replace
>>     $replacements = array_merge($single, $ligatures);
>>     if ($german) $replacements = array_merge($replacements, $umlauts);
>>     $string = strtr($string, $replacements);
>>     return $string;
>> }
>> ?>
>>
>> I would change this function a bit ...
>>
>> <?php
>> //echo rawurlencode("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ"); // RFC 1738 codes; NOTE: One
>> // might use UTF-8 as this document's encoding
>> function remove_accents($string) {
>>     $string = rawurlencode($string);
>>     $replacements = array(
>>         '%C3%A1' => 'a',
>>         '%C3%A0' => 'a',
>>         '%C3%A9' => 'e',
>>         '%C3%A8' => 'e',
>>         '%C3%AD' => 'i',
>>         '%C3%AC' => 'i',
>>         '%C3%B3' => 'o',
>>         '%C3%B2' => 'o',
>>         '%C3%BA' => 'u',
>>         '%C3%B9' => 'u',
>>         '%C3%81' => 'A',
>>         '%C3%80' => 'A',
>>         '%C3%89' => 'E',
>>         '%C3%88' => 'E',
>>         '%C3%8D' => 'I',
>>         '%C3%8C' => 'I',
>>         '%C3%93' => 'O',
>>         '%C3%92' => 'O',
>>         '%C3%9A' => 'U',
>>         '%C3%99' => 'U'
>>     );
>>     // Decode again so that everything else rawurlencode() escaped
>>     // (spaces, punctuation, unmapped characters) is restored.
>>     return rawurldecode(strtr($string, $replacements));
>> }
>> //echo remove_accents("CÀfé"); // I know it's not spelled right
>> echo remove_accents("áàéèíìóòúùÁÀÉÈÍÌÓÒÚÙ");
>> // OUTPUT (again: I used UTF-8 for the document): aaeeiioouuAAEEIIOOUU
>> ?>
>>
>> Ciao
>>
>> Yeti
>>
>> On Mon, Jul 14, 2008 at 8:20 PM, Andrew Ballard <[EMAIL PROTECTED]> wrote:
>>>
>>> On Mon, Jul 14, 2008 at 1:35 PM, Giulio Mastrosanti
>>> <[EMAIL PROTECTED]> wrote:
>>> >
>>> > Brilliant !!!
>>> >
>>> > so you replace every occurrence of every accent variation with all the
>>> > accent variations...
>>> >
>>> > OK, that's it!
>>> >
>>> > only some more doubts ( regex are still a headache for me... )
>>> >
>>> > preg_replace('/[iìíîïĩīĭįı]/iu',... -- what's the meaning of iu after
>>> > the match string?
>>>
>>> This page explains them both.
>>> http://us.php.net/manual/en/reference.pcre.pattern.modifiers.php
>>>
>>> > preg_replace('/[aàáâãäåǻāăą](?!e)/iu',... what's (?!e) for? -- every
>>> > occurrence of aàáâãäåǻāăą NOT followed by e?
>>>
>>> Yes. It matches any character based on the latin 'a' that is not
>>> followed by an 'e'. It keeps the pattern from matching the 'a' when it
>>> immediately precedes an 'e' for the character 'ae' for words like
>>> these:
>>>
>>> http://en.wikipedia.org/wiki/List_of_words_that_may_be_spelled_with_a_ligature
>>> (However, that may cause problems with words that have other variants
>>> of 'ae' in them. I'll leave that to you to resolve.)
>>> http://us.php.net/manual/en/regexp.reference.php
>>>
>>> > Many thanks again for your effort,
>>> >
>>> > I'm definitely on the good way
>>> >
>>> > Giulio
>>> >
>>> >> I was intrigued by your example, so I played around with it some more
>>> >> this morning. My own quick web search yielded a lot of results for
>>> >> highlighting search terms, but none that I found did what you're
>>> >> after. (I admit I didn't look very deep.) I was up to something like
>>> >> this before your reply came in. It's still by no means complete. It
>>> >> even handles simple English plurals (words ending in 's' or 'es'), but
>>> >> not variations that require changing the word base (like 'daisy' to
>>> >> 'daisies').
>>> >>
>>> >> <?php
>>> >> function highlight_search_terms($phrase, $string) {
>>> >>     $non_letter_chars = '/[^\pL]/iu';
>>> >>     $words = preg_split($non_letter_chars, $phrase);
>>> >>
>>> >>     $search_words = array();
>>> >>     foreach ($words as $word) {
>>> >>         if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
>>> >>             $search_words[] = $word;
>>> >>         }
>>> >>     }
>>> >>
>>> >>     $search_words = array_unique($search_words);
>>> >>
>>> >>     foreach ($search_words as $word) {
>>> >>         $search = preg_quote($word);
>>> >>
>>> >>         /* repeat for each possible accented character */
>>> >>         $search = preg_replace('/(ae|æ|ǽ)/iu', '(ae|æ|ǽ)', $search);
>>> >>         $search = preg_replace('/(oe|œ)/iu', '(oe|œ)', $search);
>>> >>         $search = preg_replace('/[aàáâãäåǻāăą](?!e)/iu',
>>> >>             '[aàáâãäåǻāăą]', $search);
>>> >>         $search = preg_replace('/[cçćĉċč]/iu', '[cçćĉċč]', $search);
>>> >>         $search = preg_replace('/[dďđ]/iu', '[dďđ]', $search);
>>> >>         $search = preg_replace('/(?<![ao])[eèéêëēĕėęě]/iu',
>>> >>             '[eèéêëēĕėęě]', $search);
>>> >>         $search = preg_replace('/[gĝğġģ]/iu', '[gĝğġģ]', $search);
>>> >>         $search = preg_replace('/[hĥħ]/iu', '[hĥħ]', $search);
>>> >>         $search = preg_replace('/[iìíîïĩīĭįı]/iu', '[iìíîïĩīĭįı]',
>>> >>             $search);
>>> >>         $search = preg_replace('/[jĵ]/iu', '[jĵ]', $search);
>>> >>         $search = preg_replace('/[kķĸ]/iu', '[kķĸ]', $search);
>>> >>         $search = preg_replace('/[lĺļľŀł]/iu', '[lĺļľŀł]', $search);
>>> >>         $search = preg_replace('/[nñńņňŉŋ]/iu', '[nñńņňŉŋ]', $search);
>>> >>         $search = preg_replace('/[oòóôõöōŏőǿơ](?!e)/iu',
>>> >>             '[oòóôõöōŏőǿơ]', $search);
>>> >>         $search = preg_replace('/[rŕŗř]/iu', '[rŕŗř]', $search);
>>> >>         $search = preg_replace('/[sśŝşš]/iu', '[sśŝşš]', $search);
>>> >>         $search = preg_replace('/[tţťŧ]/iu', '[tţťŧ]', $search);
>>> >>         $search = preg_replace('/[uùúûüũūŭůűųǔǖǘǚǜ]/iu',
>>> >>             '[uùúûüũūŭůűųǔǖǘǚǜ]', $search);
>>> >>         $search = preg_replace('/[wŵ]/iu', '[wŵ]', $search);
>>> >>         $search = preg_replace('/[yýÿŷ]/iu', '[yýÿŷ]', $search);
>>> >>         $search = preg_replace('/[zźżž]/iu', '[zźżž]', $search);
>>> >>
>>> >>         $string = preg_replace('/\b' . $search . '(e?s)?\b/iu',
>>> >>             '<span class="keysearch">$0</span>', $string);
>>> >>     }
>>> >>
>>> >>     return $string;
>>> >> }
>>> >> ?>
>>> >>
>>> >> I still can't help feeling there must be some better way, though.
>>> >>
>>> >>> well, I think I'm on the good way now, unfortunately I have some other
>>> >>> urgent work and can't try it immediately, but I'll let you know :)
>>> >>>
>>> >>> thank you!
>>> >>>
>>> >>> Giulio
>>> >>
>>> >> Andrew
>
> I agree it doesn't seem very efficient to me, but I haven't come up
> with anything better. The problem with what you posted is that the OP
> was looking to preserve the accented characters, NOT replace them. All
> he wants to do is wrap some tags around the search terms so that they
> are highlighted. I guess he could use your function to replace all the
> accented characters with regular ones in a copy of the original
> string, and then scan that copy using strpos() or similar to find the
> index of each occurrence that needs to be replaced in the original
> string. This seems even less efficient than the regular expressions,
> to me.
>
> Andrew

Well, OK, I can think of one optimization. This takes advantage of the
fact that preg_replace() can accept arrays as parameters. In a couple of
very quick tests this version was roughly 30% faster than my previous
version:

<?php
function highlight_search_terms2($phrase, $string) {
    $non_letter_chars = '/[^\pL]/iu';
    $words = preg_split($non_letter_chars, $phrase);

    $search_words = array();
    foreach ($words as $word) {
        // strlen() counts bytes, so a two-letter word containing an
        // accented character (e.g. "où", 3 bytes in UTF-8) also passes.
        if (strlen($word) > 2 && !preg_match($non_letter_chars, $word)) {
            $search_words[] = $word;
        }
    }

    $search_words = array_unique($search_words);

    $patterns = array(
        /* repeat for each possible accented character */
        '/(ae|æ|ǽ)/iu' => '(ae|æ|ǽ)',
        '/(oe|œ)/iu' => '(oe|œ)',
        '/[aàáâãäåǻāăą](?!e)/iu' => '[aàáâãäåǻāăą]',
        '/[cçćĉċč]/iu' => '[cçćĉċč]',
        '/[dďđ]/iu' => '[dďđ]',
        '/(?<![ao])[eèéêëēĕėęě]/iu' => '[eèéêëēĕėęě]',
        '/[gĝğġģ]/iu' => '[gĝğġģ]',
        '/[hĥħ]/iu' => '[hĥħ]',
        '/[iìíîïĩīĭįı]/iu' => '[iìíîïĩīĭįı]',
        '/[jĵ]/iu' => '[jĵ]',
        '/[kķĸ]/iu' => '[kķĸ]',
        '/[lĺļľŀł]/iu' => '[lĺļľŀł]',
        '/[nñńņňŉŋ]/iu' => '[nñńņňŉŋ]',
        '/[oòóôõöōŏőǿơ](?!e)/iu' => '[oòóôõöōŏőǿơ]',
        '/[rŕŗř]/iu' => '[rŕŗř]',
        '/[sśŝşš]/iu' => '[sśŝşš]',
        '/[tţťŧ]/iu' => '[tţťŧ]',
        '/[uùúûüũūŭůűųǔǖǘǚǜ]/iu' => '[uùúûüũūŭůűųǔǖǘǚǜ]',
        '/[wŵ]/iu' => '[wŵ]',
        '/[yýÿŷ]/iu' => '[yýÿŷ]',
        '/[zźżž]/iu' => '[zźżž]',
    );

    // Split the map once, outside the loop: keys are the patterns,
    // values are the replacements, in matching order.
    $find = array_keys($patterns);
    $replace = array_values($patterns);

    foreach ($search_words as $word) {
        $search = preg_quote($word, '/');

        // One call applies every pattern in turn to the search word.
        $search = preg_replace($find, $replace, $search);

        $string = preg_replace('/\b' . $search . '(e?s)?\b/iu',
            '<span class="keysearch">$0</span>', $string);
    }

    return $string;
}
?>

Andrew
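
For anyone following along, here is a minimal sketch of what the array form of preg_replace() does to a single search word. The two-entry map and the <b> tag are cut down purely for illustration (they are assumptions, not the full table above). The patterns are applied one after another, each to the result of the previous one, and the expanded pattern is then run against the untouched text, so the accents in the original survive.

```php
<?php
// Hypothetical, shortened pattern map (illustration only).
$patterns = array(
    '/[aàáâ](?!e)/iu' => '[aàáâ]',
    '/[cç]/iu'        => '[cç]',
);

// Expand the plain search word into an accent-tolerant pattern.
$search = preg_replace(array_keys($patterns), array_values($patterns), 'facade');
echo $search, "\n";       // f[aàáâ][cç][aàáâ]de

// Highlight in the original text; the accented spelling is kept as-is.
$highlighted = preg_replace('/\b' . $search . '\b/iu', '<b>$0</b>', 'La façade');
echo $highlighted, "\n";  // La <b>façade</b>
```

One caveat with this approach: each search word is matched against text that may already contain markup inserted for an earlier word, so a search term such as "span" or "class" could match inside the inserted tags.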