Re: [fw-general] Zend_Seach_Lucene and UTF-8

Thomas Weidner Fri, 22 Sep 2006 09:30:52 -0700

Hi,

Why not do a


// protect 'æ' before the ASCII//TRANSLIT conversion
$string = str_replace('æ', '&#xE6;', $string);

and afterwards a

// restore 'æ' in the search hits
$string = str_replace('&#xE6;', 'æ', $string);


Greetings
Thomas

----- Original Message -----
From: "Christer Edvartsen" <[EMAIL PROTECTED]>
To: "Alexander Veremyev" <[EMAIL PROTECTED]>
Cc: "Facundo Pagani" <[EMAIL PROTECTED]>; <[email protected]>
Sent: Friday, September 22, 2006 5:13 PM
Subject: Re: [fw-general] Zend_Seach_Lucene and UTF-8

Converting to ASCII//TRANSLIT is done in the Zend_Search_Lucene_Field constructor as far as I can see, so what I have to do is convert the search string in the same fashion and then convert the search hits before I display them. This is where I start getting problems.
If I do a var_dump(iconv('ISO-8859-1', 'ASCII//TRANSLIT', 'æ,ø,å,Æ,Ø,Å')); I get string(13) "ae,o,a,AE,O,A"
The problem is that I cannot seem to translate the search hits back to ISO-8859-1 to get back my precious Norwegian characters. Any tips?
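
The round trip fails because the transliteration is many-to-one: once 'å' has become 'a', nothing in the result says which letter it came from. A minimal sketch of one way around this, assuming UTF-8 source text and a hypothetical helper foldToAscii() that uses an explicit map in place of iconv's //TRANSLIT (whose output differs between platforms): keep the original string next to the folded one and display the original.

```php
<?php
// Hypothetical helper: fold Norwegian letters to ASCII with an explicit map.
// This is a stand-in for iconv(..., 'ASCII//TRANSLIT', ...), not a Zend API.
function foldToAscii(string $text): string
{
    return strtr($text, [
        'æ' => 'ae', 'ø' => 'o', 'å' => 'a',
        'Æ' => 'AE', 'Ø' => 'O', 'Å' => 'A',
    ]);
}

// Store the folded form for searching and the original for display,
// instead of trying to reverse the folding afterwards.
$titles = ['Blåbærsyltetøy', 'Rødgrød'];
$index = [];
foreach ($titles as $title) {
    $index[] = ['search' => foldToAscii($title), 'display' => $title];
}

// 'ål' and 'al' fold to the same string, so no inverse mapping can exist.
var_dump(foldToAscii('ål') === foldToAscii('al')); // bool(true)
```

In Zend_Search_Lucene this might translate to a second, stored-only field holding the untouched text, but that is an assumption and not checked against the current release.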
Alexander Veremyev wrote:
Hi Christer,

UTF-8 can be completely handled with 'ascii//translit' conversion.
Take a look at
http://framework.zend.com/manual/en/zend.search.charset.html
iconv('ISO-8859-1', 'ASCII//TRANSLIT', $docText) converts umlauts to a two-symbol representation.
Ex. ü -> ue, æ -> ae, å -> aa, ö -> oe.
(I am not sure on ø)

Thus 'für' will be translated to 'fuer'.
If the same translation is applied to the search query, you will get search results as expected.
I don't like this solution, but it works.
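
The idea of applying one translation on both sides can be sketched as follows. This is only an illustration under stated assumptions: normalizeForSearch() and search() are hypothetical names, the text is assumed to be UTF-8, and an explicit map stands in for iconv's //TRANSLIT so the result does not depend on the platform.

```php
<?php
// Hypothetical helper: the single normalization used for documents AND queries.
function normalizeForSearch(string $text): string
{
    $folded = strtr($text, [
        'ü' => 'ue', 'ö' => 'oe', 'ä' => 'ae', 'ß' => 'ss',
        'Ü' => 'Ue', 'Ö' => 'Oe', 'Ä' => 'Ae',
    ]);
    return strtolower($folded);
}

// Hypothetical toy search: return the ids of documents containing the query word.
function search(array $documents, string $query): array
{
    $needle = normalizeForSearch($query);
    $hits = [];
    foreach ($documents as $id => $text) {
        // Split the normalized text into ASCII words.
        $words = preg_split('/[^a-z0-9]+/', normalizeForSearch($text), -1, PREG_SPLIT_NO_EMPTY);
        if (in_array($needle, $words, true)) {
            $hits[] = $id;
        }
    }
    return $hits;
}

$documents = [1 => 'Ein Geschenk für dich', 2 => 'Vier Freunde'];
print_r(search($documents, 'für'));  // matches document 1
print_r(search($documents, 'fuer')); // matches document 1 as well
```

Because both sides pass through the same function, 'für' and 'fuer' become indistinguishable, which is exactly the trade-off of this approach.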
Zend_Search_Lucene completely supports UTF-8 internally (for index files), but the problem is in the document tokenizer and query parser.
We need UTF-8 versions of the ctype_alpha()/ctype_digit() functions (the mbstring extension can't help with this).
As I see it, Zend_UTF8 can help with this (http://www.utf8-chartable.de/ can give this information). And, I hope, will do :)
(There are no performance issues for Zend_Search_Lucene)
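
One possible way to get UTF-8 aware equivalents of ctype_alpha()/ctype_digit() is PCRE's Unicode property classes. The sketch below is not what Zend_Search_Lucene does; utf8IsAlpha() and utf8IsDigit() are hypothetical names, and it assumes a PHP build whose PCRE has Unicode property support (needed for the /u modifier and \p{...}).

```php
<?php
// Hypothetical helpers, not part of Zend Framework.
// \p{L} matches any Unicode letter, \p{Nd} any decimal digit;
// the /u modifier makes PCRE treat pattern and subject as UTF-8.
function utf8IsAlpha(string $text): bool
{
    // preg_match() returns false on invalid UTF-8, which also yields false here.
    return preg_match('/\A\p{L}+\z/u', $text) === 1;
}

function utf8IsDigit(string $text): bool
{
    return preg_match('/\A\p{Nd}+\z/u', $text) === 1;
}

var_dump(utf8IsAlpha('für'));    // bool(true)
var_dump(utf8IsAlpha('blåbær')); // bool(true)
var_dump(utf8IsAlpha('abc1'));   // bool(false)
var_dump(utf8IsDigit('2006'));   // bool(true)
```

Like ctype_alpha(), both helpers return false for an empty string.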


With best regards,
   Alexander Veremyev.




Christer Edvartsen wrote:
I guess the main problem is that UTF-8 is not fully implemented yet... Maybe you know some more about when this will happen? Could you also give me some tips about how to handle the characters I am having problems with? (æ, ø and å in ISO-8859-1)
Alexander Veremyev wrote:
Hi Facundo,
I think that we have not had a lot of discussion because almost everything is clear there. It's just a port. We should only move functionality over from Java Lucene accurately enough, and understand when we should stop :)
But if you have any thoughts, you are welcome!
I heard that it's used in some projects now, but I don't know the details. It would be great to find out.
As I see it, Zend_Search_Lucene is stable enough, and I am working on automatic index optimization right now. It will make us independent of Java tools (e.g. the Luke tool) and will also close the memory usage issue (http://framework.zend.com/issues/browse/ZF-88).
With best regards,
   Alexander Veremyev.


Facundo Pagani wrote:
Hi there ppl!
What about Zend_Search_Lucene? I don't see any1 talking about it ... Is any1 doing some serious/production work/project with it? Can u share ur xperiences?
Be in touch!
Thanks in advance.

--
---------------------------------------------------
Facundo M. Pagani
Ingeniería | Sectorial de Informática
Ministerio de Hacienda y Finanzas
Santa Fe - (C.P.3000 ) - Argentina
--
mvh
Christer Edvartsen

"I will not skateboard in the halls"
