Hi André,
Are you still planning a Zend_Utf8 class with different
functionality than the UTF8 helper class for Zend_Locale*?
I know the i18n/locale team is working on a flyweight UTF8 helper class
for private use by Zend_Locale* related classes. This helper class will
include only the functions absolutely needed in order for the locale
classes to work. For those new to the ZF, a few weeks ago, after a long
discussion on this list, we decided to not attempt to duplicate the UTF8
functionality coming in PHP6, and not attempt to make the entire ZF work
with UTF8 strings (note: mbstring extension helps with UTF8).
http://framework.zend.com/wiki/display/ZFDEV/i18n+Locale+Team
Cheers,
Gavin
Alexander Veremyev wrote:
Yes. That's a problem.
Hm... There are two possible solutions here.
1. Wait until Zend_UTF8 can help with this.
2. Move the transliteration (the current workaround) somewhere else, so
that stored fields remain unchanged.
Which is better?
With best regards,
Alexander Veremyev.
Christer Edvartsen wrote:
Converting to ASCII//TRANSLIT is done in the Zend_Search_Lucene_Field
constructor as far as I can see, so what I have to do is to convert
the search string in the same fashion and then convert the search
hits before I display them. This is where I start getting problems.
If I do a var_dump(iconv('ISO-8859-1', 'ASCII//TRANSLIT',
'æ,ø,å,Æ,Ø,Å')); I get string(13) "ae,o,a,AE,O,A"
The problem is that I cannot seem to translate the search
hits back to ISO-8859-1 to get back my precious Norwegian characters.
Any tips?
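The round trip cannot work because the transliteration is lossy: both
'å' and 'a' end up as 'a', so nothing can tell them apart afterwards.
A minimal sketch of the second option mentioned in this thread (keep the
stored text unchanged and use the folded form only for matching) might
look like the following. FOLD_MAP, fold_for_search() and make_document()
are hypothetical names, not part of Zend_Search_Lucene. The table is
hand-written instead of relying on iconv, whose //TRANSLIT output
depends on the platform, and the source file is assumed to be saved as
UTF-8.

```php
<?php
// Hypothetical folding table for the six characters discussed here.
// Keys are UTF-8 literals; strtr() replaces them byte sequence by byte sequence.
const FOLD_MAP = [
    'æ' => 'ae', 'ø' => 'o', 'å' => 'a',
    'Æ' => 'AE', 'Ø' => 'O', 'Å' => 'A',
];

// Lossy, one-way conversion used only to build search keys.
function fold_for_search(string $text): string
{
    return strtr($text, FOLD_MAP);
}

// Keep the original text for display and the folded text for matching.
function make_document(string $title): array
{
    return [
        'title'     => $title,
        'title_key' => fold_for_search($title),
    ];
}

$doc = make_document('Blåbærsyltetøy');

// Fold the query the same way, then compare folded against folded.
$query   = fold_for_search('blåbær');
$matches = stripos($doc['title_key'], $query) !== false;

// Display the untouched original, never the folded key.
echo $matches ? $doc['title'] : 'no match', "\n";
```

Since the original string is never overwritten, there is nothing to
convert back when the hits are displayed.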
Alexander Veremyev wrote:
Hi Christer,
UTF-8 can be completely handled with 'ascii//translit' conversion.
Take a look at
http://framework.zend.com/manual/en/zend.search.charset.html
iconv('ISO-8859-1', 'ASCII//TRANSLIT', $docText) converts umlauts to
a two-character representation.
E.g. ü -> ue, æ -> ae, å -> aa, ö -> oe.
(I am not sure on ø)
Thus 'für' will be translated to 'fuer'.
If the same conversion is applied to the search query, you will get
the expected search results.
I don't like this solution, but it works.
Zend_Search_Lucene completely supports utf-8 internally (for index
files), but the problem is in the document tokenizer and query parser.
We need utf-8 versions of the ctype_alpha()/ctype_digit() functions
(mbstring extension can't help with this).
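One possible way to get such checks without mbstring is PCRE's Unicode
character properties. This is only a sketch and assumes that PHP's PCRE
is built with UTF-8 and Unicode property support. The names
utf8_is_alpha(), utf8_is_digit() and utf8_tokenize() are hypothetical,
not part of Zend_Search_Lucene or Zend_UTF8.

```php
<?php
// True if $text is non-empty, valid UTF-8 and consists only of letters.
// \p{L} matches any Unicode letter; the /u modifier treats the subject as UTF-8.
function utf8_is_alpha(string $text): bool
{
    return preg_match('/\A\p{L}+\z/u', $text) === 1;
}

// True if $text is non-empty and consists only of decimal digits (\p{Nd}).
function utf8_is_digit(string $text): bool
{
    return preg_match('/\A\p{Nd}+\z/u', $text) === 1;
}

// Split text into runs of letters and digits, similar to what a tokenizer
// built on ctype_alpha()/ctype_digit() does for ASCII input.
function utf8_tokenize(string $text): array
{
    if (preg_match_all('/[\p{L}\p{Nd}]+/u', $text, $matches) === false) {
        return []; // e.g. the input was not valid UTF-8
    }
    return $matches[0];
}
```

Like ctype_alpha(''), both checks return false for an empty string, and
they also return false for input that is not valid UTF-8.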
As I see it, Zend_UTF8 can help with this (http://www.utf8-chartable.de/
can provide this information). And, I hope, it will :)
(There are no performance issues for Zend_Search_Lucene)
With best regards,
Alexander Veremyev.
Christer Edvartsen wrote:
I guess the main problem is that utf8 is not fully implemented
yet... Maybe you know some more about when this will happen? Could
you also give me some tips about how to handle the characters I am
having problems with? (æ, ø and å in ISO-8859-1)
Alexander Veremyev wrote:
Hi Facundo,
I think there is not much discussion because almost everything
is clear there.
It's just a port. We only need to carry over functionality from Java
Lucene accurately enough, and understand when we should stop :)
But if you have any thoughts, you are welcome!
I have heard that it's used in some projects now, but I don't know
the details. It would be great to find out.
As far as I can see, Zend_Search_Lucene is stable enough, and I am
working on automatic index optimization right now.
It will make us independent of Java tools (e.g. the Luke tool)
and will also close the memory usage issue
(http://framework.zend.com/issues/browse/ZF-88).
With best regards,
Alexander Veremyev.
Facundo Pagani wrote:
Hi there, people!
What about Zend_Search_Lucene? I don't see anyone talking about it
... Is anyone doing serious production work with it?
Can you share your experiences?
Be in touch!
Thanks in advance.
--
---------------------------------------------------
Facundo M. Pagani
Ingeniería | Sectorial de Informática
Ministerio de Hacienda y Finanzas
Santa Fe - (C.P.3000 ) - Argentina