On Wed, 2011-11-02 at 11:29 -0700, Marvin Humphrey wrote:
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.

Perhaps the sanity checking could be controlled by an option that defaults to
'on'. Then people who *know* their setup is UTF-8 clean can call something like
$indexer->no_validate_utf8() to avoid the performance penalty.

Cheers
Grant
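
To illustrate the opt-out idea above, here is a rough sketch, written in Python rather than Perl purely for brevity. The Indexer class, its add_doc() method, and the validate_utf8 attribute are all hypothetical stand-ins and not part of any existing API; only the no_validate_utf8() name comes from the suggestion above. It assumes fields arrive as raw bytes and that a strict decode is an acceptable validity check.

```python
class Indexer:
    """Hypothetical indexer that checks incoming field bytes unless told not to."""

    def __init__(self):
        self.validate_utf8 = True  # sanity checking defaults to 'on'
        self.docs = []

    def no_validate_utf8(self):
        # Opt out: the caller asserts that its input is already clean UTF-8.
        self.validate_utf8 = False

    def add_doc(self, fields):
        # fields maps a field name to its raw bytes.
        if self.validate_utf8:
            for name, raw in fields.items():
                try:
                    raw.decode("utf-8")  # strict decoding rejects malformed sequences
                except UnicodeDecodeError as err:
                    raise ValueError(
                        f"field {name!r} is not valid UTF-8: {err}"
                    ) from err
        # Nothing is stored until every field has passed the check.
        self.docs.append(fields)
```

With the default setting, a bad sequence is rejected before it can reach a commit; after calling no_validate_utf8() the same bytes are accepted unchecked, which is the trade-off for skipping the scan.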
