On Wed, 2011-11-02 at 11:29 -0700, Marvin Humphrey wrote:
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.

Perhaps the sanity checking could be controlled by an option that
defaults to 'on'.  Then people who *know* their setup is UTF-8 clean can
call something like $indexer->no_validate_utf8() to avoid the
performance penalty.
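
Roughly what I have in mind, as a minimal sketch in Python rather than
the real indexer code. The Indexer class, its add_field method and the
validate_utf8 attribute are made up for illustration; only the
no_validate_utf8 name comes from the suggestion above.

```python
class Indexer:
    """Toy stand-in for an indexer; hypothetical, not the real API."""

    def __init__(self):
        # Assumption: validation is on unless the caller opts out.
        self.validate_utf8 = True
        self.docs = []

    def no_validate_utf8(self):
        """Opt out of the check when the input is known to be clean."""
        self.validate_utf8 = False

    def add_field(self, raw: bytes) -> None:
        if self.validate_utf8:
            # A strict decode raises UnicodeDecodeError on malformed
            # UTF-8, so bad input is rejected before it is stored.
            raw.decode("utf-8")
        self.docs.append(raw)
```

With the default setting a bad sequence fails at add time instead of
surfacing after a commit; after no_validate_utf8() the bytes are stored
unchecked and the caller takes responsibility for them.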

Cheers
Grant
