On Wed, Nov 2, 2011 at 11:29 AM, Marvin Humphrey <[email protected]> wrote:
> What do you mean by "broken source index"? Corrupt because bad UTF-8 snuck
> in, and now it refuses to be read?
>
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.

This seems like good practice. I don't know the exact routine, but the
performance impact has to be minimal. If the string is already in processor
cache, a single pass through it will be almost free, and I can't believe this
step is CPU limited. If you want, you could make it Safe by default and Risky
by explicit option, but you might test first to be sure you even need the
option.

--nate

ps. I came across this possibly relevant discussion of a Perl 'feature' I
wasn't aware of:
http://jeremy.zawodny.com/blog/archives/010546.html
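
A minimal sketch of the safe-by-default idea, in Python for illustration only
and not the project's actual code. The names `check_utf8`, `add_field`, and the
`validate` flag are hypothetical; the only real API used is the standard strict
UTF-8 decoder, which makes one pass over the bytes and rejects malformed
sequences (including overlong encodings).

```python
def check_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8, False otherwise."""
    try:
        # Strict decoding scans the buffer once and raises on the first
        # malformed sequence.
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def add_field(doc: dict, name: str, raw: bytes, validate: bool = True) -> None:
    """Hypothetical indexing hook: safe by default, risky by explicit option."""
    if validate and not check_utf8(raw):
        # Refuse the field before it can reach a commit.
        raise ValueError(f"field {name!r} contains invalid UTF-8")
    doc[name] = raw
```

A caller who trusts the input could pass `validate=False` to skip the scan, but
timing both paths first would show whether that option is worth having.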
