Re: [lucy-user] Index state during merges

Nathan Kurz Wed, 02 Nov 2011 12:00:22 -0700

On Wed, Nov 2, 2011 at 11:29 AM, Marvin Humphrey <[email protected]> wrote:
> What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
> in, and now it refuses to be read?
>
> Maybe we should consider scanning incoming fields for UTF-8 sanity after all.
> I don't like making everybody pay this penalty -- small though it is --
> because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> On the other hand, I don't like that once a single bad UTF-8 sequence makes it
> through a commit, the index is irretrievably corrupt -- and you only discover
> that after the damage is done.


This seems like good practice.  I don't know the exact routine, but
the performance impact has to be minimal.  If the string is already in
processor cache, a single pass through it will be almost free, and I
can't believe this step is CPU limited.  If you want, you could make it
Safe by default and Risky by explicit option, but you might test first
to be sure you even need the option.
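
To make the Safe-by-default idea concrete, here is a rough sketch in
Python rather than anything from Lucy itself.  The names is_valid_utf8,
check_fields, and the validate flag are made up for illustration and
are not part of Lucy's API; the check leans on Python's strict UTF-8
decoder, which makes one pass over the bytes.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8.

    Uses Python's strict decoder, which scans the bytes once and rejects
    stray continuation bytes, overlong forms, and encoded surrogates.
    """
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def check_fields(fields: dict, validate: bool = True) -> None:
    """Hypothetical pre-commit check (not Lucy's API).

    Safe by default: every field value is scanned before it is accepted.
    Risky by explicit option: validate=False skips the scan entirely.
    """
    if not validate:
        return  # caller opted out and accepts the risk of a corrupt index
    for name, value in fields.items():
        if not is_valid_utf8(value):
            raise ValueError(f"field {name!r} contains invalid UTF-8")
```

With something like this, a bad sequence is rejected before it reaches
a commit, and timing check_fields with validate on and off would show
whether the Risky option is worth offering at all.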

--nate

ps.  I came across this possibly relevant discussion of a Perl
'feature' I wasn't aware of:
http://jeremy.zawodny.com/blog/archives/010546.html
