On Wed, Nov 02, 2011 at 11:59:34AM -0700, Nathan Kurz wrote:
> On Wed, Nov 2, 2011 at 11:29 AM, Marvin Humphrey <[email protected]> wrote:
> > What do you mean by "broken source index"?  Corrupt because bad UTF-8 snuck
> > in, and now it refuses to be read?
> >
> > Maybe we should consider scanning incoming fields for UTF-8 sanity after
> > all.  I don't like making everybody pay this penalty -- small though it is --
> > because you'll only get bad UTF-8 if your indexing setup is broken somehow.
> > On the other hand, I don't like that once a single bad UTF-8 sequence makes
> > it through a commit, the index is irretrievably corrupt -- and you only
> > discover that after the damage is done.
>
> This seems like good practice.  I don't know the exact routine, but
> the performance impact has to be minimal.

It turns out that UTF-8 validity checking has been enabled after all -- for
several years now. :P

For the record, I benchmarked disabling it and got a speedup of about half a
percent on the indexing benchmark.  That's pretty dang small, especially since
the indexing benchmarker uses an unrealistically simple Analyzer.

> ps. I came across this possibly relevant discussion of a Perl
> 'feature' I wasn't aware of:
> http://jeremy.zawodny.com/blog/archives/010546.html

The patch to disable the sanity checking, pasted below my sig, involves
changing a method call from "Assign_Str" (which performs a validity check) to
"Assign_Trusted_Str" (which trusts that the string is valid and skips the
check).  I deliberately gave the unsafe method a more cumbersome and
unambiguous name so that anybody invoking the "wrong" method would make their
error in the "safe" direction -- think of it as "fail-safe" interface design
applied to method naming.

The primary influence on this design was the negative example set by Perl's
lousy UTF-8 input interface, as detailed in that Jeremy Zawodny blog post
(which I've read before).  I wanted to do the opposite of this:

    # Short, obvious name is unsafe -- no sanity checking.
    open( my $fh, '<:utf8', $path ) or die $!;

    # Long, obscure incantation is safe -- sanity checking is enabled.
    open( my $fh, '<:encoding(UTF-8)', $path ) or die $!;

Marvin Humphrey

Index: ../perl/xs/Lucy/Index/Inverter.c
===================================================================
--- ../perl/xs/Lucy/Index/Inverter.c    (revision 1196798)
+++ ../perl/xs/Lucy/Index/Inverter.c    (working copy)
@@ -102,7 +102,7 @@
             char *val_ptr = SvPVutf8(value_sv, val_len);
             lucy_ViewCharBuf *value
                 = (lucy_ViewCharBuf*)inv_entry->value;
-            Lucy_ViewCB_Assign_Str(value, val_ptr, val_len);
+            Lucy_ViewCB_Assign_Trusted_Str(value, val_ptr, val_len);
             break;
         }
         case lucy_FType_BLOB: {
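
For anyone who wants to see the "short name is safe, long name is unsafe" idea outside of Lucy, here is a rough, self-contained sketch in C.  It is not Lucy's code: StrView, utf8_valid, view_assign_str and view_assign_trusted_str are hypothetical names made up for illustration, and reporting failure through a bool return value is an assumption of this sketch rather than a description of how Assign_Str behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a string view; NOT Lucy's ViewCharBuf. */
typedef struct {
    const char *ptr;
    size_t      len;
} StrView;

/* Return true if the bytes are well-formed UTF-8.  Rejects overlong
 * forms, UTF-16 surrogates, and code points above U+10FFFF. */
static bool utf8_valid(const char *s, size_t len) {
    const unsigned char *p   = (const unsigned char *)s;
    const unsigned char *end = p + len;

    while (p < end) {
        unsigned char c = *p;
        size_t n;                       /* number of continuation bytes */

        if (c < 0x80) { p++; continue; }            /* ASCII */
        else if (c >= 0xC2 && c <= 0xDF) { n = 1; }
        else if (c >= 0xE0 && c <= 0xEF) { n = 2; }
        else if (c >= 0xF0 && c <= 0xF4) { n = 3; }
        else { return false; }          /* stray continuation or bad lead */

        if ((size_t)(end - p) <= n) { return false; }   /* truncated */

        /* The first continuation byte is range-restricted for some leads. */
        unsigned char lo = 0x80, hi = 0xBF;
        if      (c == 0xE0) { lo = 0xA0; }   /* no overlong 3-byte forms */
        else if (c == 0xED) { hi = 0x9F; }   /* no surrogates */
        else if (c == 0xF0) { lo = 0x90; }   /* no overlong 4-byte forms */
        else if (c == 0xF4) { hi = 0x8F; }   /* nothing above U+10FFFF */
        if (p[1] < lo || p[1] > hi) { return false; }

        for (size_t i = 2; i <= n; i++) {
            if ((p[i] & 0xC0) != 0x80) { return false; }
        }
        p += n + 1;
    }
    return true;
}

/* Short, obvious name: validates before assigning.  Leaves the view
 * untouched and returns false if the input is not valid UTF-8. */
static bool view_assign_str(StrView *v, const char *s, size_t len) {
    assert(v != NULL);
    if (!utf8_valid(s, len)) { return false; }
    v->ptr = s;
    v->len = len;
    return true;
}

/* Long, explicit name: skips the check.  The caller vouches for the data. */
static void view_assign_trusted_str(StrView *v, const char *s, size_t len) {
    assert(v != NULL);
    v->ptr = s;
    v->len = len;
}
```

With this layout, a caller who types the shorter name out of habit gets the validity check; skipping it requires spelling out "trusted" at the call site.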
