Hi Jian,

I agree with you about Microsoft. It's a standard ploy to put window dressing on things to combat competition, in this case from the OpenDocument standard.
So the UTF-8 concern is interoperability with other programs at the index level. An interesting question here is whether the Lucene index format should be considered an API for this purpose. Most software systems choose not to publish their internal representations, due to upward compatibility and other concerns, and provide an API instead; this is true of databases, for example. Lucene does publish the index formats, so perhaps they are supposed to be a full API. If so, it seems to me not to be difficult to use in other languages a few lines of code analogous to what Lucene uses in IndexInput.readChars() and IndexOutput.writeChars() (sketched below).

The main Lucene code tree is in Java, which uses the modified UTF-8 encoding, and compatibility with that seems most important to me. One benefit of using Java's modified encoding is being able to determine the number of encoded characters from the length of a String prior to encoding it. Without this, writeString() would need to write the string length after writing the encoded characters, which would involve two extra seeks and could slow things down considerably, unless there is a different approach that avoids this.

For lazy fields, there would be a substantial benefit to having the count on a String be an encoded byte count rather than a Java char count, but this has the same problem. If there is a way to beat this problem, then I'd start arguing for a byte count.
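For concreteness, here is roughly the few lines I mean, modeled on what IndexOutput.writeChars() does. This is only a sketch (the class and method names are mine, not Lucene's), but the byte-level logic is the three-branch scheme for Java's modified UTF-8, in which U+0000 gets the two-byte form 0xC0 0x80 and each surrogate half is written as an ordinary three-byte sequence:

    import java.io.ByteArrayOutputStream;

    public class ModifiedUtf8Sketch {

        // Encode a String the way Java's modified UTF-8 does: one to three
        // bytes per Java char, U+0000 as 0xC0 0x80, and surrogate halves as
        // separate three-byte sequences (no four-byte forms).
        public static byte[] encode(String s) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < s.length(); i++) {
                int code = s.charAt(i);
                if (code >= 0x01 && code <= 0x7F) {   // ASCII except NUL: 1 byte
                    out.write(code);
                } else if (code <= 0x7FF) {           // includes code == 0: 2 bytes
                    out.write(0xC0 | (code >> 6));
                    out.write(0x80 | (code & 0x3F));
                } else {                              // 0x800-0xFFFF, incl. surrogates: 3 bytes
                    out.write(0xE0 | (code >> 12));
                    out.write(0x80 | ((code >> 6) & 0x3F));
                    out.write(0x80 | (code & 0x3F));
                }
            }
            return out.toByteArray();
        }
    }

And on beating the byte-count problem: since each char encodes to a fixed one, two, or three bytes under this scheme, the encoded length could be computed in an extra pass over the characters and written before the bytes, trading the two seeks for a second scan of the string. Again just a sketch with my own names; it would sit alongside encode() above:

        // Compute the encoded byte length without encoding, so writeString()
        // could emit a byte count up front instead of seeking back to patch it.
        public static int encodedLength(String s) {
            int bytes = 0;
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                if (c >= 0x01 && c <= 0x7F) bytes += 1;
                else if (c <= 0x7FF) bytes += 2;  // includes c == 0
                else bytes += 3;
            }
            return bytes;
        }

Whether the extra scan is actually cheaper than the seeks would need measuring, but it at least keeps the writes sequential.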
Chuck

jian chen wrote on 05/01/2006 06:23 PM:
> Plus, as open source and open standard advocates, we don't want to be
> like Micros$ft, who claims to use industry "standard" XML as the next
> generation Word file format. However, it is very hard to write your own
> Word reader, because their Word file format is proprietary and hard to
> write programs for.
>
> Jian
>
> On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote:
>>
>> Hi, Chuck,
>>
>> Using standard UTF-8 is very important for the Lucene index, so that
>> any program could read it easily, be it written in Perl, C/C++, or any
>> new future programming language.
>>
>> It is like storing data in a database for a web application. You want
>> to store the data in such a way that programs other than the web app
>> can manipulate it easily, because there will be cases where you want
>> to mass update or mass change the data, and you don't want to have to
>> write web apps for doing it, right?
>>
>> Cheers,
>>
>> Jian
>>
>> On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
>> >
>> > Could someone summarize succinctly why it is considered a major issue
>> > that Lucene uses the Java modified UTF-8 encoding within its index
>> > rather than the standard UTF-8 encoding? Is the only concern
>> > compatibility with index formats in other Lucene variants? The API to
>> > the values is a String, which uses Java's char representation, so I'm
>> > confused why the encoding in the index is so important.
>> >
>> > One possible benefit of a standard UTF-8 index encoding would be
>> > streaming content into and out of the index with no copying or
>> > conversions. This relates to the lazy field loading mechanism.
>> >
>> > Thanks for any clarification,
>> >
>> > Chuck
>> >
>> > jian chen wrote on 05/01/2006 04:24 PM:
>> > > Hi, Marvin,
>> > >
>> > > Thanks for your quick response. I am in the camp of fearless
>> > > refactoring, even at the expense of breaking compatibility with
>> > > previous releases. ;-)
>> > >
>> > > Compatibility aside, I am trying to identify whether changing the
>> > > implementation of Term is the right way to go for this problem.
>> > >
>> > > If it is, I think it would be worthwhile, rather than putting a
>> > > band-aid on the existing API.
>> > >
>> > > Cheers,
>> > >
>> > > Jian
>> > >
>> > >> Changing the implementation of Term would have a very broad
>> > >> impact; I'd look for other ways to go about it first. But I'm not
>> > >> an expert on SegmentMerger, as KinoSearch doesn't use the same
>> > >> technique for merging.
>> > >>
>> > >> My plan was to first submit a patch that made the change to the
>> > >> file format but didn't touch SegmentMerger, then attack
>> > >> SegmentMerger and also see if other developers could suggest
>> > >> optimizations.
>> > >>
>> > >> However, I have an awful lot on my plate right now, and I
>> > >> basically get paid to do KinoSearch-related work, but not
>> > >> Lucene-related work. It's hard for me to break out the time to do
>> > >> the Java coding, especially since I don't have that much
>> > >> experience with Java and I'm slow. I'm not sure how soon I'll be
>> > >> able to get back to those bytecount patches.
>> > >>
>> > >> Marvin Humphrey
>> > >> Rectangular Research
>> > >> http://www.rectangular.com/