Re: storing term text internally as byte array and bytecount as prefix, etc.

Tatu Saloranta Tue, 02 May 2006 14:09:57 -0700

--- jian chen <[EMAIL PROTECTED]> wrote:

> Plus, as open source and open standard advocates, we don't want to be
> like Micros$ft, who claims to use industrial "standard" XML as the next
> generation word file format. However, it is very hard to write your own
> Word reader, because their word file format is proprietary and hard to
> write programs for.


Note, though, that Java's "modified UTF-8" IS the
standard on the Java platform (and there are valid reasons
for it to differ slightly from the canonical encoding),
so changing it in any way would make it less standard,
not more (within the context of the Java platform).
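
To make the difference concrete, here is a minimal sketch. It assumes
the JDK's DataOutputStream.writeUTF as a stand-in for modified UTF-8
(Lucene's own index writer is not shown), and the class and method
names are made up for illustration. The two encodings differ for
U+0000 and for characters outside the Basic Multilingual Plane.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
class ModifiedUtf8Demo {
    /** Encodes with writeUTF (modified UTF-8) and drops its 2-byte length prefix. */
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Encodes with the JDK's standard UTF-8 charset. */
    static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    /** Formats bytes as space-separated upper-case hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        // U+1D11E lies outside the BMP, so Java stores it as a surrogate pair.
        String clef = new String(Character.toChars(0x1D11E));

        System.out.println(hex(standardUtf8(nul)));   // 00
        System.out.println(hex(modifiedUtf8(nul)));   // C0 80
        System.out.println(hex(standardUtf8(clef)));  // F0 9D 84 9E
        System.out.println(hex(modifiedUtf8(clef)));  // ED A0 B4 ED B4 9E
    }
}
```

For all other characters in the Basic Multilingual Plane the two
encodings produce the same bytes.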

Second, unless I'm mistaken, there is nothing special
in the Java encoding that would make it a problem for,
say, a C/C++ implementation. I thought Perl had some
specific problems, since its UTF-8 support is more
hard-coded: whereas in C/C++ it is possible (and not very
difficult) to change the char<->UTF-8 serialization, it's
not quite as easy in Perl (at least not with any
reasonable efficiency).
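
As a rough illustration of how little code a hand-written
char<->UTF-8 serialization takes, here is a sketch in Java of an
encoder that turns UTF-16 chars into standard UTF-8 by combining
surrogate pairs. The class and method names are hypothetical, and
this is not how any existing Lucene port does it.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper; a sketch, not an existing Lucene or JDK API.
class Utf8Encoder {
    /** Encodes UTF-16 text as standard UTF-8, merging surrogate pairs. */
    static byte[] toStandardUtf8(CharSequence text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            int cp = c;
            // A high surrogate followed by a low surrogate is one code point.
            if (Character.isHighSurrogate(c) && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1))) {
                cp = Character.toCodePoint(c, text.charAt(++i));
            }
            if (cp < 0x80) {
                out.write(cp);                          // 1 byte
            } else if (cp < 0x800) {
                out.write(0xC0 | (cp >> 6));            // 2 bytes
                out.write(0x80 | (cp & 0x3F));
            } else if (cp < 0x10000) {
                // 3 bytes. Note: an unpaired surrogate also lands here;
                // a strict encoder would reject or replace it instead.
                out.write(0xE0 | (cp >> 12));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            } else {
                out.write(0xF0 | (cp >> 18));           // 4 bytes
                out.write(0x80 | ((cp >> 12) & 0x3F));
                out.write(0x80 | ((cp >> 6) & 0x3F));
                out.write(0x80 | (cp & 0x3F));
            }
        }
        return out.toByteArray();
    }
}
```

Producing the modified form instead would only mean skipping the
surrogate-pair merge and writing U+0000 as two bytes.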

I would actually be more interested in the other
performance aspects of avoiding String instantiation:
managing byte arrays directly, and/or using a canonicalizing
cache that maps byte[] directly to Strings, can bring
significant performance improvements when
serializing/deserializing tokens to/from disk, at the
likely expense of a bit more memory usage (a Term object
should probably hold a lazily instantiated String or
byte[], depending on how it was created).
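
A minimal sketch of that idea, assuming standard UTF-8 bytes and a
hypothetical LazyTermText class (this is not Lucene's Term, and the
names are invented for illustration). Whichever form the object was
created from is kept, the other is built on first use, and Strings
decoded from bytes are shared through a cache keyed on the byte
content.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical class; a sketch of lazy String/byte[] term text.
class LazyTermText {
    // Canonicalizing cache: byte content -> shared String instance.
    // Unbounded and not thread-safe; a real one would need both fixed.
    private static final Map<ByteBuffer, String> CACHE = new HashMap<>();

    private byte[] bytes;  // null until needed if created from a String
    private String text;   // null until needed if created from bytes

    private LazyTermText(byte[] bytes, String text) {
        this.bytes = bytes;
        this.text = text;
    }

    /** Created from bytes read off disk; no String is built yet. */
    static LazyTermText fromBytes(byte[] utf8) {
        return new LazyTermText(utf8.clone(), null);
    }

    /** Created from a String; no byte[] is built yet. */
    static LazyTermText fromString(String s) {
        return new LazyTermText(null, s);
    }

    /** Returns the UTF-8 bytes, encoding on first use. Callers must not modify them. */
    byte[] bytes() {
        if (bytes == null) {
            bytes = text.getBytes(StandardCharsets.UTF_8);
        }
        return bytes;
    }

    /** Returns the String, decoding on first use and reusing a cached instance. */
    String text() {
        if (text == null) {
            // ByteBuffer compares by content, so equal byte sequences share a key.
            text = CACHE.computeIfAbsent(ByteBuffer.wrap(bytes),
                    key -> new String(bytes, StandardCharsets.UTF_8));
        }
        return text;
    }
}
```

The extra memory shows up in two places: each object may end up
holding both forms, and the cache keeps every decoded String alive.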

-+ Tatu +-

> 
> Jian
> 
> On 5/1/06, jian chen <[EMAIL PROTECTED]> wrote:
> >
> > Hi, Chuck,
> >
> > Using standard UTF-8 is very important for Lucene index so any
> > program could read the Lucene index easily, be it written in perl,
> > c/c++ or any new future programming languages.
> >
> > It is like storing data in a database for web application. You want
> > to store it in such a way that other programs can manipulate easily
> > other than only the web app program. Because there will be cases that
> > you want to mass update or mass change the data, and you don't want
> > to write only web apps for doing it, right?
> >
> > Cheers,
> >
> > Jian
> >
> >
> >
> > On 5/1/06, Chuck Williams <[EMAIL PROTECTED]> wrote:
> > >
> > > Could someone summarize succinctly why it is considered a major
> > > issue that Lucene uses the Java modified UTF-8 encoding within its
> > > index rather than the standard UTF-8 encoding.  Is the only concern
> > > compatibility with index formats in other Lucene variants?  The API
> > > to the values is a String, which uses Java's char representation,
> > > so I'm confused why the encoding in the index is so important.
> > >
> > > One possible benefit of a standard UTF-8 index encoding would be
> > > streaming content into and out of the index with no copying or
> > > conversions.  This relates to the lazy field loading mechanism.
> > >
> > > Thanks for any clarification,
> > >
> > > Chuck
> > >
> > >
> > > jian chen wrote on 05/01/2006 04:24 PM:
> > > > Hi, Marvin,
> > > >
> > > > Thanks for your quick response. I am in the camp of fearless
> > > > refactoring, even at the expense of breaking compatibility with
> > > > previous releases. ;-)
> > > >
> > > > Compatibility aside, I am trying to identify if changing the
> > > > implementation of Term is the right way to go for this problem.
> > > >
> > > > If it is, I think it would be worthwhile rather than putting
> > > > band-aid on the existing API.
> > > >
> > > > Cheers,
> > > >
> > > > Jian
> > > >
> > > > >> Changing the implementation of Term would have a very broad
> > > > >> impact; I'd look for other ways to go about it first.  But I'm
> > > > >> not an expert on SegmentMerger, as KinoSearch doesn't use the
> > > > >> same technique for merging.
> > > >>
> > > > >> My plan was to first submit a patch that made the change to the
> > > > >> file format but didn't touch SegmentMerger, then attack
> > > > >> SegmentMerger and also see if other developers could suggest
> > > > >> optimizations.
> > > >>
> > > > >> However, I have an awful lot on my plate right now, and I
> > > > >> basically get paid to do KinoSearch-related work, but not
> > > > >> Lucene-related work.  It's hard for me to break out the time to
> > > > >> do the java coding, especially since I don't have that much
> > > > >> experience with java and I'm slow.  I'm not sure how soon I'll be
> > > > >> able to get back to those bytecount patches.
> > > >>
> > > >> Marvin Humphrey
> > > >> Rectangular Research
> > > >> http://www.rectangular.com/
> > > >>
> > > >
> > >
> > >
> > >
> > > ---------------------------------------------------------------------
> > > To unsubscribe, e-mail: [EMAIL PROTECTED]
> > > For additional commands, e-mail: [EMAIL PROTECTED]
> > >
> > >
> >
> 

