Re: Term pollution from binary data

Nicolas Lalevée Fri, 09 Nov 2007 05:31:23 -0800

On Thursday 8 November 2007, Michael McCandless wrote:
> "Doug Cutting" <[EMAIL PROTECTED]> wrote:
> > Aren't indexes loaded lazily?  That's an important optimization for
> > merging, no?  For performance reasons, opening an IndexReader shouldn't
> > do much more than open files.  However, if we build a more generic
> > mechanism, we should not rely on that.
>
> Woops, you are right!  So in this case we could wait until after ctor
> to set the property.  I will take that approach for this, then, so we
> can decouple it from the "generic properties" discussion.  I think,
> also, I will throw an IllegalStateException if you try to set this
> after the index was already loaded.
>
> For other things, eg the DeletionPolicy instance & lock timeout for
> IndexWriter, and infoStream for both IndexWriter & IndexReader, we
> need to use them in the ctor but we don't want to explode the number
> of ctors.  Eg we now have setDefaultLockTimeout/setDefaultInfoStream
> which we could deprecate if we can set this in generic properties
> instead.
>
> > > What if, instead, we passed down a Properties instance to IndexReader
> > > ctors?  Or alternatively a dedicated class, eg,
> > > "IndexReaderInitParameters"?  The advantage of a dedicated class is
> > > it's strongly typed at compile time, and, you could put things in
> > > there like an optional DeletionPolicy instance as well.  I think there
> > > is a growing list of these sorts of "advanced optional parameters
> > > used during init" that could be handled with such an approach?
> >
> > (I probably should have read your entire message before starting to
> > respond...  But it's nice to see that we think alike!)
>
> That is nice!
>
> > This is similar to my (2) approach, but attempts to solve the typing
> > issue, although I'm not sure how...
> >
> > The way we handle it in Hadoop is to pass around a <String,String> map
> > in the abstract kernel, then have concrete implementation classes
> > provide static methods that access it.  So this might look something
> > like:
> >
> > public class LuceneProperties extends Properties {
> >    // utility methods to handle conversion of values to and from Strings
> >    void setInt(String prop, int value);
> >    int getInt(String prop);
> >    void setClass(String prop, Class value);
> >    Class getClass(String prop);
> >    Object newInstance(String prop);
> >    ...
> > }
> >
> > public class SegmentReaderProperties {
> >    private static final String DIVISOR_PROP =
> >      "org.apache.lucene.index.SegmentReader.divisor";
> >    public static void setTermIndexDivisor(LuceneProperties props, int i) {
> >      props.setInt(DIVISOR_PROP, i);
> >    }
> > }
> >
> > Then the IndexReader constructor methods could accept a
> > LuceneProperties.  No point in making this IndexReader specific, since
> > it might be useful for, e.g., IndexWriter, Searchers, Directories, etc.
> >
> > An advantage of a <String,String> map over a <String,Object> map for
> > Hadoop is that it's trivial to serialize.
> >
> > Is this what you had in mind?
>
> I like that approach!  I think I'd prefer <String,Object> so we could
> put InfoStream, DeletionPolicy and other class instances in there?
> (Without requiring that they have zero-arg ctors).  Unless there would
> be some reason for Lucene to also need serialization?
>
> (Actually, for infoStream I think eventually we should switch to a
> logging framework).
>
> Hmmm, one wrinkle: when would we "look at" a property?  I guess it's
> per-property.  EG infoStream we could "look at" every time we needed
> to print something to it.  But eg say we have "deletionPolicy" in
> there, and you suddenly change it in your properties, then, when are
> we supposed to notice that and re-init it?  That is a downside vs
> putting set/get on the class directly because with set/get the class
> obviously knows when the property is being changed.
>
> OK, I'm no longer sure this is [yet] necessary for Lucene!  What
> "properties" would we actually want to put here and NOT in the ctors
> or set/gets on the class itself?  It feels like a vanishing set.


And from my point of view as a heavy user of the Lucene API, I generally do not
like generic property settings because they leave the API undocumented. The
Javadoc around the setter and the getter of the property is only as useful as:
/**
 * Set a property
 *
 * @param prop the property to set
 * @param value the value to bind to the property
 */
public void setProperty(String prop, Object value)

Then you get quite lost, because you cannot get the exhaustive list of the
properties you can set.
Maybe you, the Lucene developers, can ensure today that the Javadoc around this
setter is exhaustive enough to be usable. But tomorrow, a developer
adding a new property must not forget to update the documentation of the
generic setter. Even if I think that Lucene developers are a lot more
careful than those in some other open source projects, nobody is perfect ;)
And having some fields in a Java class is not that much harder to maintain, I
think.
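
To make the contrast concrete, here is a minimal sketch. It is not taken from
Lucene or from any proposal in this thread: the class names GenericSettings,
ReaderSettings and SettingsDemo are made up for illustration, and the
"divisor must be >= 1" check is only an assumption. The point is that a typed
setter carries its own documentation and validation, while the generic one
accepts a misspelled key or a wrong value type without complaint.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical generic holder: the API does not say which keys exist. */
class GenericSettings {
    private final Map<String, Object> values = new HashMap<>();

    /** Set a property. (This is all the Javadoc can really say.) */
    void setProperty(String prop, Object value) {
        values.put(prop, value);
    }

    Object getProperty(String prop) {
        return values.get(prop);
    }
}

/** Hypothetical typed holder: each option is a documented, checked field. */
class ReaderSettings {
    private int termIndexDivisor = 1;

    /**
     * Sets the term index divisor.
     *
     * @param divisor assumed here to be 1 or greater
     */
    void setTermIndexDivisor(int divisor) {
        if (divisor < 1) {
            throw new IllegalArgumentException("divisor must be >= 1");
        }
        this.termIndexDivisor = divisor;
    }

    int getTermIndexDivisor() {
        return termIndexDivisor;
    }
}

class SettingsDemo {
    public static void main(String[] args) {
        GenericSettings generic = new GenericSettings();
        // Misspelled key and wrong value type: both compile and run.
        generic.setProperty("termIndexDivsor", "four");
        // The correctly spelled key was never set, so this prints null.
        System.out.println(generic.getProperty("termIndexDivisor"));

        ReaderSettings typed = new ReaderSettings();
        typed.setTermIndexDivisor(4); // a typo here would not compile
        System.out.println(typed.getTermIndexDivisor());
    }
}
```

With the typed class, the list of available options is simply the list of
setters, so the Javadoc cannot drift out of sync with the code.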

I do not know much about Hadoop, but such an interface might be interesting
there because the configuration is sent to different remote servers. So there
should be a generic class to avoid duplicating the serialization code. I don't
think Lucene should do that kind of thing.
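
As a rough illustration of why a <String,String> map is convenient when the
configuration has to travel, here is a small sketch using only the standard
java.util.Properties class. It does not show how Hadoop actually does it; the
class name StringConfigRoundTrip and its helper methods are hypothetical, and
the property key is the one from Doug's example above.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper: one generic serializer covers every string property. */
class StringConfigRoundTrip {

    /** Writes all properties as text, e.g. to send to a remote server. */
    static String serialize(Properties props) {
        StringWriter out = new StringWriter();
        try {
            props.store(out, null); // null means no extra comment header
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Rebuilds the properties from the text form on the receiving side. */
    static Properties deserialize(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("org.apache.lucene.index.SegmentReader.divisor", "4");

        Properties copy = deserialize(serialize(props));
        System.out.println(
            copy.getProperty("org.apache.lucene.index.SegmentReader.divisor"));
    }
}
```

Without such a transport requirement, the same generic code buys Lucene much
less.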

just my 2c.

Nicolas
