On Fri, Apr 10, 2009 at 09:52:01AM -0400, Michael McCandless wrote:
> > Schema schema = new Schema();
> > PolyAnalyzer analyzer = new PolyAnalyzer("en");
> > FullTextField fulltext = new FullTextField(analyzer);
> > StringField notIndexed = new StringField();
> > notIndexed.setIndexed(false);
> > schema.specField("title", fulltext);
> > schema.specField("content", fulltext);
> > schema.specField("url", notIndexed);
> > schema.specField("category", new StringField());
> So in this code, FullTextField is an instance (subclass?) of FieldSpec?
FieldSpec is an abstract base class. FullTextField is a subclass of FieldSpec,
as is StringField.
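To make the relationship concrete, here's a minimal sketch of what that hierarchy might look like. Only FieldSpec, FullTextField, StringField, Schema, specField(), and setIndexed() come from the example above; everything else (fetchSpec(), the boolean traits, using a String in place of an Analyzer instance) is an assumption for illustration:

```java
import java.util.HashMap;
import java.util.Map;

// Abstract base class holding the "extended type" of a field.
abstract class FieldSpec {
    private boolean indexed = true;
    private boolean stored  = true;
    public void setIndexed(boolean indexed) { this.indexed = indexed; }
    public boolean indexed() { return indexed; }
    public boolean stored()  { return stored; }
}

// Indexed as a single token, no analysis.
class StringField extends FieldSpec {
}

// Analyzed full-text field.  A String stands in for an Analyzer
// instance to keep the sketch self-contained.
class FullTextField extends FieldSpec {
    private final String analyzerName;
    FullTextField(String analyzerName) { this.analyzerName = analyzerName; }
    String analyzerName() { return analyzerName; }
}

// A Schema maps field names to shared FieldSpec instances.
class Schema {
    private final Map<String, FieldSpec> fields = new HashMap<>();
    void specField(String name, FieldSpec spec) { fields.put(name, spec); }
    FieldSpec fetchSpec(String name) { return fields.get(name); }
}
```

Note that "title" and "content" in the original example end up pointing at the very same FieldSpec instance, which is the sharing discussed below.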
> It seems like a FieldSpec represents the "extended type" of a field
> (ie, includes all juicy details about how it should be indexed
> (including an analyzer instance), stored, etc), and then you're free
> to have more than one field in your doc share that "extended type".
Yes.
> So things like "I intend to do range searching" and "I intend to sort"
> this field, and "I want to store term vectors, with positions but not
> offsets", etc., belong in FieldSpec?
Yes.
> Hmmm... actually I sort of talked about the beginnings of this, for
> Lucene, in the last paragraph here:
>
>
> https://issues.apache.org/jira/browse/LUCENE-1590?focusedCommentId=12696994&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12696994
Yes, that seems to be the same thing.
> Ie, maybe we should instantiate Field.Index and tweak its options
> (norms, omitTFAP, etc.) and that instance becomes the type of your
> field (at least wrt indexing).
That's sort of the idea, but Field.Index is pretty limited in its options.
FieldSpec does a lot more.
As we have seen, FieldSpec is responsible for associating Analyzers with
fulltext fields. In Lucene, you have to do that via IndexWriter, QueryParser,
PerFieldAnalyzerWrapper, and probably a few others I've forgotten.
FieldSpec is similarly responsible for associating Similarity instances and
posting formats with field names (as appropriate). Looking forward, sort
comparators also belong in FieldSpec. And so on.
> This is also sort of like the crazy static types one can create with
> generics, ie, a "type" used to be something nice simple (int, float,
> your own class, etc.) but now can be a rich object (instance) in
> itself.
The FieldSpec approach is actually quite similar to the "flyweight" pattern.
From <http://en.wikipedia.org/wiki/Flyweight_pattern>:

    A classic example usage of the flyweight pattern is the data structures
    for graphical representation of characters in a word processor. It would
    be nice to have, for each character in a document, a glyph object
    containing its font outline, font metrics, and other formatting data,
    but it would amount to hundreds or thousands of bytes for each
    character. Instead, for every character there might be a reference to a
    flyweight glyph object shared by every instance of the same character in
    the document; only the position of each character (in the document
    and/or the page) would need to be stored externally.
In Lucy, Docs will be hash-based (rather than array-based as in Lucene) -- so
each field value will be associated with a single field name. When the
document is submitted for indexing, we use the field name to associate the
value with a FieldSpec object. Making that association attaches a bunch of
traits and behaviors to the value: whether it should be indexed, how it should
sort, whether it should be stored and how it should be encoded when it is
stored, etc.
So, the difference is that in Lucene, every field value is an object with an
arbitrary set of traits and behaviors, while in Lucy, values for a given field
will have a uniform type.
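A hedged sketch of that index-time association, under the assumptions that a hash-based doc is just name => value pairs and that all traits come from a schema lookup rather than from the value objects themselves (class and method names here are invented for illustration):

```java
import java.util.HashMap;
import java.util.Map;

// Flyweight stand-in: one Spec instance per field *type*, shared by
// every value of every field that uses it.
class Spec {
    final boolean indexed;
    Spec(boolean indexed) { this.indexed = indexed; }
}

class Indexer {
    private final Map<String, Spec> schema = new HashMap<>();

    void specField(String name, Spec spec) { schema.put(name, spec); }

    // A hash-based doc carries only name => value pairs.  The field
    // name keys the lookup; the value itself carries no behavior.
    int countIndexedValues(Map<String, String> doc) {
        int n = 0;
        for (Map.Entry<String, String> entry : doc.entrySet()) {
            Spec spec = schema.get(entry.getKey());
            if (spec != null && spec.indexed) n++;
        }
        return n;
    }
}
```

The point of the sketch: the per-value storage cost is just the string, exactly as the flyweight glyph stores only its position.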
> [A class and an instance really should not be different,
> anyway (prototype languages like Self don't differentiate).]
Haven't used Self, but I've done plenty of JavaScript programming, so I think
I can comment.
In general, I don't think there's a way to implement the "objects are classes"
model without making every object gigantic. I mean, you're not so much
merging the "class" and "instance" concepts as eliminating all class data
and shoving everything down into the object. But sharing class
data is highly efficient in many, many situations. Why make every character
in a word processing document a gigantic object?
With regards to fields and field values in Lucene and Lucy: Allowing
individual field values to define their own behaviors is insane. There are
many high level objects which must act on groups of values. The "freedom"
that fields have to "morph" isn't free, because the high level objects can no
longer know so much about the values, and thus must interact with those values
in more indirect and inefficient ways.
> Now that I understand FieldSpec (I think!), I think allowing sharing
> is fine. Different variables in my source code can share the same
> type...
Just how we structure the data in the serialized schema is a bit of a trick.
> > BTW, in KS svn trunk, Schemas are now fully serialized and written
> > to the index as "schema_NNN.json". Including Analyzers. :)
>
> How do you serialize Analyzers again?
Dump them to a JSON-izable data structure. Include the class name so that you
can pick a deserialization routine at load time.
Here's a PolyAnalyzer example with three sub-analyzers:
    {
        "_class" : "KinoSearch::Analysis::PolyAnalyzer",
        "analyzers" : [
            {
                "_class" : "KinoSearch::Analysis::CaseFolder"
            },
            {
                "_class" : "KinoSearch::Analysis::Tokenizer",
                "pattern" : "\\w+(?:['\\x{2019}]\\w+)*"
            },
            {
                "_class" : "KinoSearch::Analysis::Stemmer",
                "language" : "en"
            }
        ]
    }
Stopalizers take up more space because they require serialization of the
stoplist.
In the current KS implementation, Analyzers are required to implement custom
Dump() and Load() methods; Dump() creates a JSON-izable data structure, while
Load() creates a new object based on the contents of the dump.
    Analyzer clone = analyzer.load(analyzer.dump());
In the simplest case, a custom Analyzer subclass can implement a no-argument
constructor and call that from Load().
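Roughly like this, as a sketch: dump() emits a JSON-izable map including the "_class" key, and load() dispatches on "_class" to pick the deserialization routine. The "_class" string mirrors the JSON example above; the method bodies and the Map-based representation are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

abstract class Analyzer {
    // Produce a JSON-izable data structure describing this object.
    abstract Map<String, Object> dump();

    // Dispatch on "_class" to rebuild the right subclass.
    static Analyzer load(Map<String, Object> dump) {
        String klass = (String) dump.get("_class");
        if (klass.equals("KinoSearch::Analysis::Stemmer")) {
            return new Stemmer((String) dump.get("language"));
        }
        throw new IllegalArgumentException("Unknown class: " + klass);
    }
}

class Stemmer extends Analyzer {
    final String language;
    Stemmer(String language) { this.language = language; }

    Map<String, Object> dump() {
        Map<String, Object> d = new HashMap<>();
        d.put("_class", "KinoSearch::Analysis::Stemmer");
        d.put("language", language);
        return d;
    }
}
```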
Marvin Humphrey