Re: Types and Schemas (was "Sort cache file format")

Michael McCandless Sun, 12 Apr 2009 12:16:48 -0700

On Sun, Apr 12, 2009 at 9:08 AM, Marvin Humphrey <[email protected]> wrote:


>> Does FieldSpec subdivide the options?  Eg options about indexing
>> could live in its own class, with commonly used constants like "NO".
>>
>> This was the motivation of that comment in Lucene (the fact that we
>> don't subdivide means suddenly stored-only fields have to figure out
>> what to do with omitNorms, omitTFAP booleans; if we had Field.Index.NO
>> that'd be better).
>
> Right now, FieldSpec doesn't subdivide, but it's not a least common
> denominator, either.  To illustrate: FieldSpec has boolean members for
> "indexed", "stored", and "sortable", but knows nothing about Analyzers.
> Analyzers are the exclusive province of the FullTextField subclass.

OK.
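
To check my understanding, here is a rough sketch of that split, in
Python purely for illustration.  Only the names FieldSpec and
FullTextField and the three booleans come from your description; the
constructor signatures, defaults, and the whitespace analyzer are
assumptions of mine, not Lucy's actual API.

```python
class FieldSpec:
    """Base type: carries only the options every field shares."""

    def __init__(self, indexed=True, stored=True, sortable=False):
        # Default values are assumed for the example.
        self.indexed = indexed
        self.stored = stored
        self.sortable = sortable


class FullTextField(FieldSpec):
    """Only the full-text subclass knows anything about analyzers."""

    def __init__(self, analyzer, **options):
        super().__init__(**options)
        # `analyzer` is a hypothetical callable: str -> list of tokens.
        self.analyzer = analyzer


def whitespace_analyzer(text):
    # Stand-in analyzer for the example; not a real Lucy/Lucene analyzer.
    return text.lower().split()


title = FullTextField(whitespace_analyzer, stored=True)
doc_id = FieldSpec(indexed=True, stored=True, sortable=True)
```

Here doc_id has no analyzer attribute at all, so the base class stays
simple and never needs to know what its subclasses add.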

> If you don't permit automatic merging of field types, then there isn't a need
> for FieldSpec to know everything about all its subclasses.

I think Lucene could continue to merge yet isolate information
(subdivision, subclassing).  At least I sure hope so :)
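
Something like the following is what I have in mind by subdivision.
It is a sketch only: IndexOptions, FieldOptions and INDEX_NO are
hypothetical names, loosely modeled on the Field.Index.NO idea, and do
not exist in Lucene or Lucy.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class IndexOptions:
    """Options that only make sense for an indexed field."""
    omit_norms: bool = False
    omit_tfap: bool = False  # omit term frequencies and positions


# Hypothetical analogue of a "Field.Index.NO" constant: not indexed at all.
INDEX_NO: Optional[IndexOptions] = None


@dataclass(frozen=True)
class FieldOptions:
    """Top-level options; the indexing options live in their own object."""
    stored: bool = True
    index: Optional[IndexOptions] = field(default_factory=IndexOptions)


# A stored-only field never has to decide what omit_norms means for it.
stored_only = FieldOptions(stored=True, index=INDEX_NO)
indexed_no_norms = FieldOptions(stored=False, index=IndexOptions(omit_norms=True))
```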

> I see why subdividing options might be useful in Lucene, but I'm not
> sure it's necessary for Lucy.

It's all still hazy to me :) Hopefully once we talk about it enough
I'll get some clarity... it is sort of scary that we're inventing a
type system.

EG there are many things the FieldType should somehow tell us:

  * How does FieldSpec model "multi-valued" fields?  Is there a
    boolean in the base class?

  * Must not be null -- base class?

  * "Has only one token" -- I guess this is implied by the class (ie
    only FullTextType may have > 1 token)

  * Open vs closed (known set of values) enums

  * Sortable

  * nulls sort on top or bottom

  * Omit norms, omit TFAP

  * Binary or not (I guess BlobType <-> binary)

  * Term vectors or not, positions, offsets

  * Stored or not -- toplevel?

  * CSF'd or not

  * ValueSource is XYZ for this field

  * I will use RangeFilter on this field

  * Analyzer to use (exposed only by FullTextType)

  * Extensibility -- so app can enroll new attrs / make new type
    subclasses

> I think it's better OO design for the parent class to be simple rather than
> comprehensive.
>
>> Well, in Lucene we could better decouple a Field's value from its
>> "extended type".  The type would still be attached to the Field's
>> value (not to the global schema as in KS), but strongly decoupled &
>> shared across Field instances.
>
> That makes sense.  The "extended type" class could look almost identical, but
> in Lucene the user would make the connection directly, while in Lucy it
> would be made indirectly via the field name.

Right.

>> > Dump them to a JSON-izable data structure.  Include the class name so that
>> > you can pick a deserialization routine at load time.
>>
>> You rely on the same namespace -> obj mapping being present at
>> deserialize time?  Ie it's the caller's responsibility to import the
>> same modules, ensure the names "map" to the same objs (or at least
>> compatible ones) as were used during serialization, etc.
>
> If the user has implemented custom subclasses, then yes, the subclasses must
> be loaded or you'll get a "class not found" error.

OK just like unpickling in Python...
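
For my own benefit, a minimal Python sketch of the dump/load scheme as
I understand it.  The registry, the "_class" key, the register
decorator and the RegexTokenizer class are all invented for the
example; I am not claiming this is how Lucy lays it out.

```python
import json

# Hypothetical registry: class name -> class.  A class only appears here
# once the module defining it has been loaded.
REGISTRY = {}


def register(cls):
    """Class decorator that enrolls a class under its own name."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class RegexTokenizer:
    def __init__(self, pattern):
        self.pattern = pattern

    def dump(self):
        # JSON-izable structure that records the class name.
        return {"_class": type(self).__name__, "pattern": self.pattern}

    @classmethod
    def load(cls, data):
        return cls(data["pattern"])


def load_obj(data):
    """Pick the deserialization routine based on the recorded class name."""
    name = data["_class"]
    try:
        cls = REGISTRY[name]
    except KeyError:
        raise LookupError(f"class not found: {name}") from None
    return cls.load(data)


serialized = json.dumps(RegexTokenizer(r"\w+").dump())
restored = load_obj(json.loads(serialized))
```

If the module defining a custom subclass was never imported, its name
is missing from the registry and load_obj fails with the "class not
found" error, which is the same failure mode as unpickling.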

Remind me again: do custom subclasses get enrolled into the global
hash in Lucy's core?  I know you had said it's a thread risk, ie, not
read only...

>> Though, for core objects, you would use the global name -> vtable
>> mapping that Lucy core maintains?
>
> Yes.  Any core class would already be loaded.
>
>> (I still don't fully understand why Lucy needs that global hash -- this is
>> what namespaces are for).
>
> If we didn't implement it internally, we'd need to implement it in the
> bindings for e.g. looking up deserialization routines.  Furthermore, we need
> some mechanism for C-level subclassing, since that's not part of the C
> language.  No namespaces there.  :)

I'm still confused.  Say StandardAnalyzer is implemented in C; maybe
you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
namespaces you put prefixes in front).

Any time something in core wants to use that class, it refers to it by
name (and the C compiler/linker maps it), not via the global hash?

But for deserializing a core object, when the deserializer is
implemented in C, I agree you'd need a global lookup; basically
because you can't consult the OBJ's symbol table dynamically.  (If you
have a hosty deserializer, then it would "import lucy; lucy.XXX" to
find its classes).

(But it seems like that global hash should be readonly-able).
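
By "readonly-able" I mean roughly this, sketched in Python with the
standard types.MappingProxyType.  The key string and the empty
StandardAnalyzer class are placeholders; how Lucy's core would
actually freeze its hash is an open question.

```python
from types import MappingProxyType


class StandardAnalyzer:
    """Placeholder for a core class; the real one would live in C."""


# Core populates the table once, at startup...
_core_table = {"Lucy_Analysis_StandardAnalyzer": StandardAnalyzer}

# ...then hands out only a read-only view of it.
CORE_CLASSES = MappingProxyType(_core_table)


def lookup(name):
    """Name -> class lookup, as a C-level deserializer would need."""
    return CORE_CLASSES[name]
```

Writes through the view raise TypeError, so readers in other threads
never see the table change.  Whoever still holds the underlying dict
can mutate it, though, so enrolling custom subclasses would have to
happen before the view is published.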

>> OK, so if I've made a custom Tokenizer doing some funky Python code
>> instead of a regexp, I could simply implement dump/load to do the
>> right thing.
>
> Yes.
>
> BTW, I saw that Earwin Burrfoot calls his type class "FieldType".
>
> "FieldType" is probably a better name than "FieldSpec", as it implies
> subclasses with "Type" as a suffix: FullTextType, StringType, BlobType,
> Int32Type, etc.

Agreed.

Mike
