On Sun, Apr 12, 2009 at 9:08 AM, Marvin Humphrey <[email protected]> wrote:
>> Does FieldSpec subdivide the options? E.g. options about indexing
>> could live in its own class, with commonly used constants like "NO".
>>
>> This was the motivation of that comment in Lucene (the fact that we
>> don't subdivide means suddenly stored-only fields have to figure out
>> what to do with omitNorms, omitTFAP booleans; if we had Field.Index.NO
>> that'd be better).
>
> Right now, FieldSpec doesn't subdivide, but it's not a least common
> denominator, either. To illustrate: FieldSpec has boolean members for
> "indexed", "stored", and "sortable", but knows nothing about Analyzers.
> Analyzers are the exclusive province of the FullTextField subclass.
OK.
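
A rough sketch of that split, in Python for brevity (not Lucy's actual C
code). The constructor signatures and the whitespace_analyzer helper are
assumptions made up for illustration; only the three boolean members and
the FullTextField name come from the description above.

```python
class FieldSpec:
    """Base class: holds only the options shared by every field type."""

    def __init__(self, indexed=True, stored=True, sortable=False):
        self.indexed = indexed
        self.stored = stored
        self.sortable = sortable


class FullTextField(FieldSpec):
    """The only class in this sketch that knows about analyzers."""

    def __init__(self, analyzer, **options):
        super().__init__(**options)
        self.analyzer = analyzer


def whitespace_analyzer(text):
    # Hypothetical stand-in for a real Analyzer.
    return text.split()


body = FullTextField(analyzer=whitespace_analyzer)
stored_only = FieldSpec(indexed=False, stored=True)
```

A stored-only field built from the base class never sees an analyzer
attribute at all, which is the point of keeping the parent simple.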
> If you don't permit automatic merging of field types, then there isn't a need
> for FieldSpec to know everything about all its subclasses.
I think Lucene could continue to merge yet isolate information
(subdivision, subclassing). At least I sure hope so :)
> I see why subdividing options might be useful in Lucene, but I'm not
> sure it's necessary for Lucy.
It's all still hazy to me :) Hopefully once we talk about it enough
I'll get some clarity... it is sort of scary that we're inventing a
type system.
E.g. there are many things the FieldType should somehow tell us:
* How does FieldSpec model "multi-valued" fields? Is there a
  boolean in the base class?
* Must not be null -- base class?
* "Has only one token" -- I guess this is implied by the class (ie
  only FullTextType may have > 1 token)
* Open vs closed (known set of values) enums
* Sortable
* nulls sort on top or bottom
* Omit norms, omit TFAP
* Binary or not (I guess BlobType <-> binary)
* Term vectors or not, positions, offsets
* Stored or not -- toplevel?
* CSF'd or not
* ValueSource is XYZ for this field
* I will use RangeFilter on this field
* Analyzer to use (exposed only by FullTextType)
* Extensibility -- so app can enroll new attrs / make new type
  subclasses
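
To make the "subdivide" idea concrete, here is a minimal sketch, in
Python, of indexing options living in their own class with a shared NO
constant. All names here (IndexOptions, FieldType, the attribute names)
are hypothetical; this is neither Lucene's nor Lucy's API, and it covers
only a few items from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexOptions:
    """Indexing-only options, kept apart from storage options."""
    indexed: bool = True
    omit_norms: bool = False
    omit_tfap: bool = False  # omit term frequencies and positions


# Commonly used constant, attached to the class after it is defined.
IndexOptions.NO = IndexOptions(indexed=False)


@dataclass(frozen=True)
class FieldType:
    """Top-level options only; indexing details are delegated."""
    stored: bool = True
    sortable: bool = False
    multi_valued: bool = False
    index: IndexOptions = IndexOptions()


# A stored-only field says "not indexed" once and never has to decide
# what omit_norms or omit_tfap should mean for it.
stored_only = FieldType(stored=True, index=IndexOptions.NO)
```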
> I think it's better OO design for the parent class to be simple rather than
> comprehensive.
>
>> Well, in Lucene we could better decouple a Field's value from its
>> "extended type". The type would still be attached to the Field's
>> value (not to the global schema as in KS), but strongly decoupled &
>> shared across Field instances.
>
> That makes sense. The "extended type" class could look almost identical, but
> in Lucene the user would make the connection directly, while in Lucy it
> would be made indirectly via the field name.
Right.
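
A small sketch of the two ways of making that connection, in Python.
StringType, Field and the dict-based schema are assumptions for
illustration only, not real Lucene or Lucy classes.

```python
class StringType:
    """Hypothetical shared type object, decoupled from any field value."""
    sortable = True


title_type = StringType()


# Lucene-style: the user attaches the type directly to each Field,
# and many Field instances share one type object.
class Field:
    def __init__(self, name, value, field_type):
        self.name = name
        self.value = value
        self.type = field_type


f1 = Field("title", "Moby Dick", title_type)
f2 = Field("title", "Walden", title_type)

# Lucy-style: the connection is indirect, through the field name
# registered in a global schema.
schema = {"title": title_type}


def type_of(field_name):
    return schema[field_name]
```

In both cases the type object itself could look almost identical; only
the lookup path differs.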
>> > Dump them to a JSON-izable data structure. Include the class name so
>> > that you can pick a deserialization routine at load time.
>>
>> You rely on the same namespace -> obj mapping being present at
>> deserialize time? I.e. it's the caller's responsibility to import the
>> same modules, ensure the names "map" to the same objs (or at least
>> compatible ones) as were used during serialization, etc.
>
> If the user has implemented custom subclasses, then yes, the subclasses
> must be loaded or you'll get a "class not found" error.
OK just like unpickling in Python...
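
A minimal sketch of that dump/load scheme as I understand it, in Python.
The REGISTRY dict, the register decorator, the "_class" key and
RegexTokenizer are all hypothetical names; the only parts taken from the
discussion are "dump to a JSON-izable structure", "include the class
name", and the "class not found" failure when the class was never loaded.

```python
import json

REGISTRY = {}  # class name -> class (hypothetical global lookup)


def register(cls):
    """Enroll a class so it can be found again at load time."""
    REGISTRY[cls.__name__] = cls
    return cls


@register
class RegexTokenizer:
    def __init__(self, pattern=r"\w+"):
        self.pattern = pattern

    def dump(self):
        # Include the class name so a loader can pick the right routine.
        return {"_class": type(self).__name__, "pattern": self.pattern}

    @classmethod
    def load(cls, data):
        return cls(pattern=data["pattern"])


def load_obj(data):
    """Dispatch to the load routine of the class named in the dump."""
    name = data["_class"]
    try:
        cls = REGISTRY[name]
    except KeyError:
        raise LookupError(f"class not found: {name}") from None
    return cls.load(data)


serialized = json.dumps(RegexTokenizer(r"\S+").dump())
restored = load_obj(json.loads(serialized))
```

A custom subclass would only need to be registered (i.e. its module
imported) before load_obj runs, same as with unpickling.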
Remind me again: do custom subclasses get enrolled into the global
hash in Lucy's core? I know you had said it's a thread risk, ie, not
read only...
>> Though, for core objects, you would use the global name -> vtable
>> mapping that Lucy core maintains?
>
> Yes. Any core class would already be loaded.
>
>> (I still don't fully understand why Lucy needs that global hash -- this is
>> what namespaces are for).
>
> If we didn't implement it internally, we'd need to implement it in the
> bindings for e.g. looking up deserialization routines. Furthermore, we need
> some mechanism for C-level subclassing, since that's not part of the C
> language. No namespaces there. :)
I'm still confused. Say StandardAnalyzer is implemented in C; maybe
you'd name it Lucy_Analysis_StandardAnalyzer (since C doesn't support
namespaces you put prefixes in front).
Any time something in core wants to use that class, it refers to it by
name (and the C compiler/linker maps it), not via the global hash?
But for deserializing a core object, when the deserializer is
implemented in C, I agree you'd need a global lookup; basically
because you can't consult the OBJ's symbol table dynamically. (If you
have a hosty deserializer, then it would "import lucy; lucy.XXX" to
find its classes).
(But it seems like that global hash should be readonly-able).
>> OK, so if I've made a custom Tokenizer doing some funky Python code
>> instead of a regexp, I could simply implement dump/load to do the
>> right thing.
>
> Yes.
>
> BTW, I saw that Earwin Burrfoot calls his type class "FieldType".
>
> "FieldType" is probably a better name than "FieldSpec", as it implies
> subclasses with "Type" as a suffix: FullTextType, StringType, BlobType,
> Int32Type, etc.
Agreed.
Mike