[
https://issues.apache.org/jira/browse/LUCENE-3490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139694#comment-13139694
]
Uwe Schindler commented on LUCENE-3490:
---------------------------------------
We should model the API on Charsets: just replace Charset with Codec :-)
The Codec class gets a static method "Codec.forName(String name)", that is
invoked by SegmentReader and returns the Codec instance with that name.
Internally it uses
java.util.ServiceLoader<org.apache.lucene.index.codecs.spi.CodecProvider> to
look up codecs. Every JAR file could implement one or more
org.apache.lucene.index.codecs.spi.CodecProvider classes that supply a lookup method
and iterator (like spi.CharsetProvider). CodecProvider is just an internal
class (expert) that needs to be implemented by JAR file manufacturers. For
Lucene there would be one impl named CoreCodecProvider, the resource file would
be located at
src/resources/META-INF/services/org.apache.lucene.index.codecs.spi.CodecProvider
containing only the binary class name of CoreCodecProvider.
Contrib/misc would provide one of those files at the same location, but listing
ContribMiscCodecProvider as the impl. We can do the same for PostingsFormats and
other things in the index that need to be looked up by name. This could all be
done by CodecProvider: it could also provide lookup methods for postings, etc.
(this minimizes implementation cost).
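
To make the idea concrete, here is a minimal sketch of the lookup, not the
actual patch. The method names lookup() and codecs(), the nested-class layout
and the "SimpleText" placeholder codec are assumptions for illustration; only
Codec.forName, CodecProvider, CoreCodecProvider and the use of
java.util.ServiceLoader come from the proposal above.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.ServiceLoader;

final class CodecSpiSketch {

    /** Stand-in for the proposed Codec class; a real codec would carry format logic. */
    public static abstract class Codec {
        private final String name;

        protected Codec(String name) { this.name = name; }

        public final String getName() { return name; }

        /**
         * Asks every provider that ServiceLoader can discover. Discovery needs a
         * META-INF/services file named after the provider type on the classpath;
         * without one, no provider is found and this throws.
         */
        public static Codec forName(String name) {
            return forName(name, ServiceLoader.load(CodecProvider.class));
        }

        /** Same lookup over an explicit provider sequence (assumed helper for demos). */
        public static Codec forName(String name, Iterable<? extends CodecProvider> providers) {
            for (CodecProvider provider : providers) {
                Codec codec = provider.lookup(name);
                if (codec != null) {
                    return codec; // first provider that knows the name wins
                }
            }
            throw new IllegalArgumentException("No codec named '" + name + "' was found");
        }
    }

    /** Hypothetical SPI, shaped after java.nio.charset.spi.CharsetProvider. */
    public interface CodecProvider {
        Codec lookup(String name);   // returns null when the name is unknown
        Iterator<Codec> codecs();    // all codecs this provider supplies
    }

    /** What core's provider might look like; it knows one placeholder codec. */
    public static final class CoreCodecProvider implements CodecProvider {
        private final Codec simpleText = new Codec("SimpleText") { };

        @Override
        public Codec lookup(String name) {
            return simpleText.getName().equals(name) ? simpleText : null;
        }

        @Override
        public Iterator<Codec> codecs() {
            return Collections.singletonList(simpleText).iterator();
        }
    }

    public static void main(String[] args) {
        // Explicit provider list, so the demo runs without a services file.
        Codec codec = Codec.forName("SimpleText",
                Collections.singletonList(new CoreCodecProvider()));
        System.out.println(codec.getName());
    }
}
```

The two-argument forName only exists so the sketch can run without packaging
a JAR; in the real thing SegmentReader would call Codec.forName(name) only.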
If we want to keep the Preflex codec read-only in core, but RW in
test-framework, the trick is simple: just list the test-framework classes/JAR
before the core classes/JAR on the classpath -> the test-framework
CodecProvider would be seen first by ServiceLoader and take care of the name
"preflex"; core's CodecProvider would not even be asked (as it comes later on
the classpath).
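
The ordering trick can be sketched as follows. This is a simplified model
under stated assumptions: the provider order is passed in as a plain list
instead of coming from ServiceLoader, codecs are reduced to descriptive
strings, and TestFrameworkCodecProvider is a hypothetical name. It also
assumes the class loader reports service files in classpath order, which is
worth verifying for the class loader actually in use.

```java
import java.util.Arrays;
import java.util.List;

final class ProviderOrderSketch {

    /** Simplified provider: returns a description of the codec, or null if unknown. */
    public interface CodecProvider {
        String lookup(String codecName);
    }

    /** Core only knows a read-only "preflex". */
    public static final class CoreCodecProvider implements CodecProvider {
        @Override
        public String lookup(String codecName) {
            return "preflex".equals(codecName) ? "preflex (read-only, core)" : null;
        }
    }

    /** Hypothetical test-framework provider that supplies a read-write "preflex". */
    public static final class TestFrameworkCodecProvider implements CodecProvider {
        @Override
        public String lookup(String codecName) {
            return "preflex".equals(codecName) ? "preflex (read-write, test-framework)" : null;
        }
    }

    /** The first provider that recognizes the name wins; later ones are never asked. */
    public static String resolve(String codecName, List<? extends CodecProvider> inClasspathOrder) {
        for (CodecProvider provider : inClasspathOrder) {
            String hit = provider.lookup(codecName);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // test-framework listed before core: the RW variant shadows core's.
        System.out.println(resolve("preflex", Arrays.<CodecProvider>asList(
                new TestFrameworkCodecProvider(), new CoreCodecProvider())));
        // core only (production classpath): the read-only variant is used.
        System.out.println(resolve("preflex", Arrays.<CodecProvider>asList(
                new CoreCodecProvider())));
    }
}
```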
> Restructure codec hierarchy
> ---------------------------
>
> Key: LUCENE-3490
> URL: https://issues.apache.org/jira/browse/LUCENE-3490
> Project: Lucene - Java
> Issue Type: Improvement
> Reporter: Robert Muir
> Fix For: 4.0
>
>
> Spinoff of LUCENE-2621. (Hoping we can do some of the renaming etc here in a
> rote way to make progress).
> Currently Codec.java only represents a portion of the index, but there are
> other parts of the index (stored fields, term vectors, fieldinfos, ...) that
> we want under codec control. There is also some inconsistency about what a
> Codec is currently, for example Memory and Pulsing are really just
> PostingsFormats, you might just apply them to a specific field. On the other
> hand, PreFlex actually is a Codec: it represents the Lucene 3.x index format
> (just not all parts yet). I imagine we would like SimpleText to be the same
> way.
> So, I propose restructuring the classes so that we have something like:
> * CodecProvider <-- codec name to Class resolution only
> * Codec <-- represents the index format (PostingsFormat + FieldsFormat + ...)
> * PostingsFormat: this is what Codec controls today, and Codec will return
> one of these for a field.
> * FieldsFormat: Stored Fields + Term Vectors + FieldInfos?
> I think for PreFlex, it doesn't make sense to expose its PostingsFormat as a
> 'public' class, because PreFlex can never be per-field so there is no use in
> allowing you to configure PreFlex for a specific field.
> Similarly, I think we should do the same thing for SimpleText. Nobody needs
> SimpleText for production, it should just be a Codec where we try to make as
> much of the index as plain text and simple as possible for
> debugging/learning/etc. So we don't need to expose its PostingsFormat. On
> the other hand, I don't think we need Pulsing or Memory codecs, because it's
> pretty silly to make your entire index use one of their PostingsFormats. To
> parallel with analysis: PostingsFormat is like Tokenizer and Codec is like
> Analyzer, and we don't need Analyzers to "show off" every Tokenizer.
> Later, once we abstract FieldInfos reading/writing out of o.a.l.index into
> codec control, we can also then move the baked-in PerFieldCodecWrapper out
> (it would basically be PerFieldPostingsFormat). Privately it would write the
> ids to the file like it does today. All 3.x hairy backwards code would move
> to PreflexCodec. SimpleTextCodec would get a plain text fieldinfos impl, etc.
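
For reference, the hierarchy proposed in the quoted description could be
sketched roughly like this. Only the type names Codec, PostingsFormat and
FieldsFormat come from the issue; the method names, the PerFieldCodec helper
and the "Default"/"MyCodec" placeholder names are assumptions for
illustration, not the final API.

```java
import java.util.HashMap;
import java.util.Map;

final class CodecHierarchySketch {

    /** Postings encoding for a field; this is what Codec controls today. */
    public interface PostingsFormat {
        String name();
    }

    /** Stored fields + term vectors (+ FieldInfos?), as listed in the proposal. */
    public interface FieldsFormat {
        String name();
    }

    /** The whole index format: it hands out the formats for the individual parts. */
    public static abstract class Codec {
        private final String name;

        protected Codec(String name) { this.name = name; }

        public final String getName() { return name; }

        public abstract PostingsFormat postingsFormat(String field);

        public abstract FieldsFormat fieldsFormat();
    }

    /** Hypothetical codec that picks a PostingsFormat per field, with a default. */
    public static final class PerFieldCodec extends Codec {
        private final Map<String, PostingsFormat> perField = new HashMap<>();
        private final PostingsFormat defaultFormat;
        private final FieldsFormat fieldsFormat;

        public PerFieldCodec(String name, PostingsFormat defaultFormat, FieldsFormat fieldsFormat) {
            super(name);
            this.defaultFormat = defaultFormat;
            this.fieldsFormat = fieldsFormat;
        }

        /** Registers a format for one field and returns this for chaining. */
        public PerFieldCodec use(String field, PostingsFormat format) {
            perField.put(field, format);
            return this;
        }

        @Override
        public PostingsFormat postingsFormat(String field) {
            return perField.getOrDefault(field, defaultFormat);
        }

        @Override
        public FieldsFormat fieldsFormat() {
            return fieldsFormat;
        }
    }

    public static void main(String[] args) {
        PostingsFormat pulsing = () -> "Pulsing";   // applied to one field only
        PostingsFormat standard = () -> "Default";  // placeholder default format
        FieldsFormat fields = () -> "DefaultFields";

        Codec codec = new PerFieldCodec("MyCodec", standard, fields).use("id", pulsing);
        System.out.println(codec.postingsFormat("id").name());
        System.out.println(codec.postingsFormat("body").name());
    }
}
```

This mirrors the Tokenizer/Analyzer analogy: Pulsing shows up as a
PostingsFormat applied to a single field, never as a Codec of its own.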