Hi Mark,

Saying "BlockPostings for all index files" is not quite right: Block
is only a postings format, not a codec. (SimpleText is both: there is
a SimpleText codec, which itself uses the SimpleText postings format.)
A codec describes the formats to use for every index file (postings
format, stored fields format, term vectors format, norms format,
etc.), whereas a postings format only describes the format of the
terms dictionary and postings lists. For example, we have the
Lucene42Codec, which encodes data efficiently in a binary format on
disk, and the SimpleTextCodec, which encodes data as text (useful for
learning how Lucene stores data on disk).

If you want to change any format of a codec, you generally need to
define a new codec. The default codec has two exceptions to this
rule: the postings format and the doc values format. Since there are
good reasons to use different formats on a per-field basis (primary
keys, low-frequency fields, ...), Lucene42Codec makes it easy to use
different formats for different fields.
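
As a rough sketch of the per-field hook (not the only way to do it):
Lucene42Codec exposes getPostingsFormatForField, which you can
override. The class name PerFieldCodecSketch and the field name "id"
are made up for illustration, and MemoryPostingsFormat assumes the
lucene-codecs jar is on the classpath.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;
import org.apache.lucene.codecs.memory.MemoryPostingsFormat;

public class PerFieldCodecSketch {

  /** Builds a Lucene42Codec that differs only in the postings format of one field. */
  public static Codec build() {
    final PostingsFormat memory = new MemoryPostingsFormat();     // from lucene-codecs
    final PostingsFormat defaults = new Lucene41PostingsFormat(); // default in 4.2

    return new Lucene42Codec() {
      @Override
      public PostingsFormat getPostingsFormatForField(String field) {
        // "id" is a hypothetical primary-key field name.
        return "id".equals(field) ? memory : defaults;
      }
    };
  }
}
```

You would then pass the result to IndexWriterConfig.setCodec(...)
before opening the IndexWriter. The codec name stays "Lucene42", and
the per-field format names are recorded in the index, so no extra
codec registration should be needed to read it back.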

On Thu, Mar 28, 2013 at 12:08 AM, Mark Bennett
<[email protected]> wrote:
> Question: Can SimpleText even be used for the other binary files in an
> index?  Or is it somehow specific in scope to field tokens?

Yes, it can. SimpleText is a fully working codec (but don't use it in
production: it will be insanely slow).

> Question: If it can be used for all the other files, what's the setting for
> that?  I had seen a switch -Dtests.codec=SimpleText in the old instructions,
> but clearly that's for unit tests, and wasn't sure of it's scope or
> applicability.

Although postings formats can be configured in the schema, changing
the codec is a little harder: you need to define a CodecFactory and
configure it in your solrconfig.xml (see
http://wiki.apache.org/solr/SolrConfigXml#codecFactory).
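
A minimal sketch of such a factory, assuming Solr 4.x and that the
lucene-codecs jar (which contains SimpleText) is visible to Solr. The
class name SimpleTextCodecFactory and the package in the comment are
hypothetical.

```java
import org.apache.lucene.codecs.Codec;
import org.apache.solr.core.CodecFactory;

/**
 * Hypothetical factory that makes Solr write every index file with
 * the SimpleText codec. Assumed to be referenced from solrconfig.xml
 * as: <codecFactory class="com.example.SimpleTextCodecFactory"/>
 */
public class SimpleTextCodecFactory extends CodecFactory {

  @Override
  public Codec getCodec() {
    // Looks the codec up by its registered name through SPI.
    return Codec.forName("SimpleText");
  }
}
```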

> Are there rules about which codecs can be used where?

Not really, codecs should be interchangeable. As for postings formats,
some are highly optimized for specific cases (BloomFilter for primary
keys, Pulsing for low-frequency terms, Memory if you can afford the
RAM to make search faster), but the default postings format already
performs very well in most cases, even for primary keys, since it
"pulses" terms that have docFreq=1, much as Pulsing does.

> Can you mix and match codes?  Can you chain them?

Codecs can't be chained. Some postings formats can: for example our
BloomFilter postings format can wrap any other postings format.
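
To illustrate the wrapping, here is a small sketch that assumes the
lucene-codecs jar is available. BloomWrapSketch is a made-up holder
class; the delegate here is the default Lucene41 postings format, but
any other writable postings format could be passed instead.

```java
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.bloom.BloomFilteringPostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;

public class BloomWrapSketch {

  /** Wraps the default postings format with a bloom filter on terms. */
  public static PostingsFormat wrapDefault() {
    PostingsFormat delegate = new Lucene41PostingsFormat();
    // The bloom format stores its filter and leaves the actual
    // terms dictionary and postings to the delegate.
    return new BloomFilteringPostingsFormat(delegate);
  }
}
```

The returned format can be handed out for a primary-key field from
getPostingsFormatForField on the default codec.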

> I also saw the FilterCodec javadoc.  Would I only use that if I want to
> reuse most of an existing code, but alter just one part of it?

Exactly. For example if you're not happy with compressed stored
fields, you could use this FilterCodec class to define a new codec
that would use the same formats as Lucene42Codec for everything but
stored fields.
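
Something along these lines, as a sketch only. The codec name
"UncompressedStoredFieldsCodec" is hypothetical, and I am assuming
Lucene40StoredFieldsFormat (the older, uncompressed format) as the
replacement; any other StoredFieldsFormat would slot in the same way.

```java
import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.lucene40.Lucene40StoredFieldsFormat;
import org.apache.lucene.codecs.lucene42.Lucene42Codec;

/** Delegates everything to Lucene42Codec except stored fields. */
public final class UncompressedStoredFieldsCodec extends FilterCodec {

  private final StoredFieldsFormat storedFields = new Lucene40StoredFieldsFormat();

  // Public no-arg constructor so the codec can be loaded through SPI.
  public UncompressedStoredFieldsCodec() {
    super("UncompressedStoredFieldsCodec", new Lucene42Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}
```

Because this codec has its own name, it has to be registered so that
Lucene can find it again when reading the index: list the fully
qualified class name in
META-INF/services/org.apache.lucene.codecs.Codec inside your jar.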

> I'm a bit
> fuzzy combining that with other codes.  If there's a java command line -D
> switch that tells the system to use a different (but already existing) code,
> then I don't think I'd need this at all?

There is no such switch as far as I know.

I hope this helps. This might sound a bit complicated, but the
important thing to know is that we try to make the default codec as
good as possible for most use-cases. Another argument in favor of
sticking with it is that Lucene only guarantees backwards
compatibility for the default postings formats and codecs. People who
start using non-default formats might need to perform index format
migrations when upgrading to a future version of Lucene.

-- 
Adrien
