Hi Mark,

When you say "BlockPostings for all index files", that is not quite right: the Block postings format is only a postings format, not a codec. (SimpleText is both: there is a SimpleText codec, which itself uses the SimpleText postings format.) A codec describes the format to use for every index file: postings format, stored fields format, term vectors format, norms format, etc., whereas a postings format only describes the format of the terms dictionary and postings lists. For example, we have Lucene42Codec, which encodes data efficiently in a binary format on disk, and SimpleTextCodec, which encodes data as text (useful for learning how Lucene stores data on disk).
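
To make the codec / postings format distinction concrete, here is a toy sketch in plain Java. None of these types exist in Lucene: `CodecSketch`, `ToyCodec`, `Part` and the "binary" / "text" labels are all made up for illustration. It only models the idea that a codec picks a format for every part of the index, while choosing a postings format touches the postings entry and nothing else.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model only: every type here is hypothetical and is NOT part of Lucene.
final class CodecSketch {

    // The kinds of index data a codec has to cover (simplified list).
    enum Part { POSTINGS, STORED_FIELDS, TERM_VECTORS, NORMS }

    // A "codec" picks one format for every part of the index.
    static final class ToyCodec {
        final String name;
        private final Map<Part, String> formats = new EnumMap<Part, String>(Part.class);

        // Builds a codec that uses the same encoding label for every part.
        ToyCodec(String name, String encoding) {
            this.name = name;
            for (Part p : Part.values()) {
                formats.put(p, encoding);
            }
        }

        // Copy constructor used by withPostingsFormat.
        private ToyCodec(String name, Map<Part, String> formats) {
            this.name = name;
            this.formats.putAll(formats);
        }

        String formatFor(Part part) {
            return formats.get(part);
        }

        // Choosing a postings format changes the POSTINGS entry and nothing else.
        ToyCodec withPostingsFormat(String postingsFormat) {
            ToyCodec copy = new ToyCodec(name + "+" + postingsFormat, formats);
            copy.formats.put(Part.POSTINGS, postingsFormat);
            return copy;
        }
    }

    public static void main(String[] args) {
        ToyCodec binary = new ToyCodec("binary-codec", "binary");
        ToyCodec mixed = binary.withPostingsFormat("text");
        // Print each part's format before and after swapping the postings format.
        for (Part p : Part.values()) {
            System.out.println(p + ": " + binary.formatFor(p) + " -> " + mixed.formatFor(p));
        }
    }
}
```

In this model, swapping the postings format leaves stored fields, term vectors and norms exactly as the base codec defined them, which is why a postings format alone cannot apply to "all index files".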

If you want to change any format of a codec, you generally need to define a new codec. There are two exceptions to this rule in the default codec: the postings format and the doc values format. Since there are good reasons to use different formats on a per-field basis (primary keys, low-frequency fields, ...), Lucene42Codec makes it easy to use different formats for different fields.

On Thu, Mar 28, 2013 at 12:08 AM, Mark Bennett <[email protected]> wrote:
> Question: Can SimpleText even be used for the other binary files in an
> index? Or is it somehow specific in scope to field tokens?

Yes it can. SimpleText is a fully working codec (but don't use it in production, it will be insanely slow).

> Question: If it can be used for all the other files, what's the setting for
> that? I had seen a switch -Dtests.codec=SimpleText in the old instructions,
> but clearly that's for unit tests, and wasn't sure of it's scope or
> applicability.

Although postings formats can be configured in the schema, changing the codec is a little harder: you need to define a CodecFactory and configure it in your solrconfig.xml (see http://wiki.apache.org/solr/SolrConfigXml#codecFactory).

> Are there rules about which codecs can be used where?

Not really; codecs should be interchangeable. As for postings formats, some are highly optimized for specific cases (BloomFilter for primary keys, Pulsing for low-frequency terms, Memory if you can afford the RAM to make search faster), but the default postings format already performs very well in most cases, even for primary keys, since it "pulses" terms that have docFreq=1, similarly to Pulsing.

> Can you mix and match codes? Can you chain them?

Codecs can't be chained. Some postings formats can: for example, our BloomFilter postings format can wrap any other postings format.

> I also saw the FilterCodec javadoc. Would I only use that if I want to
> reuse most of an existing code, but alter just one part of it?

Exactly. For example, if you're not happy with compressed stored fields, you could use the FilterCodec class to define a new codec that uses the same formats as Lucene42Codec for everything but stored fields.

> I'm a bit
> fuzzy combining that with other codes. If there's a java command line -D
> switch that tells the system to use a different (but already existing) code,
> then I don't think I'd need this at all?

There is no such switch as far as I know.

I hope this helps. This might sound a bit complicated, but the important thing to know is that we try to make the default codec as good as possible for most use cases. Another argument in its favor is that Lucene only guarantees backwards compatibility for the default postings formats and codecs. People who start using non-default formats might need to perform index format migrations when upgrading to a future version of Lucene.

--
Adrien
