I started investigating the current state of Lucene's index sorting support
in Solr and had Claude Opus write a report for me. Rather than keep it to
myself, I'm sharing it here in case others are wondering what's up as
well.
Background
----------
Lucene has supported index-level sorting since LUCENE-6766 (Lucene 6.2),
where segments are internally sorted by a configurable field order at
flush/merge time. This enables significant query-time optimizations --
when the query's sort matches the index sort, Lucene can skip entire
segments or terminate collection early.
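To make the early-termination idea concrete, here is a toy sketch in plain
Java. It does not use Lucene; the class and method names are made up for
illustration, and a segment is modeled as a bare array of timestamp values.
When the segment is already sorted in the requested order, the first n
documents are the answer and the scan can stop.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model, not Lucene code: a segment is just an array of sort-field values. */
final class EarlyTerminationSketch {

    /** Number of documents visited by the most recent call to topN. */
    static int visited;

    /**
     * Returns the n largest timestamps in the segment, largest first.
     * If the segment is known to be sorted descending, the scan stops after
     * n documents; otherwise every document is visited.
     */
    static List<Long> topN(long[] segment, int n, boolean sortedDescending) {
        visited = 0;
        List<Long> hits = new ArrayList<>();
        if (n <= 0) {
            return hits;
        }
        for (long ts : segment) {
            visited++;
            if (sortedDescending) {
                hits.add(ts);
                if (hits.size() == n) {
                    break; // early termination: later docs cannot compete
                }
            } else {
                // Keep hits ordered descending and capped at n entries.
                int pos = 0;
                while (pos < hits.size() && hits.get(pos) >= ts) {
                    pos++;
                }
                hits.add(pos, ts);
                if (hits.size() > n) {
                    hits.remove(hits.size() - 1);
                }
            }
        }
        return hits;
    }
}
```

Both paths return the same hits; the difference is only in how many
documents have to be visited to get them.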
Current Solr Support
--------------------
Solr does support index sorting today, but through an indirect mechanism.
Configuration is done via SortingMergePolicyFactory in solrconfig.xml:
  <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
    <str name="sort">timestamp desc</str>
    <str name="wrapped.prefix">inner</str>
    <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
  </mergePolicyFactory>
Internally, SortingMergePolicy is a FilterMergePolicy that adds no merging
behavior of its own -- it simply holds a Sort object. SolrIndexConfig
then has a special instanceof check that extracts this Sort and calls
IndexWriterConfig.setIndexSort(). The class itself has a TODO comment
acknowledging this is a workaround: "remove this and add indexSort
specification directly to solrconfig.xml?"
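The "merge policy as a Sort carrier" pattern described above can be
sketched roughly as follows. This is plain Java with hypothetical stand-in
classes (MergePolicyStub, SortCarrierPolicy, WriterConfigStub,
IndexConfigSketch); none of them are the real Lucene or Solr types, and the
sort is represented as a string rather than a Sort object to keep the
sketch self-contained.

```java
/** Hypothetical stand-in for a merge policy base class. */
class MergePolicyStub {}

/** Stand-in for the carrier: it exists only to hold the sort specification. */
final class SortCarrierPolicy extends MergePolicyStub {
    private final String sort; // e.g. "timestamp desc"

    SortCarrierPolicy(String sort) {
        this.sort = sort;
    }

    String getSort() {
        return sort;
    }
}

/** Stand-in for the writer configuration that actually owns the index sort. */
final class WriterConfigStub {
    private String indexSort; // null means no index sort

    void setIndexSort(String sort) {
        this.indexSort = sort;
    }

    String getIndexSort() {
        return indexSort;
    }
}

/** Stand-in for the config code that performs the instanceof check. */
final class IndexConfigSketch {
    static WriterConfigStub toWriterConfig(MergePolicyStub policy) {
        WriterConfigStub cfg = new WriterConfigStub();
        // The special case: pull the sort out of the merge policy and
        // hand it to the writer config, where it really belongs.
        if (policy instanceof SortCarrierPolicy) {
            cfg.setIndexSort(((SortCarrierPolicy) policy).getSort());
        }
        return cfg;
    }
}
```

The sketch shows why the arrangement feels indirect: the setting lives on
one object purely so that another piece of code can copy it elsewhere.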
Query-side integration exists via the "segmentTerminateEarly" query
parameter, which wraps the collector in an EarlyTerminatingSortingCollector.
Note that this collector is @Deprecated -- modern Lucene's TopFieldCollector
handles early termination natively when it detects sorted segments.
The /admin/segments API (with coreInfo=true) exposes the indexSort
configuration and per-segment sort info.
AtomicUpdateDocumentMerger correctly detects fields used for index sorting
and prevents DocValues-only updates on them (a Lucene limitation).
Open Issues
-----------
Several open JIRA issues relate to this area.
SOLR-9108: Improve how index time sorting is configured
https://issues.apache.org/jira/browse/SOLR-9108
Filed by Mike McCandless in 2016 right after LUCENE-6766. Proposes
configuring index sort directly in solrconfig.xml alongside other
IndexWriter settings rather than piggybacking on the merge policy.
SOLR-13681: Make Lucene's index sorting directly configurable in Solr
https://issues.apache.org/jira/browse/SOLR-13681
Filed by Christine Poerschke in 2019. Has a draft PR (#313) that adds
a direct <indexSort> config element to solrconfig.xml. The PR has been
stalled since 2021; the main open question is what should happen when
index sorting is enabled on an existing collection that already has
unsorted segments. Duplicates SOLR-12230 (deprecate SortingMergePolicy).
SOLR-12239: Enabling index sorting causes CorruptIndexException
https://issues.apache.org/jira/browse/SOLR-12239
When index sorting is enabled on an existing collection with unsorted
segments, reloading throws: "segment not sorted with indexSort=null".
The current workaround is to delete all data and reindex from scratch.
Notably, the related LUCENE-9484 ("Allow index sorting to happen after
the fact") was fixed in Lucene 9.0, which allows merging unsorted
segments into sorted ones retroactively. Solr has not wired this up.
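As an illustration of the failure mode only, here is a plain-Java sketch of
a check that compares each segment's recorded sort against the configured
one. It is not Lucene's actual code: the class name, the string
representation of a sort, and the exception type are assumptions made for
this example. Only the message text mirrors the error quoted above.

```java
import java.util.List;

/** Illustrative only: models segments as their recorded sort, or null if unsorted. */
final class IndexSortCheckSketch {

    /**
     * Throws if an index sort is configured and any segment was written
     * with a different sort, or with no sort at all (null).
     */
    static void validate(String configuredSort, List<String> segmentSorts) {
        if (configuredSort == null) {
            return; // no index sort configured, nothing to check
        }
        for (String segmentSort : segmentSorts) {
            if (!configuredSort.equals(segmentSort)) {
                throw new IllegalStateException(
                        "segment not sorted with indexSort=" + segmentSort);
            }
        }
    }
}
```

A collection whose older segments were written before the sort was
configured corresponds to the null entries here, which is why enabling the
sort after the fact trips the check.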
SOLR-17170: Support Blocks in Index Sorting
https://issues.apache.org/jira/browse/SOLR-17170
Lucene 9.10+ supports block-aware presort during index sorting (via
Lucene PR #12829). This is critical for nested/block-join documents.
No Solr-side work has been done.
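To show what "block-aware" means in practice, here is a small plain-Java
sketch. It is not Lucene code; the Doc class and sortBlocks method are
hypothetical. It assumes the usual block-join layout where children come
first and the parent is the last document of its block, and it reorders
whole blocks by the parent's sort value so children never get separated
from their parent.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Toy model of sorting parent/child blocks as units; not Lucene code. */
final class BlockSortSketch {

    /** Hypothetical document: an id, a sort value, and a parent marker. */
    static final class Doc {
        final String id;
        final long value;
        final boolean parent;

        Doc(String id, long value, boolean parent) {
            this.id = id;
            this.value = value;
            this.parent = parent;
        }
    }

    /**
     * Groups docs into blocks (each block ends at a parent), orders the
     * blocks by the parent's value ascending, and returns the ids in the
     * resulting order. Order inside each block is left untouched.
     */
    static List<String> sortBlocks(List<Doc> docs) {
        List<List<Doc>> blocks = new ArrayList<>();
        List<Doc> current = new ArrayList<>();
        for (Doc d : docs) {
            current.add(d);
            if (d.parent) {
                blocks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            throw new IllegalArgumentException("trailing children without a parent");
        }
        // The parent is the last doc of each block, so sort on its value.
        blocks.sort(Comparator.comparingLong((List<Doc> b) -> b.get(b.size() - 1).value));
        List<String> ids = new ArrayList<>();
        for (List<Doc> block : blocks) {
            for (Doc d : block) {
                ids.add(d.id);
            }
        }
        return ids;
    }
}
```

A naive per-document sort on the same input would interleave children from
different blocks, which is what breaks block-join queries.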
Summary
-------
Index sorting in Solr works for simple (non-nested) use cases via the
SortingMergePolicyFactory, but the implementation is showing its age:
- Configuration is indirect and hacky (merge policy as a Sort carrier)
- Cannot be safely enabled on existing collections without a full reindex,
  despite Lucene having solved this at the engine level since 9.0
- Incompatible with nested/block-join documents on Solr 10+
- The query-side early termination collector is deprecated
- A draft PR for direct configuration has been stalled since 2021
~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley