I have been looking into the current state of Lucene's index sorting
support in Solr, and I had Claude Opus write a report for me.  Rather than
keep it to myself, I'm sharing it here in case others are wondering about
the same thing.


Background
----------

Lucene has supported index-level sorting since LUCENE-6766 (Lucene 6.2),
where segments are internally sorted by a configurable field order at
flush/merge time. This enables significant query-time optimizations --
when the query's sort matches the index sort, Lucene can skip entire
segments or terminate collection early.
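
To make the early-termination idea concrete, here is a toy sketch in plain
Java. It is not Lucene's API: a "segment" is assumed to be just an array of
timestamps, and the class and method names are made up for illustration.
When the segment is already in the requested order, the first k documents
are the answer and the scan can stop; otherwise every document has to be
visited.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Toy model of early termination on an index-sorted segment (not Lucene code). */
final class EarlyTerminationSketch {

    /** Top hits plus the number of documents that had to be visited. */
    static final class Result {
        final List<Long> hits;
        final int visited;

        Result(List<Long> hits, int visited) {
            this.hits = hits;
            this.visited = visited;
        }
    }

    /**
     * Returns the k largest timestamps of one segment, newest first.
     * sortedDesc = true models an index sort that matches the query sort.
     */
    static Result topK(long[] segment, int k, boolean sortedDesc) {
        int visited = 0;
        List<Long> hits = new ArrayList<>();
        if (sortedDesc) {
            // Segment order equals result order: stop after k documents.
            for (long ts : segment) {
                if (hits.size() == k) {
                    break;
                }
                visited++;
                hits.add(ts);
            }
            return new Result(hits, visited);
        }
        // Unsorted segment: every document must be examined.
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap of the best k so far
        for (long ts : segment) {
            visited++;
            heap.add(ts);
            if (heap.size() > k) {
                heap.poll(); // drop the smallest candidate
            }
        }
        hits.addAll(heap);
        hits.sort(Comparator.reverseOrder());
        return new Result(hits, visited);
    }
}
```

For a five-document segment and k = 2, the sorted path visits two documents
while the unsorted path visits all five, and both return the same hits.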


Current Solr Support
--------------------

Solr does support index sorting today, but through an indirect mechanism:

Configuration is done via SortingMergePolicyFactory in solrconfig.xml:

    <mergePolicyFactory class="org.apache.solr.index.SortingMergePolicyFactory">
      <str name="sort">timestamp desc</str>
      <str name="wrapped.prefix">inner</str>
      <str name="inner.class">org.apache.solr.index.TieredMergePolicyFactory</str>
    </mergePolicyFactory>

Internally, SortingMergePolicy is a FilterMergePolicy that does no
merge-policy work of its own -- it simply holds a Sort object.
SolrIndexConfig then has a special instanceof check that extracts this Sort
and calls IndexWriterConfig.setIndexSort(). The class itself has a TODO
comment acknowledging that this is a workaround: "remove this and add
indexSort specification directly to solrconfig.xml?"
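
The shape of that workaround can be sketched with stand-in types. Nothing
below is real Lucene or Solr code: every class name is hypothetical and the
sort is reduced to a plain string. The only point is to show a merge policy
wrapper being used as a carrier for a setting that really belongs to the
writer configuration.

```java
/** Stand-in types only; none of these are real Lucene or Solr classes. */
final class SortCarrierSketch {

    /** Stand-in for a merge policy. */
    interface ToyMergePolicy {
        String describe();
    }

    /** Stand-in for a policy that does the actual merge selection. */
    static final class ToyTieredPolicy implements ToyMergePolicy {
        @Override
        public String describe() {
            return "tiered";
        }
    }

    /** Delegates all behaviour to the wrapped policy; only carries a sort spec. */
    static final class ToySortCarrierPolicy implements ToyMergePolicy {
        final ToyMergePolicy wrapped;
        final String sortSpec;

        ToySortCarrierPolicy(ToyMergePolicy wrapped, String sortSpec) {
            this.wrapped = wrapped;
            this.sortSpec = sortSpec;
        }

        @Override
        public String describe() {
            return wrapped.describe();
        }
    }

    /** Stand-in for the writer configuration, the real owner of the index sort. */
    static final class ToyWriterConfig {
        ToyMergePolicy mergePolicy;
        String indexSort; // null means "no index sort"
    }

    /** Mirrors the instanceof check described above: pull the sort out of the policy. */
    static ToyWriterConfig configure(ToyMergePolicy policy) {
        ToyWriterConfig cfg = new ToyWriterConfig();
        cfg.mergePolicy = policy;
        if (policy instanceof ToySortCarrierPolicy) {
            cfg.indexSort = ((ToySortCarrierPolicy) policy).sortSpec;
        }
        return cfg;
    }
}
```

With a plain policy the configuration stays unsorted; only the wrapper type
switches index sorting on, which is why the sort ends up being configured in
the merge policy section.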

Query-side integration exists via the "segmentTerminateEarly" query
parameter, which wraps the collector in an EarlyTerminatingSortingCollector.
Note that this collector is @Deprecated -- modern Lucene's TopFieldCollector
handles early termination natively when it detects sorted segments.
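
As a rough illustration of the query side, the following plain-Java sketch
builds a select URL that passes segmentTerminateEarly=true together with a
sort. The base URL, the collection name "events" and the helper itself are
assumptions made for the example; the sort value is chosen to match the
"timestamp desc" index sort from the configuration shown earlier.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

/** Builds an example query URL; host and collection are hypothetical. */
final class TerminateEarlyRequestSketch {

    static String selectUrl(String baseUrl, String collection, String sort, int rows) {
        // LinkedHashMap keeps the parameters in insertion order.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "*:*");
        params.put("sort", sort); // should match the configured index sort
        params.put("rows", Integer.toString(rows));
        params.put("segmentTerminateEarly", "true");

        StringBuilder url = new StringBuilder(baseUrl)
                .append('/').append(collection).append("/select");
        char separator = '?';
        for (Map.Entry<String, String> p : params.entrySet()) {
            url.append(separator)
               .append(p.getKey())
               .append('=')
               .append(URLEncoder.encode(p.getValue(), StandardCharsets.UTF_8));
            separator = '&';
        }
        return url.toString();
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "events", "timestamp desc", 10));
    }
}
```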

The /admin/segments API (with coreInfo=true) exposes the indexSort
configuration and per-segment sort info.

AtomicUpdateDocumentMerger correctly detects fields used for index sorting
and prevents DocValues-only updates on them (a Lucene limitation).


Open Issues
-----------

Several open JIRA issues relate to this area.

SOLR-9108: Improve how index time sorting is configured
https://issues.apache.org/jira/browse/SOLR-9108

  Filed by Mike McCandless in 2016 right after LUCENE-6766. Proposes
  configuring index sort directly in solrconfig.xml alongside other
  IndexWriter settings rather than piggybacking on the merge policy.

SOLR-13681: Make Lucene's index sorting directly configurable in Solr
https://issues.apache.org/jira/browse/SOLR-13681

  Filed by Christine Poerschke in 2019. Has a draft PR (#313) that adds
  a direct <indexSort> config element to solrconfig.xml. The PR has been
  stalled since 2021; the main open question is what should happen when
  index sorting is enabled on an existing collection that already has
  unsorted segments. Duplicates SOLR-12230 (deprecate SortingMergePolicy).

SOLR-12239: Enabling index sorting causes CorruptIndexException
https://issues.apache.org/jira/browse/SOLR-12239

  When index sorting is enabled on an existing collection with unsorted
  segments, reloading throws: "segment not sorted with indexSort=null".
  The current workaround is to delete all data and reindex from scratch.
  Notably, the related LUCENE-9484 ("Allow index sorting to happen after
  the fact") was fixed in Lucene 9.0, which allows merging unsorted
  segments into sorted ones retroactively. Solr has not wired this up.

SOLR-17170: Support Blocks in Index Sorting
https://issues.apache.org/jira/browse/SOLR-17170

  Lucene 9.10+ supports block-aware presort during index sorting (via
  Lucene PR #12829). This is critical for nested/block-join documents.
  No Solr-side work has been done.


Summary
-------

Index sorting in Solr works for simple (non-nested) use cases via the
SortingMergePolicyFactory, but the implementation is showing its age:

- Configuration is indirect and hacky (merge policy as a Sort carrier)
- Cannot be safely enabled on existing collections without full reindex,
  despite Lucene having solved this at the engine level since 9.0
- Incompatible with nested/block-join documents on Solr 10+
- The query-side early termination collector is deprecated
- A draft PR for direct configuration has been stalled since 2021

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley
