bhabegger opened a new pull request, #2943:
URL: https://github.com/apache/jackrabbit-oak/pull/2943

   ## Summary
   
   Adds accurate total document count tracking to Oak's fulltext indexing 
layer. The count is persisted to `:status/totalIndexedNodes` after each 
indexing cycle and will be used by a follow-up (OAK-12248) to short-circuit 
queries against known-empty indexes.
   
   ### Changes
   
   **New interface methods on `FulltextIndexWriter`:**
   - `getDocCount()` — total count captured during `close()`, -1 if unavailable
   - `verifyDocCount(long expected)` — periodic cross-check against an 
independent backend measurement (called every `VERIFY_INTERVAL` cycles, default 
10, configurable via `oak.index.totalIndexedNodes.verifyInterval`)
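
As a rough illustration of the contract described above, the two methods and the periodic cross-check could look like the sketch below. `DocCountingWriter` and `VerifySchedule` are hypothetical names, not classes from the PR. The `boolean` return type of `verifyDocCount` and the choice to treat a non-positive interval as "never verify" are assumptions.

```java
// Hypothetical stand-in for the two methods the PR adds to FulltextIndexWriter.
interface DocCountingWriter {
    /** Total document count captured during close(), or -1 if unavailable. */
    long getDocCount();

    /** Cross-checks the expected count against an independent backend measurement.
     *  The boolean return type is an assumption; the PR does not state it. */
    boolean verifyDocCount(long expected);
}

// Hypothetical helper that decides on which indexing cycles the cross-check runs.
final class VerifySchedule {
    // Property name and default value (10) are taken from the PR description.
    static final int VERIFY_INTERVAL =
            Integer.getInteger("oak.index.totalIndexedNodes.verifyInterval", 10);

    private final int interval;
    private long cycles;

    VerifySchedule(int interval) {
        this.interval = interval;
    }

    /** Returns true on every interval-th call; a non-positive interval never verifies. */
    boolean shouldVerify() {
        cycles++;
        return interval > 0 && cycles % interval == 0;
    }
}
```

With the default interval of 10, `shouldVerify()` would return `true` on the 10th, 20th, 30th cycle and so on, so the extra backend measurement is paid only occasionally.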
   
   **Backend implementations:**
   - `DefaultIndexWriter` (Lucene): captures `writer.numDocs()` before 
`writer.close()`, opens a fresh `DirectoryReader` before `directory.close()` 
for independent verification
   - `ElasticIndexWriter`: forces an index refresh then calls the ES count API 
after bulk flush; runs a second count call for verification
   
   **`IndexDefinition` readers:**
   - `PROP_TOTAL_INDEXED_NODES = "totalIndexedNodes"`
   - `getTotalIndexedNodes()` — reads from `:status`, -1 if absent
   - `isReindexCompleted()` — true if `REINDEX_COMPLETION_TIMESTAMP` is set in 
`:status`
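
A minimal sketch of the reader semantics, using a plain `Map` as a stand-in for the `:status` node. `StatusReader` is a hypothetical class, and the string value used for `REINDEX_COMPLETION_TIMESTAMP` is an assumption, since the description only names the constant.

```java
import java.util.Map;

// Hypothetical illustration; the real readers live on IndexDefinition and
// read from Oak's NodeState, not from a Map.
final class StatusReader {
    static final String PROP_TOTAL_INDEXED_NODES = "totalIndexedNodes";
    // Assumed property name; the PR only refers to the constant.
    static final String REINDEX_COMPLETION_TIMESTAMP = "reindexCompletionTimestamp";

    /** Returns the persisted count, or -1 if the property is absent. */
    static long getTotalIndexedNodes(Map<String, Object> status) {
        Object value = status == null ? null : status.get(PROP_TOTAL_INDEXED_NODES);
        return value instanceof Number ? ((Number) value).longValue() : -1L;
    }

    /** Returns true if the reindex completion timestamp is set. */
    static boolean isReindexCompleted(Map<String, Object> status) {
        return status != null && status.containsKey(REINDEX_COMPLETION_TIMESTAMP);
    }
}
```

The point of the `-1` sentinel is that a caller can tell "count unknown" apart from "index known to be empty" (`0`).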
   
   **`FulltextIndexEditorContext.closeWriter()` updated:**
   - Gated by `FT_OAK_12247_DISABLE` kill switch (default `false` = tracking 
active)
   - Writes `totalIndexedNodes` from `getDocCount()` every cycle
   - For reindex cycles, always writes `REINDEX_COMPLETION_TIMESTAMP` 
regardless of `indexUpdated` — closes the Elastic gap where an empty reindex 
left `:status` untouched
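
The decision logic listed above can be sketched roughly as follows. This is not the PR's code: `CloseWriterSketch` is hypothetical, a `Map` stands in for the `:status` builder, the timestamp is a plain `long`, and the behaviour with the kill switch enabled (timestamp written only when `indexUpdated` is true) is an assumption about the previous behaviour.

```java
import java.util.Map;

// Hypothetical sketch of the :status update performed when the writer closes.
final class CloseWriterSketch {
    static final String TOTAL = "totalIndexedNodes";
    // Assumed property name for REINDEX_COMPLETION_TIMESTAMP.
    static final String REINDEX_TS = "reindexCompletionTimestamp";

    /**
     * @param status           mutable stand-in for the :status node
     * @param trackingDisabled value of the FT_OAK_12247_DISABLE kill switch
     * @param docCount         result of getDocCount(), -1 if unavailable
     * @param reindex          whether this cycle is a reindex
     * @param indexUpdated     whether the writer reported any change
     * @param now              completion timestamp to record
     */
    static void updateStatus(Map<String, Object> status, boolean trackingDisabled,
                             long docCount, boolean reindex, boolean indexUpdated,
                             long now) {
        if (trackingDisabled) {
            // Assumed legacy path: only touch :status when something was written.
            if (reindex && indexUpdated) {
                status.put(REINDEX_TS, now);
            }
            return;
        }
        // Skip the write when no count is available (the total >= 0 guard).
        if (docCount >= 0) {
            status.put(TOTAL, docCount);
        }
        // Reindex cycles record completion even when no document was indexed.
        if (reindex) {
            status.put(REINDEX_TS, now);
        }
    }
}
```

For an empty reindex (`docCount = 0`, `indexUpdated = false`) this sketch persists both `totalIndexedNodes = 0` and the timestamp, which matches the first case in the test plan.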
   
   ### Design decisions
   
   - **`total >= 0` guard**: `totalIndexedNodes` is written only when the count 
is non-negative. This protects the `ReadOnlyBuilder` used in NRT/hybrid Lucene 
indexing and prevents an `UnsupportedOperationException` during sync commits
   - **ES refresh before count**: prevents a stale count on async reindex, 
where no automatic refresh occurs
   - **Lucene independent reader**: `DirectoryReader.numDocs()` is captured 
before `directory.close()` to give `verifyDocCount()` a genuine cross-check
   
   ## Test plan
   
   - [ ] `FulltextIndexEditorContextTest` — 3 tests: empty reindex (docCount=0, 
indexUpdated=false persists both `totalIndexedNodes=0` and 
`REINDEX_COMPLETION_TIMESTAMP`), non-empty reindex, incremental cycle
   - [ ] `ElasticIndexWriterTest` — 2 new tests for `getDocCount()` and 
`verifyDocCount()`
   - [ ] `DefaultIndexWriterTest` — 2 new tests for `getDocCount()` and 
`verifyDocCount()`
   - [ ] `mvn test -pl oak-search` (167 tests pass)
   - [ ] `mvn test -pl oak-lucene` (1240 tests pass)
   - [ ] `mvn test -pl oak-search-elastic -Dtest=ElasticIndexWriterTest` (10 
tests pass)
   
   ## Related
   
   - OAK-12248 (follow-up): planner and query short-circuit for known-empty 
indexes
   - OAK-12249 (follow-up): lazy ES index provisioning for empty reindex

