bhabegger opened a new pull request, #2793:
URL: https://github.com/apache/jackrabbit-oak/pull/2793

   # LuceneNg (Lucene 9.12.2) Implementation - Phase 1 & 2
   
   This PR implements a new Lucene 9 index module (`oak-search-luceneNg`) for 
Apache Jackrabbit Oak, targeting the OAK-12089 epic.
   
   ## 🎯 Goals of This PR
   
   | Goal | Description | Status |
   |------|-------------|--------|
   | **New Module** | Create `oak-search-luceneNg` module with Lucene 9.12.2 | ✅ Complete |
   | **Write Path** | Implement document indexing via IndexEditor | ✅ Complete |
   | **Storage** | Oak-native storage with chunked blob support | ✅ Complete |
   | **Read Path** | Implement query execution with full-text search | ✅ Complete |
   | **Property Queries** | Support equality constraints on indexed properties | ✅ Complete |
   | **Full-Text Search** | Support analyzed text queries with tokenization | ✅ Complete |
   | **Test Coverage** | Comprehensive unit and integration tests | ✅ Complete (53/53 tests pass) |
   | **Build Integration** | Maven build, OSGi bundles, Apache RAT compliance | ✅ Complete |
   
   ## 📊 Implementation Status & Roadmap
   
   ### ✅ Phase 1: Write Path (Complete)
   
   | Component | Status | Tests | Notes |
   |-----------|--------|-------|-------|
   | **LuceneNgIndexEditor** | ✅ Done | 7 tests | Indexes string properties (single & multi-value) |
   | **OakDirectory** | ✅ Done | 16 tests | Lucene Directory backed by Oak NodeStore |
   | **Chunked I/O** | ✅ Done | 5 tests | Efficient large file handling with 1MB chunks |
   | **IndexWriter lifecycle** | ✅ Done | 7 tests | Shared writer pattern for correct commit semantics |
   
   ### ✅ Phase 2: Read Path - Basic Queries (Complete)
   
   | Component | Status | Tests | Notes |
   |-----------|--------|-------|-------|
   | **LuceneNgIndex** | ✅ Done | 2 tests | QueryIndex implementation with cost calculation |
   | **Full-text queries** | ✅ Done | 2 tests | Visitor pattern, tokenization, phrase/term queries |
   | **Property queries** | ✅ Done | 5 tests | Exact-match equality constraints |
   | **LuceneNgCursor** | ✅ Done | 7 tests | Result iteration with score support |
   | **Query planner integration** | ✅ Done | 2 tests | Cost-based index selection (cost = 2.0) |
   
   ### 🚧 Phase 2: Read Path - Advanced (Planned)
   
   | Feature | Priority | Complexity | Notes |
   |---------|----------|------------|-------|
   | **Range queries** | High | Medium | Support `<`, `>`, `<=`, `>=` operators |
   | **Boolean queries** | High | Medium | Complex AND/OR/NOT combinations |
   | **Sorting** | Medium | Medium | ORDER BY support |
   | **Aggregation rules** | Medium | High | Property aggregation across node types |
   | **Highlighting** | Low | Medium | rep:excerpt support |
   | **Faceting** | Low | High | rep:facet support |
   
   ### ⏳ Phase 3: Migration & Production (Future)
   
   | Feature | Priority | Complexity | Notes |
   |---------|----------|------------|-------|
   | **Hot migration** | High | High | Migrate from Lucene 4.7 without downtime |
   | **Index compatibility** | High | High | Read existing Lucene indexes |
   | **Performance benchmarks** | High | Medium | Compare with legacy Lucene |
   | **AEM integration testing** | High | High | Validate in AEM environment |
   | **Documentation** | Medium | Low | Usage guides, migration docs |
   
   ## 📦 What's Included
   
   ### New Module Structure
   ```
   oak-search-luceneNg/
   ├── src/main/java/
   │   └── org/apache/jackrabbit/oak/plugins/index/luceneNg/
   │       ├── LuceneNgIndex.java              # Query execution
   │       ├── LuceneNgIndexEditor.java        # Document indexing
   │       ├── LuceneNgCursor.java             # Result iteration
   │       ├── LuceneNgIndexTracker.java       # Index lifecycle
   │       ├── LuceneNgIndexDefinition.java    # Index metadata
   │       ├── IndexSearcherHolder.java        # Search resource management
   │       └── directory/
   │           ├── OakDirectory.java           # Lucene Directory implementation
   │           ├── OakIndexInput.java          # Read operations
   │           └── OakIndexOutput.java         # Write operations
   └── src/test/java/
       ├── LuceneNgComparisonTest.java         # Property query validation
       ├── IntegrationTest.java                # End-to-end tests
       ├── IndexingFunctionalTest.java         # Indexing edge cases
       └── directory/                          # Storage layer tests
   ```
   
   ### Key Features
   
   **Query Support:**
   - ✅ Full-text search with StandardAnalyzer tokenization
   - ✅ Property equality queries (`@property = 'value'`)
   - ✅ Proper cost-based query planning
   - ✅ Score-based result ranking
   
   **Indexing:**
   - ✅ String properties (single and multi-value)
   - ✅ Full-text aggregation to `:fulltext` field
   - ✅ Exact-match fields for property queries
   - ✅ 32KB term length handling
   
   **Storage:**
   - ✅ Oak NodeStore integration via `:data` child node
   - ✅ Chunked blob storage (1MB chunks)
   - ✅ Concurrent read/write support
   - ✅ Memory-efficient streaming
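
The 1MB chunking can be pictured as plain offset arithmetic. The sketch below is only an illustration and is not taken from the PR's `OakIndexInput`/`OakIndexOutput`: the class `ChunkMath` and all of its method names are hypothetical, and the only fact it relies on is the 1MB chunk size stated above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the PR: maps file positions to fixed-size chunks.
final class ChunkMath {
    static final int CHUNK_SIZE = 1024 * 1024; // 1 MiB, the chunk size named in the PR

    // Index of the chunk that contains the given absolute file position.
    static int chunkIndex(long position) {
        return (int) (position / CHUNK_SIZE);
    }

    // Offset of the position inside its chunk.
    static int offsetInChunk(long position) {
        return (int) (position % CHUNK_SIZE);
    }

    // Number of chunks needed for a file of the given length (the last may be partial).
    static int chunkCount(long length) {
        return (int) ((length + CHUNK_SIZE - 1) / CHUNK_SIZE);
    }

    // Splits a byte array into chunk-sized pieces; the last piece may be shorter.
    static List<byte[]> split(byte[] data) {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += CHUNK_SIZE) {
            int end = Math.min(data.length, start + CHUNK_SIZE);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }
}
```

With this layout a reader only needs to load the chunk returned by `chunkIndex` and seek to `offsetInChunk` within it, which is one plausible way to keep memory use bounded for large index files.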
   
   ## 🧪 Test Results
   
   **All 53 tests pass:**
   - ✅ 16 OakDirectory tests (storage layer)
   - ✅ 7 IndexingFunctionalTest (write path)
   - ✅ 5 LuceneNgComparisonTest (property queries)
   - ✅ 5 IntegrationTest (end-to-end)
   - ✅ 20 additional unit tests (components, tracking, etc.)
   
   **Build:**
   ```
   mvn clean install
   [INFO] Tests run: 53, Failures: 0, Errors: 0, Skipped: 0
   [INFO] BUILD SUCCESS
   ```
   
   ## 🔍 Technical Highlights
   
   ### 1. Proper Full-Text Query Building
   Implements the visitor pattern over the full-text expression tree, matching the behavior of the legacy Lucene index:
   - Tokenizes query text using StandardAnalyzer
   - Builds PhraseQuery for multi-token terms
   - Handles FullTextAnd, FullTextOr, FullTextTerm expressions
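
The shape of such a visitor can be sketched without any Lucene dependency. Everything below is an assumption made for illustration: the `Ft*` types are simplified stand-ins for Oak's full-text expression classes, `QueryDescriber` renders a readable description instead of real Lucene `Query` objects, and `tokenize` is only a rough substitute for StandardAnalyzer (lowercase, split on non-alphanumeric characters).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Simplified stand-ins for a full-text expression tree (hypothetical names).
interface FtExpr {
    String accept(FtVisitor v);
}

interface FtVisitor {
    String visitTerm(FtTerm term);
    String visitAnd(FtAnd and);
    String visitOr(FtOr or);
}

final class FtTerm implements FtExpr {
    final String text;
    FtTerm(String text) { this.text = text; }
    public String accept(FtVisitor v) { return v.visitTerm(this); }
}

final class FtAnd implements FtExpr {
    final List<FtExpr> parts;
    FtAnd(FtExpr... parts) { this.parts = Arrays.asList(parts); }
    public String accept(FtVisitor v) { return v.visitAnd(this); }
}

final class FtOr implements FtExpr {
    final List<FtExpr> parts;
    FtOr(FtExpr... parts) { this.parts = Arrays.asList(parts); }
    public String accept(FtVisitor v) { return v.visitOr(this); }
}

// Walks the tree and describes the query that would be built for each node.
final class QueryDescriber implements FtVisitor {
    // Rough stand-in for StandardAnalyzer: lowercase, split on non-alphanumerics.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // One token becomes a term query, several tokens become a phrase query.
    public String visitTerm(FtTerm term) {
        List<String> tokens = tokenize(term.text);
        if (tokens.isEmpty()) {
            return "empty()";
        }
        if (tokens.size() == 1) {
            return "term(" + tokens.get(0) + ")";
        }
        return "phrase(" + String.join(" ", tokens) + ")";
    }

    public String visitAnd(FtAnd and) { return join("and", and.parts); }

    public String visitOr(FtOr or) { return join("or", or.parts); }

    private String join(String op, List<FtExpr> parts) {
        List<String> rendered = new ArrayList<>();
        for (FtExpr p : parts) {
            rendered.add(p.accept(this));
        }
        return op + "(" + String.join(", ", rendered) + ")";
    }
}
```

For example, `new FtAnd(new FtTerm("Hello World"), new FtTerm("oak")).accept(new QueryDescriber())` yields `and(phrase(hello world), term(oak))`; a real implementation would return Lucene query objects at each step instead of strings.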
   
   ### 2. Shared IndexWriter Pattern
   Root editor creates IndexWriter, child editors share it:
   - Prevents data loss from multiple writers
   - Correct commit semantics across node tree
   - Proper resource cleanup
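
A minimal model of this ownership rule is sketched below. It is not the PR's `LuceneNgIndexEditor`: `SketchWriter` is a hypothetical stand-in for Lucene's `IndexWriter` that only records paths, and `SketchEditor` is a hypothetical editor whose root instance creates the writer while child instances borrow it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an index writer; it only records added and committed paths.
final class SketchWriter {
    final List<String> pending = new ArrayList<>();
    final List<String> committed = new ArrayList<>();
    boolean closed;

    void addDocument(String path) {
        if (closed) {
            throw new IllegalStateException("writer already closed");
        }
        pending.add(path);
    }

    // Makes all pending documents visible and releases the writer.
    void commitAndClose() {
        committed.addAll(pending);
        pending.clear();
        closed = true;
    }
}

// Hypothetical editor: the root owns the single writer, children share it.
final class SketchEditor {
    private final SketchWriter writer;
    private final boolean root;
    private final String path;

    // Root editor: creates the one writer used for the whole traversal.
    SketchEditor() {
        this.writer = new SketchWriter();
        this.root = true;
        this.path = "/";
    }

    // Child editor: reuses the parent's writer instead of opening a second one.
    private SketchEditor(SketchEditor parent, String name) {
        this.writer = parent.writer;
        this.root = false;
        this.path = parent.root ? "/" + name : parent.path + "/" + name;
    }

    SketchEditor child(String name) {
        return new SketchEditor(this, name);
    }

    void index() {
        writer.addDocument(path);
    }

    // Called when the traversal leaves this node; only the root commits and closes.
    void leave() {
        if (root) {
            writer.commitAndClose();
        }
    }

    SketchWriter writer() {
        return writer;
    }
}
```

Because children never commit or close, documents added anywhere in the tree become visible in a single commit when the root is left, which is the property the bullets above describe.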
   
   ### 3. Dynamic NodeBuilder Access
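   The directory builder is looked up from the definition builder on every access instead of being cached, which avoids stale `NodeBuilder` references during commits: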
   ```java
   private NodeBuilder getDirectoryBuilder() {
       return definitionBuilder.child(INDEX_DATA_CHILD_NAME);
   }
   ```
   
   ### 4. Field Strategy
   - **StringField**: Exact matching for property queries (not analyzed)
   - **TextField**: Analyzed text for full-text search (FieldNames.FULLTEXT)
   - **Path storage**: Stored field for cursor results
   
   ## 🔗 Related Issues
   
   - **OAK-12089**: Epic for Lucene 9 migration
   - Builds on exploration work from earlier branches
   
   ## 📝 Notes for Reviewers
   
   1. **Module isolation**: New module doesn't affect existing lucene/elastic 
modules
   2. **Dependency embedding**: Lucene 9.12.2 libs embedded to avoid conflicts
   3. **Test independence**: All tests use in-memory storage, no external 
dependencies
   4. **Apache compliance**: All files have Apache license headers, RAT check 
passes
   
   ## ✅ Checklist
   
   - [x] All tests pass
   - [x] Apache RAT license check passes
   - [x] Code follows Oak patterns (QueryIndex, IndexEditor, Cursor)
   - [x] No backwards compatibility issues (new module, opt-in)
   - [x] Documentation in code comments
   - [x] Test coverage for all major code paths
   
   ---
   
   **Ready for review!** This PR establishes the foundation for Lucene 9 
support in Oak. Phase 2 advanced features and Phase 3 migration can be tackled 
in subsequent PRs.
   

