RS146BIJAY commented on issue #13387: URL: https://github.com/apache/lucene/issues/13387#issuecomment-2470112408
After some more analysis, I figured out an approach that addresses all the above comments and obtains the same improvement with a different IndexWriter for each group as we got with different DWPTs for each group.

## Using separate IndexWriter for maintaining different tenants with a combined view

### Current Issue

Maintaining a separate IndexWriter for each group (tenant) presents a significant problem: the writers do not function as a single unified entity. Although distinct IndexWriters and directories for each group ensure that data belonging to different groups is kept in separate segments and that only segments within the same group are merged, the client (OpenSearch) still needs a unified read-only view to interact with these multiple group-level IndexWriters.

Lucene’s addIndexes API offers a way to combine group-level IndexWriters into a single parent-level IndexWriter, but this approach has multiple drawbacks:

1. Since writes may continue on group-level IndexWriters, periodic synchronisation with the parent-level IndexWriter is necessary.
2. During synchronisation, an external lock needs to be placed on the group-level IndexWriter directory, causing downtime.
3. Synchronisation also involves copying files from the group-level IndexWriter directory to the parent IndexWriter directory, which is resource-intensive, consuming disk IO and CPU cycles.

### Proposal

To address this issue, we propose introducing a mechanism that combines group-level IndexWriters as a soft reference to a parent IndexWriter. This will be achieved by creating a new variant of the addIndexes API within IndexWriter, which will only combine the SegmentInfos of the group-level IndexWriters without requiring an external lock or copying files across directories. Group-level segments will be maintained in separate directories associated with their respective group-level IndexWriters.
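As a rough illustration of the metadata-only combination, the sketch below models a parent view that merely references group-level segment names and tags each with its group so that names from different groups cannot collide. It is plain Java and does not use Lucene: the class `CombinedSegmentView`, its methods, and the `-` separator are all hypothetical names chosen for this example, not part of the proposed addIndexes variant.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of a parent "combined view"; this is NOT Lucene's API.
final class CombinedSegmentView {
    // Assumed separator between the group prefix and the original segment name.
    private static final char SEPARATOR = '-';

    // group name -> segment names currently referenced for that group
    private final Map<String, List<String>> segmentsByGroup = new LinkedHashMap<>();

    // Replace the referenced segment names for one group.
    // Only metadata is stored; nothing is copied between directories.
    void sync(String group, List<String> segmentNames) {
        if (group.isEmpty() || group.indexOf(SEPARATOR) >= 0) {
            throw new IllegalArgumentException("invalid group name: " + group);
        }
        segmentsByGroup.put(group, new ArrayList<>(segmentNames));
    }

    // All segment names visible through the parent, each prefixed with its group.
    List<String> combinedSegmentNames() {
        List<String> combined = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : segmentsByGroup.entrySet()) {
            for (String segment : entry.getValue()) {
                combined.add(entry.getKey() + SEPARATOR + segment);
            }
        }
        return combined;
    }

    // Split a prefixed name back into { group, original segment name }.
    static String[] resolve(String prefixedName) {
        int idx = prefixedName.indexOf(SEPARATOR);
        if (idx <= 0) {
            throw new IllegalArgumentException("missing group prefix: " + prefixedName);
        }
        return new String[] {
            prefixedName.substring(0, idx),
            prefixedName.substring(idx + 1)
        };
    }
}
```

In this model, calling `sync` again for a group simply replaces that group's references, which mirrors a periodic sync. Splitting a prefixed name with `resolve` is the same step a prefix-aware directory would need in order to route a file operation to the right group-level directory.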
The client will periodically call this addIndexes API on the parent IndexWriter (on the OpenSearch side this corresponds to the index refresh interval of 1 sec), passing the SegmentInfos of the child-level IndexWriters as parameters to sync the latest SegmentInfos with the parent IndexWriter. While combining the SegmentInfos of child-level IndexWriters, the addIndexes API will attach a prefix to the segment names to identify the group each segment belongs to, avoiding name conflicts between segments of different group-level IndexWriters.

![compositeIndexWriter drawio (1)](https://github.com/user-attachments/assets/8ddd8568-a352-41ac-bc42-ce3cb4647f8f)

The parent IndexWriter will be associated with a filter directory that distinguishes the tenant by the file name prefix, redirecting any read/write operation on a file to the correct group-level directory based on the segment file name prefix.

#### Reason for choosing common view as an IndexWriter

Most interactions of the client (OpenSearch) with Lucene, such as opening a reader, getting the latest commit info, reopening a Lucene index, etc., occur via IndexWriter itself. Thus selecting IndexWriter as the common view made more sense.

### Improvements with multiple IndexWriter with a combined view

We were able to observe around 50% - 60% improvement with the multiple-IndexWriter-with-a-combined-view approach, similar to what we observed by having different DWPTs for different tenants (the initial proposal).

### Considerations

1. The referencing IndexWriter will be a combined read-only view over the group-level IndexWriters. Since this IndexWriter does not itself have any segments and only references the SegmentInfos of other IndexWriters, write operations like segment merge, flush, etc. should not be performed on this parent IndexWriter instance.
2. We need to account for the prefix attached before segment names when [parsing segment names](https://github.com/RS146BIJAY/lucene/blob/84811e974f38181b0c1f1e1b5655f674a1584385/lucene/core/src/java/org/apache/lucene/index/IndexFileNames.java#L119).
3. It will be difficult to support update queries with the multi-IndexWriter approach. For example, if we are grouping logs by status code and a user updates the status code field of a log, then for Lucene the insert and update operations need to be performed on separate delete queues.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org