itschrispeck opened a new pull request, #12744:
URL: https://github.com/apache/pinot/pull/12744

   **Problem:**
   We typically see long (7-10 min) segment build times when using the Lucene 
text index with 1-1.5 GB segments. 70-80% of this time is spent building the 
Lucene text index. 
   
   **Background:**
   In the existing implementation the Lucene index stores Pinot docIds: for the 
mutable segment these are the 'mutable' docIds, and for the immutable segment we 
store each row with its new docId. Lucene queries return the matching Lucene 
docIds, which we translate to Pinot docIds on the fly for the mutable index, or 
via a mapping file for the immutable index. 
   
   **Change Summary:**
   This change reuses the mutable Lucene index by copying it during realtime 
segment conversion, instead of building a new Lucene index. To handle the 
potential docId change, `sortedDocIds` is added to `IndexCreationContext` and 
used to compute a temporary mapping between the mutable docId and the immutable 
segment's docId. This temporary mapping is used during segment conversion to 
build the mapping file between the Lucene docId and the new immutable segment's 
docId. The mapping file is therefore built during segment conversion, rather 
than during segment load as in the traditional path.
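
   As a rough illustration of the remapping described above, the sketch below 
inverts a sort order into a mutable-to-immutable docId mapping and then composes 
it with the mutable docId stored for each Lucene document. It is not the code in 
this PR: the class and method names are hypothetical, and it assumes that 
`sortedDocIds[i]` holds the mutable docId of the row that ends up at immutable 
docId `i`, and that a null `sortedDocIds` means the row order is unchanged.

```java
import java.util.Arrays;

// Hypothetical sketch; none of these names come from the Pinot codebase.
final class DocIdMappingSketch {

  // Assumption: sortedDocIds[newDocId] == mutableDocId.
  // Returns the inverse, so that result[mutableDocId] == newDocId.
  static int[] mutableToImmutable(int[] sortedDocIds, int numDocs) {
    int[] mapping = new int[numDocs];
    if (sortedDocIds == null) {
      // Assumption: no re-sorting, so docIds carry over unchanged.
      for (int docId = 0; docId < numDocs; docId++) {
        mapping[docId] = docId;
      }
      return mapping;
    }
    for (int newDocId = 0; newDocId < sortedDocIds.length; newDocId++) {
      mapping[sortedDocIds[newDocId]] = newDocId;
    }
    return mapping;
  }

  // luceneToMutable[luceneDocId] stands in for the mutable Pinot docId stored
  // with each Lucene document. Returns result[luceneDocId] == immutable docId,
  // which is the content a mapping file would hold.
  static int[] luceneToImmutable(int[] luceneToMutable, int[] mutableToImmutable) {
    int[] result = new int[luceneToMutable.length];
    for (int luceneDocId = 0; luceneDocId < luceneToMutable.length; luceneDocId++) {
      result[luceneDocId] = mutableToImmutable[luceneToMutable[luceneDocId]];
    }
    return result;
  }

  public static void main(String[] args) {
    // Rows 2, 0, 3, 1 of the mutable segment become rows 0, 1, 2, 3 after sorting.
    int[] sortedDocIds = {2, 0, 3, 1};
    int[] temporary = mutableToImmutable(sortedDocIds, 4);
    // Here Lucene docIds happen to equal the mutable docIds.
    int[] luceneToMutable = {0, 1, 2, 3};
    System.out.println(Arrays.toString(luceneToImmutable(luceneToMutable, temporary)));
  }
}
```

   The temporary array is only needed while the mapping is being built, which 
matches the description of it as a conversion-time structure.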
   
   Internally we've seen a roughly 40-60% improvement in overall segment build 
time. In the ingestion delay graph below, the lower peaks are from a 
table/tenant with this change; the higher peaks are from an identical table in 
a tenant without this change:
   
   <img width="1017" alt="image" 
src="https://github.com/apache/pinot/assets/27231838/0ab23a4c-f7d3-4332-9c5b-e662925c6f9c">
   
   Testing: deployed internally, tested locally, and validated basic 
pause/restart/reload operations on a table to ensure no regression in the 
TextIndexHandler index build.
   
   tags: `ingestion` `performance`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@pinot.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

