hongkunxu opened a new issue, #16889:
URL: https://github.com/apache/pinot/issues/16889

   ### Description
   
   Currently, Pinot’s DataIngestionJob has a limitation when performing 
backfill ingestion. The job assumes that the backfill run will generate at 
least as many segments as the original ingestion.
   
   When the backfill input directory contains fewer files than the original 
run, the segment generation job produces fewer segments. As a result, only 
some of the existing segments are replaced, and the remaining old segments 
stay in the table, causing stale data issues.
   
   ### Example
   
   - Suppose table airlineStats has 2 segments for 2014-01-01:
       - airlineStats_2014-01-01_2014-01-01_0
       - airlineStats_2014-01-01_2014-01-01_1
   - The backfill input directory contains only 1 input file for the same date.
   - The segment generation job produces just 1 segment:
       - airlineStats_2014-01-01_2014-01-01_0
   - After pushing, only segment _0 is replaced, while segment _1 from the 
original ingestion is still present, leading to incorrect/stale data.
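
   The example above can be reproduced with a small simulation. This is only a 
sketch: it assumes a push replaces an existing segment exactly when the names 
match, and `push_segments` is a hypothetical helper, not a Pinot API.

```python
def push_segments(table, new_segments):
    """Simulate a name-keyed push (an assumption for illustration):
    a new segment overwrites only the existing segment with the same name;
    every other existing segment is left untouched."""
    updated = dict(table)
    updated.update(new_segments)
    return updated


# Original ingestion: two input files for 2014-01-01 produced two segments.
table = {
    "airlineStats_2014-01-01_2014-01-01_0": "original",
    "airlineStats_2014-01-01_2014-01-01_1": "original",
}

# Backfill: one input file for the same date produces one segment.
backfill = {"airlineStats_2014-01-01_2014-01-01_0": "backfill"}

after_push = push_segments(table, backfill)

# Segments that still carry data from the original run are stale.
stale = sorted(name for name, origin in after_push.items() if origin == "original")
print(stale)
```

   The remaining entry in `stale` is segment _1, which no backfill segment 
overwrote.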
   
   ### Impact
   
   If raw data changes such that a given time bucket has fewer input files than 
in the first ingestion run, backfill fails to fully replace the existing 
segments. This makes it difficult to rely on backfill for correcting historical 
data.
   
   ### Proposal
   
   Introduce a new job, tentatively named BackfillIngestionJob, which is 
designed to correctly handle these edge cases. This job should:
   
   1. Ensure that all original segments in the target time range are 
replaced/removed.
   2. Guarantee that stale data from older segments does not persist after 
backfill.
   3. Provide a consistent and reliable workflow for batch backfill ingestion.
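
   One way to picture requirements 1 and 2 is a planning step that compares 
the existing segments in the target time range with the segments produced by 
the backfill. The sketch below is illustrative only: `Segment`, 
`plan_backfill`, and the rule "a segment is in range when its start and end 
both fall inside the backfill window" are assumptions, not the proposed 
implementation. How to treat segments that only partially overlap the window 
is left open.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Segment:
    # Hypothetical minimal view of a segment: its name and time range.
    name: str
    start: date
    end: date


def plan_backfill(existing, new_segments, range_start, range_end):
    """Return (to_replace, to_delete) segment names for a backfill window.

    to_replace: existing in-range segments overwritten by a same-named new one.
    to_delete:  existing in-range segments with no same-named new segment;
                these would otherwise remain as stale data.
    """
    new_names = {s.name for s in new_segments}
    in_range = [
        s for s in existing
        if s.start >= range_start and s.end <= range_end
    ]
    to_replace = sorted(s.name for s in in_range if s.name in new_names)
    to_delete = sorted(s.name for s in in_range if s.name not in new_names)
    return to_replace, to_delete


day1 = date(2014, 1, 1)
day2 = date(2014, 1, 2)
existing = [
    Segment("airlineStats_2014-01-01_2014-01-01_0", day1, day1),
    Segment("airlineStats_2014-01-01_2014-01-01_1", day1, day1),
    Segment("airlineStats_2014-01-02_2014-01-02_0", day2, day2),
]
new_segments = [Segment("airlineStats_2014-01-01_2014-01-01_0", day1, day1)]

to_replace, to_delete = plan_backfill(existing, new_segments, day1, day1)
print(to_replace)
print(to_delete)
```

   In this sketch, segment _1 for 2014-01-01 lands in `to_delete`, while the 
2014-01-02 segment lies outside the window and is not touched.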


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

