loquisgon edited a comment on issue #11231: URL: https://github.com/apache/druid/issues/11231#issuecomment-842544860
@jihoonson I see your point that you still need clarification of what needs to be done. Still, I am hesitant to do another pass on the document above because it might muddle things further. Instead, let me state my concrete plan precisely and briefly. Analysis of the code, tests, and preliminary coding strongly suggests that keeping the data structures for `Sink` and `FireHydrant` in memory can make ingestion run out of memory. Therefore my plan is pretty simple:

1. After each persist, remove all references to the `Sink` and `FireHydrant` objects, keeping just enough metadata in memory to recover them from disk later as needed (i.e. the directory path for the `Sink`, metadata about the `Sink` such as the number of rows in memory so far, etc.).
2. When new data arrives after a persist during the same ingestion for the file, recreate the `Sink` as usual and create a new `FireHydrant`.
3. Repeat (1-2) as long as rows from the file are being added.
4. At the end of processing all rows for the input file, just before the final `push`, first make sure all in-memory `Sink`s at this point are persisted & closed (as in 1, to control memory utilization). Then, for each `Sink`, one by one: recover the `Sink` & its `FireHydrant`s from disk, merge the `FireHydrant`s, and push the `Sink`.
5. Occasionally, when `maxRowsPerSegment` is hit in the `InputSourceProcessor` right after a row was added, a push (4) will happen as well, and then rows continue to be added as usual (2).

All the persist & push actions are now strictly sequential & synchronous to guarantee correctness. The scope of this proposal is strictly limited to the above in order to manage risk & complexity while still delivering important value (i.e. drastically reducing the probability of OOM in these cases). The introduction of a new `Appenderator` is just common software engineering once we understand that batch ingestion really should have a different code path from the real-time case.
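
To make the five steps above concrete, here is a minimal toy sketch in Java. It is not Druid code: the class `BatchAppenderatorSketch`, the `SinkMetadata` type, and all method names are hypothetical, rows are plain strings, and "disk" is simulated with lists held inside the metadata object. It only illustrates the control flow (persist, drop references, recreate on demand, merge and push at the end).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model only: none of these types or methods are Druid APIs. */
public class BatchAppenderatorSketch {
    /** The small amount of state kept per segment after its sink is dropped. */
    static final class SinkMetadata {
        // Stands in for hydrant files under the sink's directory on disk.
        final List<List<String>> persistedHydrants = new ArrayList<>();
        int totalRows = 0;
    }

    private final Map<String, List<String>> inMemorySinks = new HashMap<>();
    private final Map<String, SinkMetadata> metadata = new HashMap<>();
    private final Map<String, List<String>> pushed = new HashMap<>();
    private final int maxRowsInMemory;
    private int rowsInMemory = 0;

    public BatchAppenderatorSketch(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
    }

    /** Steps 2-3: recreate the sink if it was dropped, then add the row. */
    public void add(String segmentId, String row) {
        inMemorySinks.computeIfAbsent(segmentId, id -> new ArrayList<>()).add(row);
        rowsInMemory++;
        if (rowsInMemory >= maxRowsInMemory) {
            persistAll(); // synchronous, no background thread
        }
    }

    /** Step 1: write every in-memory sink out, then drop all references to it. */
    public void persistAll() {
        for (Map.Entry<String, List<String>> e : inMemorySinks.entrySet()) {
            SinkMetadata md = metadata.computeIfAbsent(e.getKey(), id -> new SinkMetadata());
            md.persistedHydrants.add(new ArrayList<>(e.getValue()));
            md.totalRows += e.getValue().size();
        }
        inMemorySinks.clear();
        rowsInMemory = 0;
    }

    /** Step 4: persist what is left, then recover, merge, and push sinks one by one. */
    public void pushAll() {
        persistAll();
        for (Map.Entry<String, SinkMetadata> e : metadata.entrySet()) {
            List<String> merged = new ArrayList<>();
            for (List<String> hydrant : e.getValue().persistedHydrants) {
                merged.addAll(hydrant); // "merge" is plain concatenation in this toy
            }
            pushed.put(e.getKey(), merged);
        }
        metadata.clear();
    }

    public int inMemorySinkCount() {
        return inMemorySinks.size();
    }

    public List<String> pushedRows(String segmentId) {
        return pushed.get(segmentId);
    }
}
```

In this sketch a caller would model step 5 by invoking `pushAll()` whenever its own row counter reaches a per-segment limit and then continuing to call `add`. Because every call runs to completion on the calling thread, persists and pushes are strictly sequential, which is the property the plan relies on for correctness.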
I believe that this physical & conceptual separation of the code will open up new critical opportunities (such as not using the `Sink` and `FireHydrant` data structures & layout for intermediate persists of batch, and maybe even introducing pre-sorting), but these future opportunities are out of scope for this proposal. So the end result is that when the proposal is implemented and merged, the code will most probably still use previous patterns and data structures that may need to be improved & cleaned up in the future. Again, this is done for agility and incremental value delivery.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
