[GitHub] [druid] loquisgon edited a comment on issue #11231: Minimize memory utilization in Sinks/Hydrants for native batch ingestion

GitBox Mon, 17 May 2021 11:47:46 -0700


loquisgon edited a comment on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-842544860



   @jihoonson I see your point that you still need clarification of what needs 
to be done. I am hesitant to do another pass on the document above because it 
might muddle things further, so let me instead state my concrete plan briefly 
and precisely. Analysis of the code, tests and preliminary coding strongly 
suggest that keeping the data structures for `Sink` and `Firehydrant` in memory 
can make ingestion run out of memory. Therefore my plan is pretty simple. 
   
   1.  After each persist, remove all references to `Sink` and `Firehydrant`, 
keeping just enough metadata in memory to recover them from disk later as 
needed (i.e. directory path for the `Sink`, metadata about the `Sink` like 
number of rows in memory so far, etc.) 
   2.  When new data arrives after a persist during the same ingestion for the 
file, recreate the `Sink` as usual and create a new `Firehydrant`.
   3.  Repeat (1-2) as long as rows from the file are being added.
   4.  At the end of processing all rows for the input file, just before the 
final `push`, first make sure every `Sink` still in memory at this point is 
persisted & closed (as in 1, to control memory utilization). Then, for each 
`Sink` one by one, recover the `Sink` & its `Firehydrant`s from disk, merge the 
`Firehydrant`s and push the `Sink`.
   5.  Occasionally, when `maxRowsPerSegment` is hit in the 
`InputSourceProcessor` right after a row was added, a push (as in 4) happens as 
well, and then rows continue to be added as usual (2).
   
   All the persist & push actions are now strictly sequential & synchronous to 
guarantee correctness.
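   
   To make steps 1-4 concrete, here is a minimal, self-contained sketch of the 
persist/recover cycle. It is only an illustration under loud assumptions: the 
class `BatchAppenderatorSketch`, its methods and the `SinkMetadata` holder are 
hypothetical names and not Druid APIs, a sink is modeled as a plain list of 
rows, and "disk" is simulated with a map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only: none of these names are real Druid classes or methods.
class BatchAppenderatorSketch {
    // Step 1: the only per-sink state that stays in memory after a persist.
    static final class SinkMetadata {
        int persistedHydrants = 0;
        int persistedRows = 0;
    }

    private final int maxRowsInMemory;
    // Segment id -> rows of the sink currently held in memory.
    private final Map<String, List<String>> inMemorySinks = new HashMap<>();
    private final Map<String, SinkMetadata> metadata = new HashMap<>();
    // Simulated disk: segment id -> persisted hydrants, each a list of rows.
    private final Map<String, List<List<String>>> disk = new HashMap<>();
    private int rowsInMemory = 0;

    BatchAppenderatorSketch(int maxRowsInMemory) {
        this.maxRowsInMemory = maxRowsInMemory;
    }

    // Step 2: the sink is (re)created lazily when a row arrives for it.
    void add(String segmentId, String row) {
        inMemorySinks.computeIfAbsent(segmentId, k -> new ArrayList<>()).add(row);
        rowsInMemory++;
        if (rowsInMemory >= maxRowsInMemory) {
            persistAll(); // synchronous: returns only after everything is on "disk"
        }
    }

    // Step 1: write every in-memory sink as a new hydrant, then drop all references.
    void persistAll() {
        for (Map.Entry<String, List<String>> e : inMemorySinks.entrySet()) {
            disk.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
            SinkMetadata m = metadata.computeIfAbsent(e.getKey(), k -> new SinkMetadata());
            m.persistedHydrants++;
            m.persistedRows += e.getValue().size();
        }
        inMemorySinks.clear();
        rowsInMemory = 0;
    }

    // Step 4: persist what is left, then recover, merge and push sinks one by one.
    Map<String, List<String>> push() {
        persistAll();
        Map<String, List<String>> pushed = new HashMap<>();
        for (String segmentId : new ArrayList<>(metadata.keySet())) {
            List<String> merged = new ArrayList<>();
            for (List<String> hydrant : disk.remove(segmentId)) {
                merged.addAll(hydrant); // "merge" is plain concatenation in this sketch
            }
            pushed.put(segmentId, merged);
            metadata.remove(segmentId);
        }
        return pushed;
    }

    int inMemorySinkCount() {
        return inMemorySinks.size();
    }

    int persistedHydrantCount(String segmentId) {
        SinkMetadata m = metadata.get(segmentId);
        return m == null ? 0 : m.persistedHydrants;
    }
}
```

   In this sketch a caller would invoke `add` for every row (step 3) and `push` 
once at the end of the file. Step 5 would amount to calling `push` from the 
row-adding loop whenever the assumed `maxRowsPerSegment` threshold is reached. 
Every method is single-threaded and blocking, which mirrors the sequential, 
synchronous behavior described above.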
   
   Therefore the scope of this proposal is strictly limited to the above, to 
manage risk & complexity while still delivering important value (i.e. 
drastically reducing the probability of OOM in these cases). Introducing a new 
`Appenderator` is just sound software engineering once we accept that batch 
ingestion really should have a different code path from the real-time case. I 
believe that this physical & conceptual separation of the code will open up new 
critical opportunities (such as not using the `Sink` and `Firehydrant` data 
structure & layout for intermediate persists of batch, and maybe even 
introducing pre-sorting), but these future opportunities are out of scope for 
this proposal. So when the proposal is implemented and merged, the code will 
most probably still use previous patterns and data structures that may need to 
be improved & cleaned up in the future. Again, this is done for agility and 
incremental value delivery.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



