loquisgon commented on issue #11231:
URL: https://github.com/apache/druid/issues/11231#issuecomment-845297560


   I think the reason the ingestion takes that long is that the data is 
intentionally somewhat pathological (even though it simulates a real case in 
production). It is a series of events spanning 30 years, with data on every day 
in between. However, each day contains only on the order of ~100 rows. Thus 
there will be about ~10,000 segments at the end, all pretty small. I see in my 
tests that disk I/O dominates, with almost no CPU utilization. This is because 
of the intermediary writes and the merges at the end. I am running my test on 
my laptop, which may also make things worse (but I don't think so, though I 
noticed that the antivirus sometimes interfered, since it insisted on scanning 
all the tiny intermediate files being created, especially in the random-ingest 
case for the same file). I am using DAY granularity. When I used MONTH for the 
same file, it created 360 segments and took an order of magnitude less time 
(i.e., about 10 times less) than DAY granularity.
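
   For reference, the granularity being compared is set in the ingestion 
spec's `granularitySpec`. A minimal sketch of the relevant fragment (field 
names follow Druid's ingestion spec; the interval shown is illustrative, not 
the actual test data's range):

   ```json
   {
     "granularitySpec": {
       "type": "uniform",
       "segmentGranularity": "MONTH",
       "queryGranularity": "NONE",
       "intervals": ["1990-01-01/2020-01-01"]
     }
   }
   ```

   With `segmentGranularity` set to DAY, 30 years of daily data yields roughly 
30 × 365 ≈ 10,950 segments; with MONTH it yields 30 × 12 = 360, which matches 
the roughly 10x difference in ingestion time observed above.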


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
