loquisgon edited a comment on issue #11231: URL: https://github.com/apache/druid/issues/11231#issuecomment-845297560
@gianm Good points about the code separation. I agree that the two code paths serve different use cases (streaming vs. batch): in streaming we need to query the data as it is being ingested, while in batch we don't, so splitting the classes is a good idea. I also take your point about the superclass; avoiding it is better, since inheritance can be a form of coupling.

I think the reason the ingestion takes that long is that the data is intentionally somewhat pathological (even though it simulates a real production case). It is a series of events spanning 30 years, with data on every day in between, but only on the order of ~100 rows per day. With DAY granularity that yields roughly ~10,000 segments at the end, all pretty small. In my tests disk I/O dominates, with almost no CPU utilization, because of the intermediate persists and the merges at the end. I am running the test on my laptop, which may also make things worse (though I don't think so; I did notice that the antivirus sometimes interfered, since it insisted on scanning all the tiny intermediate files being created, especially in the random-ingest case for the same file). When I used MONTH granularity for the same file, it created 360 segments and took an order of magnitude (i.e. 10x) less time than DAY granularity. The file is a CSV with about 250 columns, roughly 4.3 GB uncompressed.

By the way, I agree with you that ingesting a 1M-row file in 1.5 hours sounds like too much. This is not because of my changes; since the changes remove work, they can only make things faster. One way to speed it up is to realize that for batch ingestion the intermediate persists don't have to be in the "segment" format.
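For context, the segment counts above follow directly from the granularity; a back-of-the-envelope sketch (the row count per day is an order-of-magnitude assumption from the test data, not an exact figure):

```python
# Rough arithmetic behind the segment counts described above
# (assumes 30 years of data and one segment per granularity bucket).
YEARS = 30
ROWS_PER_DAY = 100  # order of magnitude from the test data

day_segments = YEARS * 365    # one segment per day -> ~11k tiny segments
month_segments = YEARS * 12   # one segment per month -> 360 segments
total_rows = YEARS * 365 * ROWS_PER_DAY  # ~1.1M rows, consistent with the ~1M-row file

print(day_segments, month_segments, total_rows)
```

This is why MONTH granularity produces ~30x fewer segments, each ~30x larger, which in turn means far fewer tiny intermediate files for the persist and merge phases to touch.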
If we did intermediate persists (for batch only) in a different format (maybe a log data structure optimized for appends) and then created the real segment only at the final merge/push phase, I believe things would be much faster. I think this would also accelerate unsorted data ingestion. (I don't show data here, but the unsorted case for the test file also works with the changes in the proposal; it takes 13 hours!)

About future work: I used an experimental approach. I knew that OOMs were an issue, so I decided to take a look at dynamic ingestion first, since it is the most basic form. I found these issues, which I believe are orthogonal to other issues. If this proposal is accepted, I plan to implement it, and after that I will apply the same approach to hash & range partitioning. Thoughts?
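To make the append-optimized idea concrete, here is a minimal sketch (hypothetical, not Druid code or Druid's actual persist format): intermediate persists are plain sequential appends to a row log, and sorting plus "segment" construction are deferred to the final merge, which is where this proposal expects to save I/O.

```python
# Hypothetical sketch of batch-only intermediate persists as an
# append-only log, with segment construction deferred to the final merge.
# Names (AppendLogPersist, merge_to_segment) are illustrative only.
import json
import os
import tempfile

class AppendLogPersist:
    """Spill buffered rows to disk as newline-delimited JSON (append-only)."""

    def __init__(self, path):
        self.path = path

    def persist(self, rows):
        # Each intermediate persist is a sequential append: no sorting,
        # no column encoding, no rewriting of earlier spills.
        with open(self.path, "a") as f:
            for row in rows:
                f.write(json.dumps(row) + "\n")

    def merge_to_segment(self, sort_key):
        # Only at the final merge/push phase do we sort and (conceptually)
        # build the real segment; here the "segment" is just a sorted list.
        with open(self.path) as f:
            rows = [json.loads(line) for line in f]
        return sorted(rows, key=lambda r: r[sort_key])

# Usage: two intermediate persists (possibly out of order), one final merge.
path = os.path.join(tempfile.mkdtemp(), "persist.log")
log = AppendLogPersist(path)
log.persist([{"t": 3, "v": "c"}, {"t": 1, "v": "a"}])
log.persist([{"t": 2, "v": "b"}])
segment = log.merge_to_segment("t")
print([r["t"] for r in segment])  # [1, 2, 3]
```

The point of the sketch is the shape of the I/O, not the format: appends are cheap even when input arrives unsorted, which is why this approach should also help the unsorted-ingest case mentioned above.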
