sascha-coenen commented on issue #5543: [Proposal] Native parallel batch 
indexing
URL: 
https://github.com/apache/incubator-druid/issues/5543#issuecomment-460811816
 
 
   This proposal is AWESOME!!!
   Well done!! I cannot wait to see the two-phase shuffle. This is SO much 
needed.
   
   I read in the comments section of related PRs about why one would need yet 
another data processing framework and what the issues with Spark/Hadoop would 
be. 
   This puzzles me for the following reason:
   I have tasked several people with finding out how to combine batch 
processing and stream processing for Druid, and although a lot of time was 
sunk into the subject, not a SINGLE person, myself included, was able to come 
up with a viable solution. Let it be said, too, that we have been running a 
million-dollar Druid cluster for several years now and keep trillions of 
records in it. 
   So we are neither new to Druid nor idiots, and yet we keep scratching our 
heads about how to put the pieces together. 
   
   In my opinion, Druid needs native indexing support more than anything, 
especially in the context of gaining wider adoption and growing the community.
   
   I very much hope that more and more people can join this effort. Most 
database systems come with native DML support, and accordingly competitor 
products such as MPP databases like Vertica have native support for ingesting 
big-data workloads. 
   Native batch indexing support would not only make Druid more competitive 
and an easier sell; strategically, it is also an enabler for advanced setups, 
like running Druid on Kubernetes. Containerizing Hadoop/Spark alone is far 
from a small effort, and doing it in a way that plays nicely with Druid 
requires handcrafting the whole setup.
   The MiddleManager, however, can easily be containerized (although it would 
be even nicer if there weren't any peons, I guess), which in turn is a segue 
to co-locating different workloads on the same hardware. Achieving this for an 
ecosystem that encompasses Spark/Hadoop is something only large companies with 
deep pockets and a budget for in-house customizations can pull off.
   
   The second most needed feature is OLAP cubing (materialized views), which 
was recently added to Druid 0.13 as a prototype but currently requires a 
Hadoop cluster. So folks who went with Spark-based indexing cannot use it 
unless they reinvent the wheel by adding support for it themselves.
   In this sense, it is NOT the creation of a native processing framework that 
is "re-inventing the wheel"; on the contrary, it is precisely the previously 
chosen approach of relying on external processing frameworks that deserves 
this label.
    
   ---
   
   >> I'm going crazy because the library versions of Hadoop and Druid can't 
match
   +1
   
   ---
   
   >> I'm not sure about sharing the same shuffle system by both indexing and 
querying now because they need different requirements.
   +1
   Great thinking on jihoonson's part to propose this, but in the spirit of 
taking baby steps, it seems one should first keep things simple by thinking 
about this in isolation. Whether and how existing subsystems of Druid could be 
unified can then become an unrelated follow-up research task.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
