It's my pleasure to be a mentor of Datark. I'm looking forward to the
feedback on the incubation proposal.

Cheers,

Jiangjie (Becket) Qin

On Thu, Sep 22, 2022 at 11:45 AM Yu Li <car...@gmail.com> wrote:

> Hi All,
>
> I would like to propose Datark [1] as a new Apache Incubator project. You
> can find more details in the Datark proposal [2].
>
> Datark is an intermediate (shuffle and spilled) data service for big data
> compute engines (Apache Spark, Apache Flink, Apache Hive, etc.) to boost
> performance, stability, and flexibility. It aims at enabling computing
> engines to fully embrace the disaggregated architecture. In a lot of cases,
> intermediate data depends on large local disks, and is often a major cause
> of inefficiency, instability, and inflexibility in the lifecycle of a
> distributed job. Datark solves the problems through the following core
> designs:
>
> 1. Push-based shuffle plus partition data aggregation to turn random IO
> access into sequential access.
> 2. FileSystem-like API to support writing spilled data.
> 3. Hierarchical storage from memory to DFS/object store to enable fast
> cache and massive storage space.
> 4. Engine-agnostic APIs for easy integration with various engines.
> 5. Extended fault tolerance and data replication to increase reliability.
>
> Datark is currently used in production at Alibaba and many other
> companies, serving petabytes of data per day. Beyond that, it has more
> open source users, including Shopee, NetEase, Bilibili, BOSS, and Synnex.
> Most of these users have made contributions to the project, forming an
> active community with dozens of developers.
>
> The proposed initial committers are interested in joining the ASF to
> reinforce extensive collaboration and build a more vibrant community. We
> believe the Datark project will provide tremendous value for the community
> if it is introduced into the Apache Incubator.
>
> I will serve as the champion for this project. Many thanks to our four
> other mentors:
>
> * Becket Qin (j...@apache.org)
> * Duo Zhang (zhang...@apache.org)
> * Lidong Dai (lidong...@apache.org)
> * Willem Jiang (ningji...@apache.org)
>
> FWIW, although the solutions differ, the issues Datark aims to resolve
> have some overlap with Apache Uniffle (incubating) [3]. We noticed this
> during the discussion phase of Uniffle's incubation (when we were also
> preparing for incubation) and had some open and friendly discussion about
> whether we could join forces [4]. In the end, we decided to develop
> independently for the time being [5].
>
> Looking forward to your feedback. Thanks.
>
> Best Regards,
> Yu
>
> [1] https://github.com/alibaba/RemoteShuffleService
> [2] https://cwiki.apache.org/confluence/display/INCUBATOR/DatarkProposal
> [3] https://uniffle.apache.org/
> [4] https://lists.apache.org/thread/1w74z5f0pb7bhslhzcl5x7rdj9s9objz
> [5] https://lists.apache.org/thread/pg8lzhzc1794x3yloqp169j0mdzqs3yw
>
