Re: Spark-based ingestion into Druid

Julian Jaffe Tue, 03 Mar 2020 14:02:12 -0800

I've submitted https://github.com/apache/druid/pull/9454 today to add an
`OnHeapMemorySegmentWriteOutMediumFactory`.


On Mon, Mar 2, 2020 at 8:57 AM Oğuzhan Mangır <oguzhan.man...@trendyol.com>
wrote:

>
>
> On 2020/02/26 13:26:13, itai yaffe <itai.ya...@gmail.com> wrote:
> > Hey,
> > Per Gian's proposal, and following this thread in the Druid user group (
> > https://groups.google.com/forum/#!topic/druid-user/FqAuDGc-rUM) and this
> > thread in the Druid Slack channel (
> > https://the-asf.slack.com/archives/CJ8D1JTB8/p1581452302483600), I'd like
> > to start discussing the options for Spark-based ingestion into Druid.
> >
> > There's already an old project (https://github.com/metamx/druid-spark-batch)
> > for that, so perhaps we can use that as a starting point.
> >
> > The thread on Slack suggested 2 approaches:
> >
> >    1. *Simply replacing the Hadoop MapReduce ingestion task* - having a
> >    Spark batch job that ingests data into Druid, as a simple replacement of
> >    the Hadoop MapReduce ingestion task.
> >    Meaning - your data pipeline will have a Spark job to pre-process the
> >    data (similar to what some of us have today), and another Spark job to
> >    read the output of the previous job and create Druid segments (again -
> >    following the same pattern as the Hadoop MapReduce ingestion task).
> >    2. *Druid output sink for Spark* - rather than having 2 separate Spark
> >    jobs, 1 for pre-processing the data and 1 for ingesting the data into
> >    Druid, you'll have a single Spark job that pre-processes the data and
> >    creates Druid segments directly, e.g. sparkDataFrame.write.format("druid")
> >    (as suggested by omngr on Slack).
> >
> >
> > I personally prefer the 2nd approach - while it might be harder to
> > implement, its benefits seem greater.
> >
> > I'd like to hear your thoughts and to start getting this ball rolling.
> >
> > Thanks,
> >            Itai
> >
