IMO, if you think it makes sense to contribute an interface to it too (some
option that triggers it), then I'd say write up a short proposal (motivation
/ proposed change) as a GitHub issue as the first step towards contributing.
I don't think I understand it well enough from your description, so an issue
would help with that. I also don't feel that we've totally decided on what
the process should be there, but that's the direction the conversation in
the other thread seems to be going for how new changes should start.

On Fri, Jan 11, 2019 at 1:26 PM Julian Jaffe <jja...@pinterest.com.invalid>
wrote:

> The new shard spec
>
> On Fri, Jan 11, 2019 at 8:37 AM Gian Merlino <g...@apache.org> wrote:
>
> > Do you mean the modifications to the metamx/druid-spark-batch project or
> > the new ShardSpec you're working on?
> >
> > On Thu, Jan 10, 2019 at 3:09 PM Julian Jaffe <jja...@pinterest.com.invalid>
> > wrote:
> >
> > > Thanks for the detailed pointers, Gian! In light of the ongoing
> > > discussion around on-list development, does this seem like something
> > > that's worthwhile to anyone else in the community?
> > >
> > > On Tue, Jan 8, 2019 at 10:32 AM Gian Merlino <g...@apache.org> wrote:
> > >
> > > > Hey Julian,
> > > >
> > > > There aren't any gotchas that I can think of other than the fact that
> > > > they are not super well documented, and you might miss some features
> > > > if you're just skimming the code. A couple of points that might matter:
> > > >
> > > > 1) PartitionChunk is what allows a shard spec to contribute to the
> > > > definition of whether a set of segments for a time chunk is
> > > > "complete". It's an important concept since the broker will not query
> > > > segment sets unless the chunk is complete. The way the completeness
> > > > check works is basically that the broker will get the ShardSpecs for
> > > > all the segments in a time chunk, order them by partitionNum, generate
> > > > the partition chunks, and check that (a) the first one is a starter
> > > > (based on "isStart"), (b) each subsequent one abuts the previous one
> > > > (based on "abuts"), and (c) the last one is an ender (based on
> > > > "isEnd"). Some ShardSpecs use nonsensical-at-first-glance logic in
> > > > these methods to short-circuit the completeness checks: time chunks
> > > > with LinearShardSpecs are _always_ considered complete, and time
> > > > chunks with NumberedShardSpecs can have "partitionNum" go beyond
> > > > "partitions" and are considered complete once the first "partitions"
> > > > segments are present.
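> > > >
> > > > To make that concrete, here's a minimal sketch of the check as I
> > > > described it (not Druid's actual broker code; PartitionChunk is
> > > > simplified down to just the three methods involved):
> > > >
> > > > import java.util.List;
> > > >
> > > > interface PartitionChunk<T> {
> > > >   boolean isStart();
> > > >   boolean isEnd();
> > > >   boolean abuts(PartitionChunk<T> other);
> > > > }
> > > >
> > > > class CompletenessCheck {
> > > >   // "chunks" must already be ordered by partitionNum.
> > > >   static <T> boolean isComplete(List<PartitionChunk<T>> chunks) {
> > > >     if (chunks.isEmpty() || !chunks.get(0).isStart()) {
> > > >       return false; // (a) the first chunk must be a starter
> > > >     }
> > > >     for (int i = 1; i < chunks.size(); i++) {
> > > >       if (!chunks.get(i).abuts(chunks.get(i - 1))) {
> > > >         return false; // (b) each chunk must abut the previous one
> > > >       }
> > > >     }
> > > >     // (c) the last chunk must be an ender
> > > >     return chunks.get(chunks.size() - 1).isEnd();
> > > >   }
> > > > }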
> > > >
> > > > 2) "getDomainDimensions" and "possibleInDomain" are optional, but
> > useful
> > > > for powering broker-side segment pruning.
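> > > >
> > > > As a hypothetical sketch (the "sourceName" dimension and class name
> > > > here are made up, not an existing Druid class), a shard spec might
> > > > implement those two methods for pruning like this:
> > > >
> > > > import com.google.common.collect.RangeSet;
> > > > import java.util.Collections;
> > > > import java.util.List;
> > > > import java.util.Map;
> > > >
> > > > class SourceNamePruning {
> > > >   private final String sourceName;
> > > >
> > > >   SourceNamePruning(String sourceName) {
> > > >     this.sourceName = sourceName;
> > > >   }
> > > >
> > > >   // The dimensions this shard spec knows how to prune on.
> > > >   public List<String> getDomainDimensions() {
> > > >     return Collections.singletonList("sourceName");
> > > >   }
> > > >
> > > >   // Return false only if the query's filter domain provably excludes
> > > >   // this segment; the broker can then skip it entirely.
> > > >   public boolean possibleInDomain(Map<String, RangeSet<String>> domain) {
> > > >     RangeSet<String> range = domain.get("sourceName");
> > > >     return range == null || range.contains(sourceName);
> > > >   }
> > > > }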
> > > >
> > > > 3) All segments in the same time chunk must have the same kind of
> > > > ShardSpec. However, it can vary from time chunk to time chunk within
> > > > a datasource.
> > > >
> > > > On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe <jja...@pinterest.com.invalid>
> > > > wrote:
> > > >
> > > > > Hey all,
> > > > >
> > > > > Are there any major caveats or gotchas I should be aware of when
> > > > > implementing a new ShardSpec? The context here is that we have a
> > > > > datasource that is the combined result of multiple input jobs. We're
> > > > > trying to do write-side joining by having all of the jobs write
> > > > > segments for the same intervals (e.g. partitioning on both partition
> > > > > number and source pipeline). For now, I've modified the Spark-Druid
> > > > > batch ingestor (https://github.com/metamx/druid-spark-batch) to run
> > > > > in our various pipelines and to write out segments with identifiers
> > > > > of the form
> > > > > `dataSource_startInterval_endInterval_version_sourceName_partitionNum`.
> > > > > This is working without issue for loading, querying, and deleting
> > > > > data, but the metadata API reports an incorrect segment identifier,
> > > > > since it reconstructs the identifier instead of reading it from
> > > > > metadata (e.g. it reports segment identifiers of the form
> > > > > `dataSource_startInterval_endInterval_version_partitionNum`). Both
> > > > > because we'd like this to be fully supported, and because we imagine
> > > > > this feature may be useful to others, I'd like to implement it via a
> > > > > ShardSpec.
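> > > > >
> > > > > For illustration, here's a rough sketch of the shape such a
> > > > > ShardSpec might take (the class and field names are hypothetical,
> > > > > not an existing Druid class):
> > > > >
> > > > > import com.fasterxml.jackson.annotation.JsonCreator;
> > > > > import com.fasterxml.jackson.annotation.JsonProperty;
> > > > >
> > > > > public class SourcePartitionedShardSpec {
> > > > >   // The pipeline that produced this segment; it becomes part of
> > > > >   // the segment identifier alongside the partition number.
> > > > >   private final String sourceName;
> > > > >   private final int partitionNum;
> > > > >
> > > > >   @JsonCreator
> > > > >   public SourcePartitionedShardSpec(
> > > > >       @JsonProperty("sourceName") String sourceName,
> > > > >       @JsonProperty("partitionNum") int partitionNum
> > > > >   ) {
> > > > >     this.sourceName = sourceName;
> > > > >     this.partitionNum = partitionNum;
> > > > >   }
> > > > >
> > > > >   @JsonProperty
> > > > >   public String getSourceName() { return sourceName; }
> > > > >
> > > > >   @JsonProperty
> > > > >   public int getPartitionNum() { return partitionNum; }
> > > > > }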
> > > > >
> > > > > Julian
> > > > >
> > > >
> > >
> >
>
