The new shard spec

On Fri, Jan 11, 2019 at 8:37 AM Gian Merlino <g...@apache.org> wrote:
> Do you mean the modifications to the metamx/druid-spark-batch project or the new ShardSpec you're working on?
>
> On Thu, Jan 10, 2019 at 3:09 PM Julian Jaffe <jja...@pinterest.com.invalid> wrote:
>
> > Thanks for the detailed pointers, Gian! In light of the ongoing discussion around on-list development, does this seem like something that's worthwhile to anyone else in the community?
> >
> > On Tue, Jan 8, 2019 at 10:32 AM Gian Merlino <g...@apache.org> wrote:
> >
> > > Hey Julian,
> > >
> > > There aren't any gotchas that I can think of other than the fact that they are not super well documented, and you might miss some features if you're just skimming the code. A couple of points that might matter:
> > >
> > > 1) PartitionChunk is what allows a shard spec to contribute to the definition of whether a set of segments for a time chunk is "complete". It's an important concept since the broker will not query segment sets unless the chunk is complete. The way the completeness check works is basically that the broker will get all the ShardSpecs for all the segments in a time chunk, order them by partitionNum, generate the partition chunks, and check that (a) the first one is a start (based on "isStart"), (b) each subsequent one abuts the previous one (based on "abuts"), and (c) the last one is an end (based on "isEnd"). Some ShardSpecs use nonsensical-at-first-glance logic in these methods to short-circuit the completeness checks: time chunks with LinearShardSpecs are _always_ considered complete. Time chunks with NumberedShardSpecs can have "partitionNum" go beyond "partitions", and are considered complete if the first "partitions" number of segments are present.
> > >
> > > 2) "getDomainDimensions" and "possibleInDomain" are optional, but useful for powering broker-side segment pruning.
> > > 3) All segments in the same time chunk must have the same kind of ShardSpec. However, it can vary from time chunk to time chunk within a datasource.
> > >
> > > On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe <jja...@pinterest.com.invalid> wrote:
> > >
> > > > Hey all,
> > > >
> > > > Are there any major caveats or gotchas I should be aware of when implementing a new ShardSpec? The context here is that we have a datasource that is the combined result of multiple input jobs. We're trying to do write-side joining by having all of the jobs write segments for the same intervals (e.g. partitioning on both partition number and source pipeline). For now, I've modified the Spark-Druid batch ingestor (https://github.com/metamx/druid-spark-batch) to run in our various pipelines and to write out segments with identifiers of the form `dataSource_startInterval_endInterval_version_sourceName_partitionNum`. This is working without issue for loading, querying, and deleting data, but the metadata API reports the incorrect segment identifier, since it reconstructs the identifier instead of reading it from metadata (e.g. it reports segment identifiers of the form `dataSource_startInterval_endInterval_version_partitionNum`). Both because we'd like this to be fully supported, and because we imagine that this feature may be useful to others, I'd like to implement this via a ShardSpec.
> > > >
> > > > Julian
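The broker-side completeness check Gian describes in point (1) can be sketched roughly as follows. This is a hypothetical, simplified model: `Chunk`, its fields, and `isComplete` are illustrative stand-ins for Druid's actual `PartitionChunk`/`ShardSpec` interfaces, which carry more state than shown here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class CompletenessCheck {
    // Minimal stand-in for a partition chunk. In a LinearShardSpec-style
    // short circuit, a spec could effectively make isStart/isEnd/abuts
    // always succeed, so its time chunks are always "complete".
    static class Chunk {
        final int partitionNum;
        final boolean start; // plays the role of isStart()
        final boolean end;   // plays the role of isEnd()

        Chunk(int partitionNum, boolean start, boolean end) {
            this.partitionNum = partitionNum;
            this.start = start;
            this.end = end;
        }

        // Simplified abuts(): true if this chunk directly follows prev.
        boolean abuts(Chunk prev) {
            return this.partitionNum == prev.partitionNum + 1;
        }
    }

    // Mirrors the described check: sort by partitionNum, require the first
    // chunk to be a start, each later chunk to abut its predecessor, and
    // the last chunk to be an end.
    static boolean isComplete(List<Chunk> chunks) {
        if (chunks.isEmpty()) {
            return false;
        }
        List<Chunk> sorted = new ArrayList<>(chunks);
        sorted.sort(Comparator.comparingInt(c -> c.partitionNum));
        if (!sorted.get(0).start) {
            return false;
        }
        for (int i = 1; i < sorted.size(); i++) {
            if (!sorted.get(i).abuts(sorted.get(i - 1))) {
                return false;
            }
        }
        return sorted.get(sorted.size() - 1).end;
    }
}
```

A segment set with a gap in partition numbers fails the `abuts` step, so the broker would decline to query that time chunk until the missing segment arrives.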
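Julian's extended identifier form can be illustrated with a small sketch. `SourceSegmentId` is a hypothetical helper, not Druid's real segment-identifier class; it only shows where `sourceName` slots into the identifier relative to the standard `dataSource_startInterval_endInterval_version_partitionNum` form.

```java
public class SourceSegmentId {
    final String dataSource;
    final String startInterval;
    final String endInterval;
    final String version;
    final String sourceName;
    final int partitionNum;

    SourceSegmentId(String dataSource, String startInterval, String endInterval,
                    String version, String sourceName, int partitionNum) {
        this.dataSource = dataSource;
        this.startInterval = startInterval;
        this.endInterval = endInterval;
        this.version = version;
        this.sourceName = sourceName;
        this.partitionNum = partitionNum;
    }

    // Builds the extended form described in the thread:
    // dataSource_startInterval_endInterval_version_sourceName_partitionNum
    // (the standard form omits sourceName, which is why the metadata API's
    // reconstructed identifier disagrees with the stored one).
    String identifier() {
        return String.join("_", dataSource, startInterval, endInterval,
                version, sourceName, Integer.toString(partitionNum));
    }
}
```

A ShardSpec that carries `sourceName` as a first-class field would let the metadata API read the identifier back rather than reconstruct it without the source component.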