Thanks for the detailed pointers, Gian! In light of the ongoing discussion around on-list development, does this seem like something that's worthwhile to anyone else in the community?
On Tue, Jan 8, 2019 at 10:32 AM Gian Merlino <g...@apache.org> wrote:

> Hey Julian,
>
> There aren't any gotchas that I can think of other than the fact that they
> are not super well documented, and you might miss some features if you're
> just skimming the code. A couple of points that might matter:
>
> 1) PartitionChunk is what allows a shard spec to contribute to the
> definition of whether a set of segments for a time chunk is "complete".
> It's an important concept, since the broker will not query segment sets
> unless the chunk is complete. The way the completeness check works is
> basically that the broker will get all the ShardSpecs for all the segments
> in a time chunk, order them by partitionNum, generate the partition
> chunks, and check that (a) the first one is a starter [based on "isStart"],
> and (b) each subsequent one, until the end [based on "isEnd"], abuts the
> previous one [based on "abuts"]. Some ShardSpecs use
> nonsensical-at-first-glance logic in these methods to short-circuit the
> completeness checks: time chunks with LinearShardSpecs are _always_
> considered complete, and time chunks with NumberedShardSpecs can have
> "partitionNum" go beyond "partitions", and are considered complete if the
> first "partitions" number of segments are present.
>
> 2) "getDomainDimensions" and "possibleInDomain" are optional, but useful
> for powering broker-side segment pruning.
>
> 3) All segments in the same time chunk must have the same kind of
> ShardSpec. However, it can vary from time chunk to time chunk within a
> datasource.
>
> On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe <jja...@pinterest.com.invalid> wrote:
>
> > Hey all,
> >
> > Are there any major caveats or gotchas I should be aware of when
> > implementing a new ShardSpec? The context here is that we have a
> > datasource that is the combined result of multiple input jobs. We're
> > trying to do write-side joining by having all of the jobs write segments
> > for the same intervals (e.g. partitioning on both partition number and
> > source pipeline). For now, I've modified the Spark-Druid batch ingestor
> > (https://github.com/metamx/druid-spark-batch) to run in our various
> > pipelines and to write out segments with identifiers of the form
> > `dataSource_startInterval_endInterval_version_sourceName_partitionNum`.
> > This is working without issue for loading, querying, and deleting data,
> > but the metadata API reports the incorrect segment identifier, since it
> > reconstructs the identifier instead of reading it from metadata (e.g. it
> > reports segment identifiers of the form
> > `dataSource_startInterval_endInterval_version_partitionNum`). Both
> > because we'd like this to be fully supported, and because we imagine
> > that this feature may be useful to others, I'd like to implement this
> > via a ShardSpec.
> >
> > Julian
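For anyone following the thread who hasn't read the broker code: the completeness check Gian describes in point (1) can be sketched roughly as below. This is a hedged, self-contained simplification, not Druid's actual API: the `Chunk` interface, `LinearChunk`, and `NumberedChunk` are hypothetical stand-ins for `PartitionChunk` and the chunks produced by `LinearShardSpec`/`NumberedShardSpec` (in particular, this sketch doesn't model `partitionNum` going beyond `partitions`).

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical, simplified stand-in for Druid's PartitionChunk.
interface Chunk {
    int partitionNum();
    boolean isStart();         // may this chunk be the first of a complete set?
    boolean isEnd();           // may this chunk be the last of a complete set?
    boolean abuts(Chunk prev); // does this chunk directly follow prev?
}

public class CompletenessCheck {
    // Mirrors the broker-side check described above: order chunks by
    // partitionNum, require the first to be a starter, each subsequent one
    // to abut its predecessor, and the last to be an ender.
    static boolean isComplete(List<Chunk> chunks) {
        if (chunks.isEmpty()) {
            return false;
        }
        List<Chunk> sorted = chunks.stream()
            .sorted(Comparator.comparingInt(Chunk::partitionNum))
            .toList();
        if (!sorted.get(0).isStart()) {
            return false;
        }
        for (int i = 1; i < sorted.size(); i++) {
            if (!sorted.get(i).abuts(sorted.get(i - 1))) {
                return false;
            }
        }
        return sorted.get(sorted.size() - 1).isEnd();
    }

    // "Linear"-style chunk: short-circuits the check, so any set is complete.
    record LinearChunk(int partitionNum) implements Chunk {
        public boolean isStart() { return true; }
        public boolean isEnd() { return true; }
        public boolean abuts(Chunk prev) { return true; }
    }

    // "Numbered"-style chunk: complete once partitions 0..total-1 are present.
    record NumberedChunk(int partitionNum, int total) implements Chunk {
        public boolean isStart() { return partitionNum == 0; }
        public boolean isEnd() { return partitionNum == total - 1; }
        public boolean abuts(Chunk prev) { return partitionNum == prev.partitionNum() + 1; }
    }

    public static void main(String[] args) {
        // Linear chunks are always considered complete.
        System.out.println(isComplete(List.of(new LinearChunk(3), new LinearChunk(7))));  // true
        // Numbered chunks 0 and 2 of 3: chunk 1 is missing, so incomplete.
        System.out.println(isComplete(List.of(new NumberedChunk(0, 3), new NumberedChunk(2, 3))));  // false
        // Numbered chunks 0, 1, 2 of 3 (unordered input): complete.
        System.out.println(isComplete(
            List.of(new NumberedChunk(1, 3), new NumberedChunk(0, 3), new NumberedChunk(2, 3))));  // true
    }
}
```

A custom ShardSpec like the one Julian describes would mainly need to decide how its chunks answer these three questions so that a time chunk containing segments from every source pipeline is judged complete.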