Hey Julian,

There aren't any gotchas that I can think of other than the fact that they
are not super well documented, and you might miss some features if you're
just skimming the code. A couple of points that might matter:

1) PartitionChunk is what allows a shard spec to contribute to the
definition of whether a set of segments for a time chunk is "complete".
It's an important concept since the broker will not query segment sets
unless the chunk is complete. The way the completeness check works is
basically: the broker gets all the ShardSpecs for all the segments in a
time chunk, orders them by partitionNum, generates the partition chunks,
and checks that (a) the first one is a starter (based on "isStart"),
(b) each subsequent one abuts the previous one (based on "abuts"), and
(c) the last one is an ender (based on "isEnd"). Some ShardSpecs use
nonsensical-at-first-glance logic in these methods to short-circuit the
completeness check: time chunks with LinearShardSpecs are _always_
considered complete, and time chunks with NumberedShardSpecs can have
"partitionNum" go beyond "partitions" and are considered complete as long
as the first "partitions" segments are present.
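To make that scan concrete, here's a toy sketch. Chunk and NumberedChunk
are made-up stand-ins, not Druid's actual PartitionChunk classes, and the
isEnd logic here is simplified (it doesn't model the "partitionNum beyond
partitions" behavior described above):

```java
import java.util.Comparator;
import java.util.List;

// Stand-in for PartitionChunk, reduced to the methods the check relies on.
interface Chunk {
    boolean isStart();
    boolean isEnd();
    boolean abuts(Chunk prev);
    int partitionNum();
}

// Toy chunk over a known total partition count, loosely in the spirit of
// a NumberedShardSpec layout (simplified).
record NumberedChunk(int partitionNum, int partitions) implements Chunk {
    public boolean isStart() { return partitionNum == 0; }
    public boolean isEnd() { return partitionNum == partitions - 1; }
    public boolean abuts(Chunk prev) { return prev.partitionNum() == partitionNum - 1; }
}

public class CompletenessCheck {
    // Sort by partitionNum, then verify start -> abutting middles -> end.
    static boolean isComplete(List<Chunk> chunks) {
        if (chunks.isEmpty()) {
            return false;
        }
        List<Chunk> sorted = chunks.stream()
            .sorted(Comparator.comparingInt(Chunk::partitionNum))
            .toList();
        if (!sorted.get(0).isStart()) {
            return false;
        }
        for (int i = 1; i < sorted.size(); i++) {
            if (!sorted.get(i).abuts(sorted.get(i - 1))) {
                return false;
            }
        }
        return sorted.get(sorted.size() - 1).isEnd();
    }

    public static void main(String[] args) {
        System.out.println(isComplete(List.of(
            new NumberedChunk(0, 3), new NumberedChunk(1, 3), new NumberedChunk(2, 3))));
    }
}
```

The real logic lives in the ShardSpec's createChunk / PartitionChunk
implementations, so this is just the shape of the broker-side scan.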

2) "getDomainDimensions" and "possibleInDomain" are optional, but useful
for powering broker-side segment pruning.
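For a sense of what those two methods buy you in your use case, here's a
minimal illustration of the contract. SourceNameShard is hypothetical (a
spec keyed on your "sourceName" dimension), and I'm using plain Sets to
keep the sketch self-contained where Druid's real signature works with
Guava RangeSets:

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical shard where every row in the segment shares one value
// for the "sourceName" dimension.
public class SourceNameShard {
    private final String sourceName;

    public SourceNameShard(String sourceName) {
        this.sourceName = sourceName;
    }

    // Dimensions this shard can prune on.
    public List<String> getDomainDimensions() {
        return List.of("sourceName");
    }

    // The broker passes the filter's value domain per dimension; returning
    // false lets it skip this segment entirely.
    public boolean possibleInDomain(Map<String, Set<String>> domain) {
        Set<String> values = domain.get("sourceName");
        return values == null || values.contains(sourceName);
    }
}
```

So a query filtered to sourceName = "pipelineA" would only touch the
segments written by that pipeline.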

3) All segments in the same time chunk must have the same kind of
ShardSpec. However, it can vary from time chunk to time chunk within a
datasource.

On Mon, Jan 7, 2019 at 2:56 PM Julian Jaffe <jja...@pinterest.com.invalid>
wrote:

> Hey all,
>
> Are there any major caveats or gotchas I should be aware of when
> implementing a new ShardSpec? The context here is that we have a datasource
> that is the combined result of multiple input jobs. We're trying to do
> write-side joining by having all of the jobs write segments for the same
> intervals (e.g. partitioning on both partition number and source pipeline).
> For now, I've modified the Spark-Druid batch ingestor (
> https://github.com/metamx/druid-spark-batch) to run in our various
> pipelines and to write out segments with identifier form
> `dataSource_startInterval_endInterval_version_sourceName_partitionNum`. This
> is working without issue for loading, querying, and deleting data, but the
> metadata API reports the incorrect segment identifier, since it
> reconstructs the identifier instead of reading from metadata (e.g. it
> reports segment identifiers of the form
> `dataSource_startInterval_endInterval_version_partitionNum`). Both because
> we'd like this to be fully supported, and because we imagine that this
> feature may be useful to others, I'd like to implement this via a
> ShardSpec.
>
> Julian
>
