Hey all,

Are there any major caveats or gotchas I should be aware of when
implementing a new ShardSpec? The context here is that we have a datasource
that is the combined result of multiple input jobs. We're trying to do
write-side joining by having all of the jobs write segments for the same
intervals (e.g. partitioning on both partition number and source pipeline).
For now, I've modified the Spark-Druid batch ingestor (
https://github.com/metamx/druid-spark-batch) to run in our various
pipelines and to write out segments with identifiers of the form
`dataSource_startInterval_endInterval_version_sourceName_partitionNum`. This
is working without issue for loading, querying, and deleting data, but the
metadata API reports an incorrect segment identifier, since it
reconstructs the identifier instead of reading it from the metadata store
(e.g. it reports segment identifiers of the form
`dataSource_startInterval_endInterval_version_partitionNum`). Both because
we'd like this to be fully supported and because we imagine this feature
may be useful to others, I'd like to implement it properly via a custom
ShardSpec.
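
For concreteness, here's roughly what I have in mind. This is only a
sketch: the class name SourcePartitionedShardSpec, the "source" field, and
the "sourcePartitioned" type name are placeholders I made up, and the exact
ShardSpec interface differs a bit between Druid versions. It follows the
same pattern as HashBasedNumberedShardSpec by extending NumberedShardSpec:

// Sketch only -- all names here are placeholders.
package com.example.druid.shardspec;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import io.druid.timeline.partition.NumberedShardSpec;

// Extends NumberedShardSpec so all of the existing partition-number
// behavior is inherited; the only addition is the source pipeline name.
@JsonTypeName("sourcePartitioned")
public class SourcePartitionedShardSpec extends NumberedShardSpec
{
  private final String source;

  @JsonCreator
  public SourcePartitionedShardSpec(
      @JsonProperty("partitionNum") int partitionNum,
      @JsonProperty("partitions") int partitions,
      @JsonProperty("source") String source
  )
  {
    super(partitionNum, partitions);
    this.source = source;
  }

  @JsonProperty("source")
  public String getSource()
  {
    return source;
  }
}

and then registering the new type with Jackson through the usual
DruidModule extension hook, so segment metadata containing it can be
round-tripped:

// Separate file in the same extension.
package com.example.druid.shardspec;

import com.fasterxml.jackson.databind.Module;
import com.fasterxml.jackson.databind.jsontype.NamedType;
import com.fasterxml.jackson.databind.module.SimpleModule;
import com.google.common.collect.ImmutableList;
import com.google.inject.Binder;
import io.druid.initialization.DruidModule;
import java.util.List;

public class SourceShardSpecModule implements DruidModule
{
  @Override
  public List<? extends Module> getJacksonModules()
  {
    return ImmutableList.of(
        new SimpleModule("SourceShardSpecModule").registerSubtypes(
            new NamedType(SourcePartitionedShardSpec.class, "sourcePartitioned")
        )
    );
  }

  @Override
  public void configure(Binder binder)
  {
    // nothing to bind
  }
}

The part I'm least sure about is whether the segment identifier
construction can pick the extra token up from the ShardSpec itself, rather
than from my patched ingestor, so pointers there would be especially
welcome.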

Julian
