Hi,

We are building out a realtime ingestion pipeline using the Kafka indexing
service for Druid. To achieve better rollup, I was trying out the
Hadoop-based re-ingestion job
http://druid.io/docs/latest/ingestion/update-existing-data.html, which
basically uses the datasource itself as the input.
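
For reference, the ioConfig points the inputSpec back at the existing
datasource, roughly like this (the datasource name and interval are just
placeholders from my setup):

"ioConfig": {
      "type": "hadoop",
      "inputSpec": {
            "type": "dataSource",
            "ingestionSpec": {
                  "dataSource": "my_datasource",
                  "intervals": ["2018-01-01/2018-01-02"]
            }
      }
},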

When I ran the job, it failed because it was trying to read segment
metadata from the druid_segments table instead of the table I specified in
the metadataUpdateSpec, customprefix_segments.

"metadataUpdateSpec": {
      "connectURI": "jdbc:mysql...",
      "password": "XXXXXXX",
      "segmentTable": "customprefix_segments",
      "type": "mysql",
      "user": "XXXXXXXX"
},

Looking at the code, I see that the segmentTable specified in the spec is
actually passed in as the pending_segments table (the 3rd constructor
parameter is the pending_segments table and the 4th is the segments table):
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92
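
To illustrate what I believe is happening, here is a trimmed-down stand-in
(not the actual Druid classes) showing how passing segmentTable into the
pendingSegments slot leaves the segments table name to fall back to the
default:

// Illustration only -- a simplified stand-in for the real tables config.
class TablesConfig
{
  final String pendingSegmentsTable;
  final String segmentsTable;

  TablesConfig(String pendingSegmentsTable, String segmentsTable)
  {
    this.pendingSegmentsTable = pendingSegmentsTable;
    // When no segments table is given, the default druid_segments name is used.
    this.segmentsTable = segmentsTable != null ? segmentsTable : "druid_segments";
  }
}

public class Repro
{
  public static void main(String[] args)
  {
    String segmentTable = "customprefix_segments"; // from the metadataUpdateSpec
    // segmentTable lands in the pendingSegments slot; the segments slot stays null:
    TablesConfig config = new TablesConfig(segmentTable, null);
    System.out.println(config.segmentsTable); // prints druid_segments
  }
}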

As a result, the re-ingestion job tries to read from the default segments
table, druid_segments, which isn't present in my metadata store.

Is this intentional or a bug?

Is there a way to configure the segments table name for this kind of
re-ingestion job?
