Ah yeah, it sounds reasonable.

> We have been configuring the segments table for our hadoop based batch
> ingestion for a long time. I am unclear though how things have been working
> out so far even with this bug. Probably the look up is done both on
> pending_segments table and segments table?
I guess it's because the getMetadataStorageTablesConfig() method is only
being used in CliInternalHadoopIndexer. And now I'm not sure why your
reingestion job failed... It should use the table specified in the spec:
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/MetadataStorageUpdaterJob.java#L46

Jihoon

On Mon, Apr 15, 2019 at 2:35 PM Samarth Jain <samarth.j...@gmail.com> wrote:

> We have been configuring the segments table for our hadoop based batch
> ingestion for a long time. I am unclear though how things have been working
> out so far even with this bug. Probably the look up is done both on
> pending_segments table and segments table?
>
> As for the need behind this kind of job - because every Kafka indexer reads
> from a random set of partitions, it doesn't get the perfect rollup. We
> tried running compaction, but we noticed that only one compaction task runs
> for the entire interval, which would take a long time.
> So, as a more scalable alternative, I was trying out reingesting the
> datasource, where a mapper is created for every segment file.
>
> "ioConfig": {
>   "inputSpec": {
>     "type": "dataSource",
>     "ingestionSpec": {
>       "dataSource": "ds_name",
>       "intervals": [
>         "2019-03-28T22:00:00.000Z/2019-03-28T23:00:00.000Z"
>       ]
>     }
>   },
>
> On Mon, Apr 15, 2019 at 2:21 PM Jihoon Son <ghoon...@gmail.com> wrote:
>
> > Thank you for raising!
> >
> > FYI, the pending segments table is used only when appending segments, and
> > the Hadoop task always overwrites.
> > I guess you're testing the "multi" inputSpec for compaction, but the Hadoop
> > task will still read the entire input segments and overwrite them with new
> > segments.
> >
> > Jihoon
> >
> > On Mon, Apr 15, 2019 at 2:12 PM Samarth Jain <samarth.j...@gmail.com>
> > wrote:
> >
> > > Thanks for the reply, Jihoon. I am slightly worried about simply
> > > switching the parameter values as described above since the two tables
> > > are used extensively in the code base. Will raise an issue.
> > >
> > > On Mon, Apr 15, 2019 at 2:04 PM Jihoon Son <ghoon...@gmail.com> wrote:
> > >
> > > > Hi Samarth,
> > > >
> > > > It definitely looks like a bug to me.
> > > > I'm not sure there's a workaround for this problem though.
> > > >
> > > > Jihoon
> > > >
> > > > On Mon, Apr 15, 2019 at 1:39 PM Samarth Jain <samarth.j...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are building out a realtime ingestion pipeline using the Kafka
> > > > > Indexing Service for Druid. In order to achieve better rollup, I was
> > > > > trying out the hadoop based reingestion job
> > > > > http://druid.io/docs/latest/ingestion/update-existing-data.html
> > > > > which basically uses the datasource itself as the input.
> > > > >
> > > > > When I ran the job, it failed because it was trying to read segment
> > > > > metadata from the druid_segments table and not from the table,
> > > > > customprefix_segments, that I specified in the metadataUpdateSpec.
> > > > >
> > > > > "metadataUpdateSpec": {
> > > > >   "connectURI": "jdbc:mysql...",
> > > > >   "password": "XXXXXXX",
> > > > >   "segmentTable": "customprefix_segments",
> > > > >   "type": "mysql",
> > > > >   "user": "XXXXXXXX"
> > > > > },
> > > > >
> > > > > Looking at the code, I see that the segmentTable specified in the
> > > > > spec is actually passed in as pending_segments table (3rd param is
> > > > > for pending_segments and 4th param is for segments table)
> > > > >
> > > > > https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92
> > > > >
> > > > > As a result, the re-ingestion job tries to read from the default
> > > > > segments table named DRUID_SEGMENTS which isn't present.
> > > > >
> > > > > Is this intentional or a bug?
> > > > >
> > > > > Is there a way to configure the segments table name for this kind of
> > > > > re-ingestion job?
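For readers who want to see the shape of the parameter mix-up discussed above,
here is a minimal, hypothetical Java sketch. The classes and constructor below
are simplified stand-ins and are not the actual MetadataStorageTablesConfig or
MetadataStorageUpdaterJobSpec signatures in Druid; only the "3rd param vs. 4th
param" ordering mirrors what the thread reports.

// Simplified stand-in for the tables config; the argument order mirrors the
// 3rd (pendingSegments) vs 4th (segments) positions described in the thread.
class TablesConfigSketch
{
  final String base;
  final String dataSourceTable;
  final String pendingSegmentsTable; // 3rd parameter
  final String segmentsTable;        // 4th parameter

  TablesConfigSketch(
      String base,
      String dataSourceTable,
      String pendingSegmentsTable,
      String segmentsTable
  )
  {
    this.base = base;
    this.dataSourceTable = dataSourceTable;
    this.pendingSegmentsTable = pendingSegmentsTable;
    this.segmentsTable = segmentsTable;
  }
}

class UpdaterJobSpecSketch
{
  // Value taken from the job's metadataUpdateSpec.
  private final String segmentTable = "customprefix_segments";

  // Behavior reported above: the configured table is passed in the
  // pendingSegments position and nothing is passed for the segments table,
  // so (in the real code) the job falls back to the default druid_segments
  // table, which does not exist in this deployment.
  TablesConfigSketch buggyMapping()
  {
    return new TablesConfigSketch(null, null, segmentTable, null);
  }

  // What the thread suggests it should be: the configured table passed in
  // the segments-table (4th) position instead.
  TablesConfigSketch intendedMapping()
  {
    return new TablesConfigSketch(null, null, null, segmentTable);
  }
}

As noted in the replies, a real fix would have to be checked against every
place the pending_segments and segments tables are read, so treat this only as
an illustration of the argument ordering.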