It doesn't look documented, but it should work: https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/task/HadoopIndexTask.java#L146

You can simply put the context as below:

"context" : {
  "druid.indexer.fork.property.druid.metadata.storage.tables.segments": "your_table_name"
}
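For readers following the workaround: the reason this context key can help (per the ForkingTaskRunner code linked later in this thread) is that context keys carrying the "druid.indexer.fork.property." prefix are stripped of that prefix and handed to the forked peon JVM as system properties. The sketch below is a rough paraphrase under that assumption, not the actual Druid source; the class and method names (ForkPropertySketch, peonJvmArgs) are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ForkPropertySketch {
  // Prefix taken from this thread; task context keys that start with it are
  // forwarded (minus the prefix) to the forked child/peon process.
  private static final String PREFIX = "druid.indexer.fork.property.";

  static List<String> peonJvmArgs(Map<String, Object> taskContext) {
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, Object> entry : taskContext.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        // e.g. -Ddruid.metadata.storage.tables.segments=your_table_name
        args.add("-D" + entry.getKey().substring(PREFIX.length()) + "=" + entry.getValue());
      }
    }
    return args;
  }
}

With the context shown above, the indexer peon would effectively start with -Ddruid.metadata.storage.tables.segments=your_table_name.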
On Tue, Apr 16, 2019 at 8:29 PM Samarth Jain <samarth.j...@gmail.com> wrote:

I am not sure if there is a way to provide a task context for Hadoop-based ingestion. Will take a look into fixing this issue properly.

On Tue, Apr 16, 2019 at 12:54 PM Jihoon Son <ghoon...@gmail.com> wrote:

From the code here:
https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/overlord/ForkingTaskRunner.java#L340-L353
I think you can put "druid.indexer.fork.property.druid.metadata.storage.tables.segments" in the task context. I haven't tested it, though.

On Tue, Apr 16, 2019 at 11:05 AM Samarth Jain <samarth.j...@gmail.com> wrote:

You are right about the code path, Jihoon. Is there a way to provide the "druid.metadata.storage.tables" config in the ingestion spec itself? Clearly, providing it as part of the metadataUpdateSpec doesn't work. Or does it have to be in some properties file that is on the classpath of the indexer job?

On Mon, Apr 15, 2019 at 3:43 PM Jihoon Son <ghoon...@gmail.com> wrote:

It would be clearer if you had the full stack trace, but I guess the task failed at
https://github.com/apache/incubator-druid/blob/0.12.2/indexing-service/src/main/java/io/druid/indexing/common/task/HadoopIndexTask.java#L178
If the dataSource inputSpec is used, the Hadoop task lists the input segments stored in the metadata store, which ends up calling IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(). This method calls MetadataStorageTablesConfig.getSegmentsTable() to get the segments table name (https://github.com/apache/incubator-druid/blob/0.12.2/server/src/main/java/io/druid/metadata/IndexerSQLMetadataStorageCoordinator.java#L237).
So maybe the configuration is wrong, e.g. the "druid.metadata.storage.tables.segments" setting?
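A note on why the failure below complains about a table literally named druid_segments: assuming the usual table-name resolution in MetadataStorageTablesConfig (an explicit druid.metadata.storage.tables.segments setting wins, otherwise the name falls back to "<base>_segments" with the base defaulting to "druid"), a minimal sketch of that logic would look like this. SegmentsTableNameSketch and its method are illustrative names, not Druid's API.

class SegmentsTableNameSketch {
  private static final String DEFAULT_BASE = "druid";

  // Simplified resolution: explicit table name wins, otherwise the default
  // base plus the "_segments" suffix is used.
  static String segmentsTable(String explicitSegmentsTable, String base) {
    if (explicitSegmentsTable != null) {
      return explicitSegmentsTable;                              // e.g. "customprefix_segments"
    }
    return (base != null ? base : DEFAULT_BASE) + "_segments";   // -> "druid_segments"
  }
}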
On Mon, Apr 15, 2019 at 3:09 PM Samarth Jain <samarth.j...@gmail.com> wrote:

The updater job runs after the indexes are generated, right?

My job fails even before the mapper runs:

Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'druid.druid_segments' doesn't exist
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_92]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_92]
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_92]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_92]
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:939) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3878) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.skife.jdbi.v2.SQLStatement.internalExecute(SQLStatement.java:1328) ~[jdbi-2.63.1.jar:2.63.1]
    at org.skife.jdbi.v2.Query.iterator(Query.java:240) ~[jdbi-2.63.1.jar:2.63.1]
    at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(IndexerSQLMetadataStorageCoordinator.java:250) ~[druid-server-0.12.2.jar:0.12.2]
On Mon, Apr 15, 2019 at 2:46 PM Jihoon Son <ghoon...@gmail.com> wrote:

Ah yeah, that sounds reasonable.

> We have been configuring the segments table for our Hadoop-based batch ingestion for a long time. I am unclear, though, how things have been working out so far even with this bug. Probably the lookup is done on both the pending_segments table and the segments table?

I guess it's because the getMetadataStorageTablesConfig() method is only used in CliInternalHadoopIndexer.
And now I'm not sure why your reingestion job failed. It should use the table specified in the spec:
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/MetadataStorageUpdaterJob.java#L46

Jihoon

On Mon, Apr 15, 2019 at 2:35 PM Samarth Jain <samarth.j...@gmail.com> wrote:

We have been configuring the segments table for our Hadoop-based batch ingestion for a long time. I am unclear, though, how things have been working out so far even with this bug. Probably the lookup is done on both the pending_segments table and the segments table?

As for the need behind this kind of job: because every Kafka indexer reads from a random set of partitions, it doesn't get perfect rollup. We tried running compaction, but we noticed that only one compaction task runs for the entire interval, which would take a long time. So, as a more scalable alternative, I was trying out reingesting the datasource, where a mapper is created for every segment file.

"ioConfig": {
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "ds_name",
      "intervals": [
        "2019-03-28T22:00:00.000Z/2019-03-28T23:00:00.000Z"
      ]
    }
  }
},

On Mon, Apr 15, 2019 at 2:21 PM Jihoon Son <ghoon...@gmail.com> wrote:

Thank you for raising this!

FYI, the pending segments table is used only when appending segments, and the Hadoop task always overwrites.
I guess you're testing the "multi" inputSpec for compaction, but the Hadoop task will still read the entire set of input segments and overwrite them with new segments.

Jihoon

On Mon, Apr 15, 2019 at 2:12 PM Samarth Jain <samarth.j...@gmail.com> wrote:

Thanks for the reply, Jihoon. I am slightly worried about simply switching the parameter values as described above, since the two tables are used extensively in the code base. Will raise an issue.

On Mon, Apr 15, 2019 at 2:04 PM Jihoon Son <ghoon...@gmail.com> wrote:

Hi Samarth,

it definitely looks like a bug to me.
I'm not sure there's a workaround for this problem, though.
Jihoon

On Mon, Apr 15, 2019 at 1:39 PM Samarth Jain <samarth.j...@gmail.com> wrote:

Hi,

We are building out a realtime ingestion pipeline using the Kafka indexing service for Druid. In order to achieve better rollup, I was trying out the Hadoop-based reingestion job (http://druid.io/docs/latest/ingestion/update-existing-data.html), which basically uses the datasource itself as the input.

When I ran the job, it failed because it was trying to read segment metadata from the druid_segments table and not from the table, customprefix_segments, that I specified in the metadataUpdateSpec:

"metadataUpdateSpec": {
  "connectURI": "jdbc:mysql...",
  "password": "XXXXXXX",
  "segmentTable": "customprefix_segments",
  "type": "mysql",
  "user": "XXXXXXXX"
},

Looking at the code, I see that the segmentTable specified in the spec is actually passed in as the pending_segments table (the 3rd parameter is for pending_segments and the 4th parameter is for the segments table):
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92

As a result, the re-ingestion job tries to read from the default segments table named druid_segments, which isn't present.

Is this intentional or a bug?

Is there a way to configure the segments table name for this kind of re-ingestion job?
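To make the suspected parameter mix-up above easier to see, here is a hypothetical, heavily simplified two-argument stand-in for the real constructor (which takes many more arguments; see the MetadataStorageUpdaterJobSpec link above). The point is only that the spec's segmentTable value appears to land in the pending-segments slot while the segments slot stays unset, so reads fall back to the default druid_segments name. The class and method names below are made up for illustration.

class MetadataTablesSketch {
  final String pendingSegmentsTable;
  final String segmentsTable;

  MetadataTablesSketch(String pendingSegmentsTable, String segmentsTable) {
    this.pendingSegmentsTable = pendingSegmentsTable;
    this.segmentsTable = segmentsTable;
  }

  // What the updater job spec appears to do today (per the report above):
  // the configured segment table lands in the pending-segments slot.
  static MetadataTablesSketch suspectedCurrentBehavior(String segmentTableFromSpec) {
    return new MetadataTablesSketch(segmentTableFromSpec, null);
  }

  // What would be needed for "customprefix_segments" to be picked up on reads.
  static MetadataTablesSketch expectedBehavior(String segmentTableFromSpec) {
    return new MetadataTablesSketch(null, segmentTableFromSpec);
  }
}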