It doesn't look documented, but it should work: https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/common/task/HadoopIndexTask.java#L146

You can simply put the context as below:

"context" : {
  "druid.indexer.fork.property.druid.metadata.storage.tables.segments": "your_table_name"
}
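For readers following the workaround: the reason this context key can help (per the ForkingTaskRunner code linked later in this thread) is that context keys carrying the "druid.indexer.fork.property." prefix are stripped of that prefix and handed to the forked peon JVM as system properties. The sketch below is a rough paraphrase under that assumption, not the actual Druid source; the class and method names (ForkPropertySketch, peonJvmArgs) are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class ForkPropertySketch {
  // Prefix taken from this thread; task context keys that start with it are
  // forwarded (minus the prefix) to the forked child/peon process.
  private static final String PREFIX = "druid.indexer.fork.property.";

  static List<String> peonJvmArgs(Map<String, Object> taskContext) {
    List<String> args = new ArrayList<>();
    for (Map.Entry<String, Object> entry : taskContext.entrySet()) {
      if (entry.getKey().startsWith(PREFIX)) {
        // e.g. -Ddruid.metadata.storage.tables.segments=your_table_name
        args.add("-D" + entry.getKey().substring(PREFIX.length()) + "=" + entry.getValue());
      }
    }
    return args;
  }
}

With the context shown above, the indexer peon would effectively start with -Ddruid.metadata.storage.tables.segments=your_table_name.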
On Tue, Apr 16, 2019 at 8:29 PM Samarth Jain <samarth.j...@gmail.com> wrote:

I am not sure if there is a way to provide a task context for Hadoop-based ingestion. Will take a look into fixing this issue properly.

On Tue, Apr 16, 2019 at 12:54 PM Jihoon Son <ghoon...@gmail.com> wrote:

From the code here:
https://github.com/apache/incubator-druid/blob/master/indexing-service/src/main/java/org/apache/druid/indexing/overlord/ForkingTaskRunner.java#L340-L353
I think you can put "druid.indexer.fork.property.druid.metadata.storage.tables.segments" in the task context. I haven't tested it, though.

On Tue, Apr 16, 2019 at 11:05 AM Samarth Jain <samarth.j...@gmail.com> wrote:

You are right about the code path, Jihoon. Is there a way to provide the "druid.metadata.storage.tables" config in the ingestion spec itself? Clearly, providing it as part of the metadataUpdateSpec doesn't work. Or does it have to be in some properties file that is on the classpath of the indexer job?

On Mon, Apr 15, 2019 at 3:43 PM Jihoon Son <ghoon...@gmail.com> wrote:

It would be clearer if you had the full stack trace, but I guess the task failed at
https://github.com/apache/incubator-druid/blob/0.12.2/indexing-service/src/main/java/io/druid/indexing/common/task/HadoopIndexTask.java#L178
If the dataSource inputSpec is used, the Hadoop task lists the input segments stored in the metadata store, which ends up calling IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(). This method calls MetadataStorageTablesConfig.getSegmentsTable() to get the segments table name (https://github.com/apache/incubator-druid/blob/0.12.2/server/src/main/java/io/druid/metadata/IndexerSQLMetadataStorageCoordinator.java#L237).
So maybe the configuration is wrong, e.g. the "druid.metadata.storage.tables.segments" setting?
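A note on why the failure below complains about a table literally named druid_segments: assuming the usual table-name resolution in MetadataStorageTablesConfig (an explicit druid.metadata.storage.tables.segments setting wins, otherwise the name falls back to "<base>_segments" with the base defaulting to "druid"), a minimal sketch of that logic would look like this. SegmentsTableNameSketch and its method are illustrative names, not Druid's API.

class SegmentsTableNameSketch {
  private static final String DEFAULT_BASE = "druid";

  // Simplified resolution: explicit table name wins, otherwise the default
  // base plus the "_segments" suffix is used.
  static String segmentsTable(String explicitSegmentsTable, String base) {
    if (explicitSegmentsTable != null) {
      return explicitSegmentsTable;                              // e.g. "customprefix_segments"
    }
    return (base != null ? base : DEFAULT_BASE) + "_segments";   // -> "druid_segments"
  }
}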
On Mon, Apr 15, 2019 at 3:09 PM Samarth Jain <samarth.j...@gmail.com> wrote:

The updater job runs after the indexes are generated, right?

My job fails even before the mapper runs:

Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'druid.druid_segments' doesn't exist
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_92]
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_92]
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_92]
    at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_92]
    at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:939) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3878) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192) ~[mysql-connector-java-5.1.38.jar:5.1.38]
    at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
    at org.skife.jdbi.v2.SQLStatement.internalExecute(SQLStatement.java:1328) ~[jdbi-2.63.1.jar:2.63.1]
    at org.skife.jdbi.v2.Query.iterator(Query.java:240) ~[jdbi-2.63.1.jar:2.63.1]
    at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(IndexerSQLMetadataStorageCoordinator.java:250) ~[druid-server-0.12.2.jar:0.12.2]
On Mon, Apr 15, 2019 at 2:46 PM Jihoon Son <ghoon...@gmail.com> wrote:

Ah yeah, that sounds reasonable.

> We have been configuring the segments table for our Hadoop-based batch ingestion for a long time. I am unclear, though, how things have been working out so far even with this bug. Probably the lookup is done on both the pending_segments table and the segments table?

I guess it's because the getMetadataStorageTablesConfig() method is only used in CliInternalHadoopIndexer.
And now I'm not sure why your reingestion job failed. It should use the table specified in the spec:
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/MetadataStorageUpdaterJob.java#L46

Jihoon

On Mon, Apr 15, 2019 at 2:35 PM Samarth Jain <samarth.j...@gmail.com> wrote:

We have been configuring the segments table for our Hadoop-based batch ingestion for a long time. I am unclear, though, how things have been working out so far even with this bug. Probably the lookup is done on both the pending_segments table and the segments table?

As for the need behind this kind of job: because every Kafka indexer reads from a random set of partitions, it doesn't get perfect rollup. We tried running compaction, but we noticed that only one compaction task runs for the entire interval, which would take a long time. So, as a more scalable alternative, I was trying out reingesting the datasource, where a mapper is created for every segment file.

"ioConfig": {
  "inputSpec": {
    "type": "dataSource",
    "ingestionSpec": {
      "dataSource": "ds_name",
      "intervals": [
        "2019-03-28T22:00:00.000Z/2019-03-28T23:00:00.000Z"
      ]
    }
  }
},

On Mon, Apr 15, 2019 at 2:21 PM Jihoon Son <ghoon...@gmail.com> wrote:

Thank you for raising this!

FYI, the pending segments table is used only when appending segments, and the Hadoop task always overwrites.
I guess you're testing the "multi" inputSpec for compaction, but the Hadoop task will still read the entire set of input segments and overwrite them with new segments.

Jihoon

On Mon, Apr 15, 2019 at 2:12 PM Samarth Jain <samarth.j...@gmail.com> wrote:

Thanks for the reply, Jihoon. I am slightly worried about simply switching the parameter values as described above, since the two tables are used extensively in the code base. Will raise an issue.

On Mon, Apr 15, 2019 at 2:04 PM Jihoon Son <ghoon...@gmail.com> wrote:

Hi Samarth,

it definitely looks like a bug to me.
I'm not sure there's a workaround for this problem, though.
Jihoon

On Mon, Apr 15, 2019 at 1:39 PM Samarth Jain <samarth.j...@gmail.com> wrote:

Hi,

We are building out a realtime ingestion pipeline using the Kafka indexing service for Druid. In order to achieve better rollup, I was trying out the Hadoop-based reingestion job (http://druid.io/docs/latest/ingestion/update-existing-data.html), which basically uses the datasource itself as the input.

When I ran the job, it failed because it was trying to read segment metadata from the druid_segments table and not from the table, customprefix_segments, that I specified in the metadataUpdateSpec:

"metadataUpdateSpec": {
  "connectURI": "jdbc:mysql...",
  "password": "XXXXXXX",
  "segmentTable": "customprefix_segments",
  "type": "mysql",
  "user": "XXXXXXXX"
},

Looking at the code, I see that the segmentTable specified in the spec is actually passed in as the pending_segments table (the 3rd parameter is for pending_segments and the 4th parameter is for the segments table):
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92

As a result, the re-ingestion job tries to read from the default segments table named druid_segments, which isn't present.

Is this intentional or a bug?

Is there a way to configure the segments table name for this kind of re-ingestion job?
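To make the suspected parameter mix-up above easier to see, here is a hypothetical, heavily simplified two-argument stand-in for the real constructor (which takes many more arguments; see the MetadataStorageUpdaterJobSpec link above). The point is only that the spec's segmentTable value appears to land in the pending-segments slot while the segments slot stays unset, so reads fall back to the default druid_segments name. The class and method names below are made up for illustration.

class MetadataTablesSketch {
  final String pendingSegmentsTable;
  final String segmentsTable;

  MetadataTablesSketch(String pendingSegmentsTable, String segmentsTable) {
    this.pendingSegmentsTable = pendingSegmentsTable;
    this.segmentsTable = segmentsTable;
  }

  // What the updater job spec appears to do today (per the report above):
  // the configured segment table lands in the pending-segments slot.
  static MetadataTablesSketch suspectedCurrentBehavior(String segmentTableFromSpec) {
    return new MetadataTablesSketch(segmentTableFromSpec, null);
  }

  // What would be needed for "customprefix_segments" to be picked up on reads.
  static MetadataTablesSketch expectedBehavior(String segmentTableFromSpec) {
    return new MetadataTablesSketch(null, segmentTableFromSpec);
  }
}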