You are right about the code path, Jihoon. Is there a way to provide the
"druid.metadata.storage.tables" config in the ingestion spec itself? Clearly,
providing it as part of the metadataUpdateSpec doesn't work.
Or does it have to be in some properties file that is on the classpath of the
indexer job?
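
In case it helps clarify what I'm asking, this is roughly what I'd expect to
put in a properties file visible to the indexer (the customprefix values are
just placeholders for our actual prefix, and I haven't verified that the
Hadoop task actually picks these up):

    # common.runtime.properties on the indexer's classpath
    druid.metadata.storage.type=mysql
    druid.metadata.storage.connector.connectURI=jdbc:mysql...
    # either override the common table prefix...
    druid.metadata.storage.tables.base=customprefix
    # ...or just the segments table name
    druid.metadata.storage.tables.segments=customprefix_segments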

On Mon, Apr 15, 2019 at 3:43 PM Jihoon Son <ghoon...@gmail.com> wrote:

> It would be clearer if you had the full stack trace, but I guess the task
> failed at
> https://github.com/apache/incubator-druid/blob/0.12.2/indexing-service/src/main/java/io/druid/indexing/common/task/HadoopIndexTask.java#L178
> If the dataSource inputSpec is used, the Hadoop task lists the input segments
> stored in the metadata store, which ends up calling
> IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle().
> This method calls MetadataStorageTablesConfig.getSegmentsTable() to get the
> segments table name:
> https://github.com/apache/incubator-druid/blob/0.12.2/server/src/main/java/io/druid/metadata/IndexerSQLMetadataStorageCoordinator.java#L237
> So, maybe the configuration is wrong, like
> "druid.metadata.storage.tables.segments"?
>
> On Mon, Apr 15, 2019 at 3:09 PM Samarth Jain <samarth.j...@gmail.com>
> wrote:
>
> > The updater job runs after the indexes are generated, right?
> >
> > My job fails even before the mapper runs.
> >
> > Caused by: com.mysql.jdbc.exceptions.jdbc4.MySQLSyntaxErrorException: Table 'druid.druid_segments' doesn't exist
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) ~[?:1.8.0_92]
> >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) ~[?:1.8.0_92]
> >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) ~[?:1.8.0_92]
> >         at java.lang.reflect.Constructor.newInstance(Constructor.java:423) ~[?:1.8.0_92]
> >         at com.mysql.jdbc.Util.handleNewInstance(Util.java:404) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.Util.getInstance(Util.java:387) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.SQLError.createSQLException(SQLError.java:939) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3878) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.MysqlIO.checkErrorPacket(MysqlIO.java:3814) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2478) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2625) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2551) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.PreparedStatement.executeInternal(PreparedStatement.java:1861) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at com.mysql.jdbc.PreparedStatement.execute(PreparedStatement.java:1192) ~[mysql-connector-java-5.1.38.jar:5.1.38]
> >         at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
> >         at org.apache.commons.dbcp2.DelegatingPreparedStatement.execute(DelegatingPreparedStatement.java:198) ~[commons-dbcp2-2.0.1.jar:2.0.1]
> >         at org.skife.jdbi.v2.SQLStatement.internalExecute(SQLStatement.java:1328) ~[jdbi-2.63.1.jar:2.63.1]
> >         at org.skife.jdbi.v2.Query.iterator(Query.java:240) ~[jdbi-2.63.1.jar:2.63.1]
> >         at io.druid.metadata.IndexerSQLMetadataStorageCoordinator.getTimelineForIntervalsWithHandle(IndexerSQLMetadataStorageCoordinator.java:250) ~[druid-server-0.12.2.jar:0.12.2]
> >
> >
> >
> >
> > On Mon, Apr 15, 2019 at 2:46 PM Jihoon Son <ghoon...@gmail.com> wrote:
> >
> > > Ah yeah, it sounds reasonable.
> > >
> > > > We have been configuring the segments table for our hadoop based batch
> > > > ingestion for a long time. I am unclear though how things have been
> > > > working out so far even with this bug. Probably the lookup is done on
> > > > both the pending_segments table and the segments table?
> > >
> > > I guess it's because the getMetadataStorageTablesConfig() method is only
> > > used in CliInternalHadoopIndexer.
> > > And now I'm not sure why your reingestion job failed... It should use the
> > > table specified in the spec:
> > > https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/MetadataStorageUpdaterJob.java#L46
> > >
> > > Jihoon
> > >
> > > On Mon, Apr 15, 2019 at 2:35 PM Samarth Jain <samarth.j...@gmail.com>
> > > wrote:
> > >
> > > > We have been configuring the segments table for our hadoop based batch
> > > > ingestion for a long time. I am unclear though how things have been
> > > > working out so far even with this bug. Probably the lookup is done on
> > > > both the pending_segments table and the segments table?
> > > >
> > > > As for the need behind this kind of job - because every Kafka indexer
> > > > reads from a random set of partitions, it doesn't get perfect rollup.
> > > > We tried running compaction, but we noticed that only one compaction
> > > > task runs for the entire interval, which would take a long time.
> > > > So, as a more scalable alternative, I was trying out reingesting the
> > > > datasource, where a mapper is created for every segment file.
> > > >
> > > > "ioConfig": {
> > > >     "inputSpec": {
> > > >       "type": "dataSource",
> > > >       "ingestionSpec": {
> > > >         "dataSource": "ds_name",
> > > >         "intervals": [
> > > >           "2019-03-28T22:00:00.000Z/2019-03-28T23:00:00.000Z"
> > > >         ]
> > > >       }
> > > >     },
> > > >
> > > >
> > > >
> > > > On Mon, Apr 15, 2019 at 2:21 PM Jihoon Son <ghoon...@gmail.com>
> > > > wrote:
> > > >
> > > > > Thank you for raising this!
> > > > >
> > > > > FYI, the pending segments table is used only when appending segments,
> > > > > and the Hadoop task always overwrites.
> > > > > I guess you're testing the "multi" inputSpec for compaction, but the
> > > > > Hadoop task will still read the entire input segments and overwrite
> > > > > them with new segments.
> > > > >
> > > > > Jihoon
> > > > >
> > > > > On Mon, Apr 15, 2019 at 2:12 PM Samarth Jain <samarth.j...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > Thanks for the reply, Jihoon. I am slightly worried about simply
> > > > > > switching the parameter values as described above, since the two
> > > > > > tables are used extensively in the code base. Will raise an issue.
> > > > > >
> > > > > > On Mon, Apr 15, 2019 at 2:04 PM Jihoon Son <ghoon...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Samarth,
> > > > > > >
> > > > > > > it definitely looks like a bug to me.
> > > > > > > I'm not sure there's a workaround for this problem though.
> > > > > > >
> > > > > > > Jihoon
> > > > > > >
> > > > > > > On Mon, Apr 15, 2019 at 1:39 PM Samarth Jain <samarth.j...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We are building out a realtime ingestion pipeline using the
> > > > > > > > Kafka Indexing Service for Druid. In order to achieve better
> > > > > > > > rollup, I was trying out the Hadoop-based reingestion job
> > > > > > > > http://druid.io/docs/latest/ingestion/update-existing-data.html
> > > > > > > > which basically uses the datasource itself as the input.
> > > > > > > >
> > > > > > > > When I ran the job, it failed because it was trying to read
> > > > > > > > segment metadata from the druid_segments table and not from the
> > > > > > > > table, customprefix_segments, that I specified in the
> > > > > > > > metadataUpdateSpec.
> > > > > > > >
> > > > > > > > "metadataUpdateSpec": {
> > > > > > > >       "connectURI": "jdbc:mysql...",
> > > > > > > >       "password": "XXXXXXX",
> > > > > > > >       "segmentTable": "customprefix_segments",
> > > > > > > >       "type": "mysql",
> > > > > > > >       "user": "XXXXXXXX"
> > > > > > > > },
> > > > > > > >
> > > > > > > > Looking at the code, I see that the segmentTable specified in
> > > > > > > > the spec is actually passed in as the pending_segments table
> > > > > > > > (the 3rd param is for pending_segments and the 4th param is for
> > > > > > > > the segments table):
> > > > > > > > https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92
> > > > > > > >
> > > > > > > > As a result, the re-ingestion job tries to read from the
> > > > > > > > default segments table named DRUID_SEGMENTS, which isn't
> > > > > > > > present.
> > > > > > > >
> > > > > > > > Is this intentional or a bug?
> > > > > > > >
> > > > > > > > Is there a way to configure the segments table name for this
> > > > > > > > kind of re-ingestion job?
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
