Ah yeah, it sounds reasonable.

> We have been configuring the segments table for our hadoop based batch
> ingestion for a long time. I am unclear though how things have been working
> out so far even with this bug. Probably the look up is done both on
> pending_segments table and segments table?
I guess it's because the getMetadataStorageTablesConfig() method is only
being used in CliInternalHadoopIndexer. And now I'm not sure why your
reingestion job failed... It should use the table specified in the spec:
https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/MetadataStorageUpdaterJob.java#L46

Jihoon

On Mon, Apr 15, 2019 at 2:35 PM Samarth Jain <samarth.j...@gmail.com> wrote:

> We have been configuring the segments table for our hadoop based batch
> ingestion for a long time. I am unclear though how things have been working
> out so far even with this bug. Probably the look up is done both on
> pending_segments table and segments table?
>
> As for the need behind this kind of job - because every Kafka indexer reads
> from a random set of partitions, it doesn't get the perfect rollup. We
> tried running compaction, but we noticed that only one compaction task runs
> for the entire interval, which would take a long time.
> So, as a more scalable alternative, I was trying out reingesting the
> datasource, where a mapper is created for every segment file.
>
> "ioConfig": {
>   "inputSpec": {
>     "type": "dataSource",
>     "ingestionSpec": {
>       "dataSource": "ds_name",
>       "intervals": [
>         "2019-03-28T22:00:00.000Z/2019-03-28T23:00:00.000Z"
>       ]
>     }
>   },
>
> On Mon, Apr 15, 2019 at 2:21 PM Jihoon Son <ghoon...@gmail.com> wrote:
>
> > Thank you for raising!
> >
> > FYI, the pending segments table is used only when appending segments, and
> > the Hadoop task always overwrites.
> > I guess you're testing the "multi" inputSpec for compaction, but the Hadoop
> > task will still read the entire input segments and overwrite them with new
> > segments.
> >
> > Jihoon
> >
> > On Mon, Apr 15, 2019 at 2:12 PM Samarth Jain <samarth.j...@gmail.com>
> > wrote:
> >
> > > Thanks for the reply, Jihoon. I am slightly worried about simply
> > > switching the parameter values as described above since the two tables
> > > are used extensively in the code base. Will raise an issue.
> > >
> > > On Mon, Apr 15, 2019 at 2:04 PM Jihoon Son <ghoon...@gmail.com> wrote:
> > >
> > > > Hi Samarth,
> > > >
> > > > It definitely looks like a bug to me.
> > > > I'm not sure there's a workaround for this problem though.
> > > >
> > > > Jihoon
> > > >
> > > > On Mon, Apr 15, 2019 at 1:39 PM Samarth Jain <samarth.j...@gmail.com>
> > > > wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > We are building out a realtime ingestion pipeline using the Kafka
> > > > > Indexing Service for Druid. In order to achieve better rollup, I was
> > > > > trying out the hadoop based reingestion job
> > > > > http://druid.io/docs/latest/ingestion/update-existing-data.html
> > > > > which basically uses the datasource itself as the input.
> > > > >
> > > > > When I ran the job, it failed because it was trying to read segment
> > > > > metadata from the druid_segments table and not from the table,
> > > > > customprefix_segments, that I specified in the metadataUpdateSpec.
> > > > >
> > > > > "metadataUpdateSpec": {
> > > > >   "connectURI": "jdbc:mysql...",
> > > > >   "password": "XXXXXXX",
> > > > >   "segmentTable": "customprefix_segments",
> > > > >   "type": "mysql",
> > > > >   "user": "XXXXXXXX"
> > > > > },
> > > > >
> > > > > Looking at the code, I see that the segmentTable specified in the
> > > > > spec is actually passed in as pending_segments table (3rd param is
> > > > > for pending_segments and 4th param is for segments table)
> > > > >
> > > > > https://github.com/apache/incubator-druid/blob/master/indexing-hadoop/src/main/java/org/apache/druid/indexer/updater/MetadataStorageUpdaterJobSpec.java#L92
> > > > >
> > > > > As a result, the re-ingestion job tries to read from the default
> > > > > segments table named DRUID_SEGMENTS which isn't present.
> > > > >
> > > > > Is this intentional or a bug?
> > > > >
> > > > > Is there a way to configure the segments table name for this kind of
> > > > > re-ingestion job?
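For readers who want to see the shape of the parameter mix-up discussed above,
here is a minimal, hypothetical Java sketch. The classes and constructor below
are simplified stand-ins and are not the actual MetadataStorageTablesConfig or
MetadataStorageUpdaterJobSpec signatures in Druid; only the "3rd param vs. 4th
param" ordering mirrors what the thread reports.

// Simplified stand-in for the tables config; the argument order mirrors the
// 3rd (pendingSegments) vs 4th (segments) positions described in the thread.
class TablesConfigSketch
{
  final String base;
  final String dataSourceTable;
  final String pendingSegmentsTable; // 3rd parameter
  final String segmentsTable;        // 4th parameter

  TablesConfigSketch(
      String base,
      String dataSourceTable,
      String pendingSegmentsTable,
      String segmentsTable
  )
  {
    this.base = base;
    this.dataSourceTable = dataSourceTable;
    this.pendingSegmentsTable = pendingSegmentsTable;
    this.segmentsTable = segmentsTable;
  }
}

class UpdaterJobSpecSketch
{
  // Value taken from the job's metadataUpdateSpec.
  private final String segmentTable = "customprefix_segments";

  // Behavior reported above: the configured table is passed in the
  // pendingSegments position and nothing is passed for the segments table,
  // so (in the real code) the job falls back to the default druid_segments
  // table, which does not exist in this deployment.
  TablesConfigSketch buggyMapping()
  {
    return new TablesConfigSketch(null, null, segmentTable, null);
  }

  // What the thread suggests it should be: the configured table passed in
  // the segments-table (4th) position instead.
  TablesConfigSketch intendedMapping()
  {
    return new TablesConfigSketch(null, null, null, segmentTable);
  }
}

As noted in the replies, a real fix would have to be checked against every
place the pending_segments and segments tables are read, so treat this only as
an illustration of the argument ordering.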