On 2019/06/24 18:07:42, Jihoon Son <jihoon...@apache.org> wrote:
> Hi,
>
> IngestSegmentFirehose is a firehose that reads data from an existing Druid
> dataSource. Since its format is always fixed (Druid's internal segment
> format), it doesn't need a parser; it needs only a transformSpec to
> transform data properly during reingestion. Could you tell me why you want
> to specify the parser with IngestSegmentFirehose?
>
> Jihoon
>
> On Mon, Jun 24, 2019 at 10:36 AM Косов Максим Владимирович <mko...@cbgr.ru>
> wrote:
>
> > Hello,
> > I've become interested in Apache Druid and decided to study it. As an
> > exercise, I'm trying to send data from one dataSource to another. For
> > this, I'm using the ingestSegment firehose. In the parser description, I'm
> > adding a flattenSpec (the parser type is string, the format is json).
> > Here is the configuration:
> > {
> >   "type" : "index",
> >   "spec" : {
> >     "dataSchema" : {
> >       "dataSource" : "cp37-data8",
> >       "parser" : {
> >         "type" : "string",
> >         "parseSpec" : {
> >           "format" : "json",
> >           "timestampSpec" : {
> >             "column" : "__time",
> >             "format" : "auto"
> >           },
> >           "flattenSpec": {
> >             "useFieldDiscovery": true,
> >             "fields": [
> >               {
> >                 "type": "jq",
> >                 "name": "resourceItemStatusDetails_updateDateTime",
> >                 "expr": ".fullDocument_data | fromjson.resourceItemStatusDetails.updateDateTime.\"$date\""
> >               }
> >             ]
> >           },
> >           "dimensionsSpec" : {
> >             "dimensions": [
> >               "operationType",
> >               "databaseName",
> >               "collectionName",
> >               "fullDocument_id",
> >               "fullDocument_docId",
> >               "resourceItemStatusDetails_updateDateTime",
> >               {
> >                 "type": "long",
> >                 "name": "clusterTime"
> >               }
> >             ],
> >             "dimensionExclusions" : [],
> >             "spatialDimensions" : []
> >           }
> >         }
> >       },
> >       "metricsSpec" : [
> >         {
> >           "type" : "count",
> >           "name" : "count"
> >         }
> >       ],
> >       "granularitySpec" : {
> >         "type" : "uniform",
> >         "segmentGranularity" : "DAY",
> >         "queryGranularity" : "NONE"
> >       }
> >     },
> >     "ioConfig" : {
> >       "type" : "index",
> >       "firehose" : {
> >         "type" : "ingestSegment",
> >         "dataSource" : "cp-all-buffer",
> >         "interval" : "2018-01-01/2020-01-03"
> >       },
> >       "appendToExisting" : false
> >     },
> >     "tuningConfig" : {
> >       "type" : "index",
> >       "maxRowsPerSegment" : 100000,
> >       "maxRowsInMemory" : 1000
> >     }
> >   }
> > }
> >
> > The task itself executes successfully, but the settings I specified in the
> > parser are ignored during execution. I've taken a look at the Druid source
> > code, and it seems that I have found a bug.
> > If you look at the IngestSegmentFirehoseFactory class, you'll see that only
> > the TransformSpec (extracted from the parser) is passed to the
> > IngestSegmentFirehose constructor, not the parser itself:
> > final TransformSpec transformSpec = TransformSpec.fromInputRowParser(inputRowParser);
> > return new IngestSegmentFirehose(adapters, transformSpec, dims, metricsList, dimFilter);
> >
> > Next, in IngestSegmentFirehose, a transformer is created and applied to each row:
> >
> > final InputRow inputRow = rowYielder.get();
> > rowYielder = rowYielder.next(null);
> > return transformer.transform(inputRow);
> >
> > At this stage, the parser's parse method is never called, which explains
> > why the parser settings in my example were ignored.
> > This raises the question: why not pass the parser itself to the
> > IngestSegmentFirehose constructor? If you look at the implementation of
> > TransformSpec.fromInputRowParser, you'll see that the parser is always
> > either a transforming decorator or an error is thrown, and in such
> > decorators the parse methods already call the transformer in addition to
> > parsing:
> >
> >
> > parser.parseBatch(row).stream().map(transformer::transform).collect(Collectors.toList());
> >
> > Could anyone please clarify whether this is intentional behaviour or a bug?
> > Thanks!
> >
> >
Hello,
Let me describe my case. I have a common dataSource "cp-all-buffer", which
collects data that arrives in different shapes (all in JSON format). These
different records share common system fields, and the main content in
"cp-all-buffer" is stored in the fullDocument column as a JSON string.
A sample record looks like this:
{
  "clusterTime": 6699664333355352000,
  "operationType": "insert",
  "documentKey": {
    "_id": {
      "$oid": "5cf9fd342946bf071ac1fcae"
    }
  },
  "databaseName": "service-prop-37",
  "fullDocument": { // different model }
}
Next, while transferring the data into the dataSource "cp37-data8", I need to
split the fullDocument field into a set of columns. parseSpec allows advanced
operations on JSON, including the use of "jq", which is important here, while
transformSpec unfortunately provides much less flexibility.
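For comparison, as far as I can tell a transformSpec only lets me apply Druid
expression functions to columns that already exist as flat values, roughly in
this shape (the output name and expression below are purely illustrative, not
part of my actual spec), and I haven't found an expression that could parse a
JSON string such as fullDocument the way the jq flattenSpec can:

"transformSpec": {
  "transforms": [
    {
      "type": "expression",
      "name": "databaseAndCollection",
      "expression": "concat(\"databaseName\", '/', \"collectionName\")"
    }
  ]
}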
In my opinion, even though Druid's internal segment format is always fixed, we
could still use the parser, since it gives us more flexibility. The code base
already has parser decorators that are used together with a transformer, for
example the TransformingInputRowParser class, so this shouldn't be much of a
problem.
I wanted to write my own Firehose at first, but then I thought that perhaps we
could modify IngestSegmentFirehose instead, so everyone could benefit from
these additions.
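As a very rough sketch of what I have in mind (the constructor signature and
the toMap() helper below are hypothetical, not existing Druid code, and the
exact raw-row type the parser would need to receive is an open question),
IngestSegmentFirehoseFactory could hand the parser itself to the firehose
instead of only the TransformSpec it extracts from it:

// Hypothetical sketch, not a real patch:
// in IngestSegmentFirehoseFactory.connect(...), pass the parser itself
return new IngestSegmentFirehose(adapters, inputRowParser, dims, metricsList, dimFilter);

// ...and in IngestSegmentFirehose, call the parser rather than only the
// transformer, so that its parseBatch (which already chains the transformer,
// as in TransformingInputRowParser above) runs for every row
final InputRow inputRow = rowYielder.get();
rowYielder = rowYielder.next(null);
return parser.parseBatch(toMap(inputRow)).get(0); // toMap() is a hypothetical helper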
If you need any additional information, please let me know.