cbgr opened a new issue #7976: ParseSpec setting are being ignored with ingestSegment Firehose
URL: https://github.com/apache/incubator-druid/issues/7976

Hello,

I've recently become interested in Apache Druid and decided to study it. As an exercise, I'm re-ingesting data from one data source into another using the ingestSegment firehose. In the parser definition I've added a flattenSpec (the parser type is string, the format is json). Here is the configuration:

> {
>   "type" : "index",
>   "spec" : {
>     "dataSchema" : {
>       "dataSource" : "cp37-data8",
>       "parser" : {
>         "type" : "string",
>         "parseSpec" : {
>           "format" : "json",
>           "timestampSpec" : {
>             "column" : "__time",
>             "format" : "auto"
>           },
>           "flattenSpec": {
>             "useFieldDiscovery": true,
>             "fields": [
>               {
>                 "type": "jq",
>                 "name": "resourceItemStatusDetails_updateDateTime",
>                 "expr": ".fullDocument_data | fromjson.resourceItemStatusDetails.updateDateTime.\"$date\""
>               }
>             ]
>           },
>           "dimensionsSpec" : {
>             "dimensions": [
>               "operationType",
>               "databaseName",
>               "collectionName",
>               "fullDocument_id",
>               "fullDocument_docId",
>               "resourceItemStatusDetails_updateDateTime",
>               {
>                 "type": "long",
>                 "name": "clusterTime"
>               }
>             ],
>             "dimensionExclusions" : [],
>             "spatialDimensions" : []
>           }
>         }
>       },
>       "metricsSpec" : [
>         {
>           "type" : "count",
>           "name" : "count"
>         }
>       ],
>       "granularitySpec" : {
>         "type" : "uniform",
>         "segmentGranularity" : "DAY",
>         "queryGranularity" : "NONE"
>       }
>     },
>     "ioConfig" : {
>       "type" : "index",
>       "firehose" : {
>         "type" : "ingestSegment",
>         "dataSource" : "cp-all-buffer",
>         "interval" : "2018-01-01/2020-01-03"
>       },
>       "appendToExisting" : false
>     },
>     "tuningConfig" : {
>       "type" : "index",
>       "maxRowsPerSegment" : 100000,
>       "maxRowsInMemory" : 1000
>     }
>   }
> }

The task itself completes successfully, but the settings I specified in the parser are ignored during execution. I've looked at the Druid source code, and I think I may have found a bug.
If you look at the IngestSegmentFirehoseFactory class, you'll see that only the TransformSpec (obtained from the parser) is passed to the IngestSegmentFirehose constructor, not the parser itself:

> final TransformSpec transformSpec = TransformSpec.fromInputRowParser(inputRowParser);
> return new IngestSegmentFirehose(adapters, transformSpec, dims, metricsList, dimFilter);

Next, IngestSegmentFirehose creates a transformer and applies it to each row:

> final InputRow inputRow = rowYielder.get();
> rowYielder = rowYielder.next(null);
> return transformer.transform(inputRow);

At this point the parser's parse method is never called, which explains why the parser settings in my example were ignored.

This raises the question: why don't we just pass the parser itself to the IngestSegmentFirehose constructor? Looking at the implementation of TransformSpec.fromInputRowParser, the parser is always either a decorator that carries a transformer, or the method fails with an error. So in such parsers the parse methods always call the transformer in addition to parsing:

> parser.parseBatch(row).stream().map(transformer::transform).collect(Collectors.toList());

Could anyone please clarify whether this is intentional behaviour or a bug? Thanks!
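
To make the question concrete, here is a minimal, self-contained sketch of the two code paths as I understand them. It does not use any Druid classes: `RowParser`, `flatteningParser`, `taggingTransformer`, `decorate` and `firehosePath` are hypothetical stand-ins I made up for illustration, rows are plain maps, and the sample value is invented. It only models the behaviour described above and is not the actual Druid implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

public class ParserBypassSketch {

    /** Hypothetical stand-in for a parser: one raw record in, a batch of rows out. */
    interface RowParser {
        List<Map<String, Object>> parseBatch(Map<String, Object> raw);
    }

    /** Stand-in for a parseSpec with a flattenSpec: lifts a nested value to a top-level column. */
    static RowParser flatteningParser() {
        return raw -> {
            Map<String, Object> row = new HashMap<>(raw);
            Object nested = raw.get("details");
            if (nested instanceof Map) {
                row.put("details_updated", ((Map<?, ?>) nested).get("updated"));
            }
            return List.of(row);
        };
    }

    /** Stand-in for a transformer: copies the row and marks it as transformed. */
    static UnaryOperator<Map<String, Object>> taggingTransformer() {
        return row -> {
            Map<String, Object> copy = new HashMap<>(row);
            copy.put("transformed", true);
            return copy;
        };
    }

    /** Stand-in for the decorating parser: parse first, then transform every parsed row. */
    static RowParser decorate(RowParser parser, UnaryOperator<Map<String, Object>> transformer) {
        return raw -> parser.parseBatch(raw).stream().map(transformer).collect(Collectors.toList());
    }

    /** Stand-in for the firehose path: only the transformer runs, the parser is never called. */
    static Map<String, Object> firehosePath(Map<String, Object> segmentRow,
                                            UnaryOperator<Map<String, Object>> transformer) {
        return transformer.apply(segmentRow);
    }

    public static void main(String[] args) {
        // Invented sample record with one nested field.
        Map<String, Object> raw = Map.of("details", Map.of("updated", "2019-01-01"));
        UnaryOperator<Map<String, Object>> transformer = taggingTransformer();

        Map<String, Object> viaParser = decorate(flatteningParser(), transformer).parseBatch(raw).get(0);
        Map<String, Object> viaFirehose = firehosePath(raw, transformer);

        // Both rows are transformed, but only the parser path produces the flattened column.
        System.out.println("parser path has flattened column:   " + viaParser.containsKey("details_updated"));
        System.out.println("firehose path has flattened column: " + viaFirehose.containsKey("details_updated"));
    }
}
```

In this model both paths apply the transformer, but the flattened column only appears when the parser itself is invoked, which matches what I observe with my flattenSpec.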
