jimj commented on issue #8916: Kinesis Indexing does not honor `transformSpec` 
of the `dataSchema`
URL: 
https://github.com/apache/incubator-druid/issues/8916#issuecomment-557192112
 
 
   I'm sorry to say this still does not appear to work for me.  I've pared my 
example down to the bare bones and will try to provide some context and 
evidence below.  Some string values have been replaced with placeholders.
   
   I have a Kinesis stream carrying very wide events (~150 dimensions).  One 
of these dimensions is "eventType".  The stream currently carries 3 possible 
event types: "foo", "bar", and "baz".
   
   My full ingestion spec is as follows:
   ```
   {
     "type": "kinesis",
     "dataSchema": {
       "dataSource": "stream_filter_poc",
       "parser": {
         "type": "string",
         "parseSpec": {
           "format": "json",
           "timestampSpec": {
             "column": "eventRecordDate",
             "format": "yyyy-MM-dd HH:mm:ss.SSS"
           },
           "dimensionsSpec": {
             "dimensions": [
               "eventType"
             ]
           }
         },
         "metricSpec": [
           {
             "type": "count",
             "name": "count"
           }
         ],
         "transformSpec": {
           "filter": {
             "type": "or",
             "fields": [
               {
                 "type": "selector",
                 "dimension": "eventType",
                 "value": "foo"
               },
               {
                 "type": "selector",
                 "dimension": "eventType",
                 "value": "bar"
               }
             ]
           }
         }
       },
       "granularitySpec": {
         "type": "uniform",
         "segmentGranularity": "DAY",
         "queryGranularity": "HOUR"
       }
     },
     "tuningConfig": {
       "type": "kinesis",
       "maxRowsPerSegment": 5000000,
       "logParseExceptions": true
     },
     "ioConfig": {
       "stream": "stream-filter-poc",
       "endpoint": "kinesis.us-east-1.amazonaws.com",
       "taskCount": 1,
       "replicas": 1,
       "taskDuration": "PT5M",
       "recordsPerFetch": 2000,
       "fetchDelayMillis": 1000
     }
   }
   ```
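
   Since the issue concerns the `transformSpec` of the `dataSchema`, it may be 
worth confirming where that key actually sits in the submitted spec.  The 
following is a minimal Python sketch, not part of Druid: the helper name 
`find_key_paths` is hypothetical, and the embedded spec is a trimmed copy of 
the one above that keeps only the nesting.  The Druid ingestion docs describe 
`transformSpec` and `metricsSpec` as direct children of `dataSchema`; whether 
the nesting reported here explains the behaviour is an assumption to verify, 
not a confirmed diagnosis.

```python
import json


def find_key_paths(node, key, path=()):
    """Return every path (tuple of keys/indexes) at which `key` occurs in a nested JSON value."""
    found = []
    if isinstance(node, dict):
        for name, child in node.items():
            child_path = path + (name,)
            if name == key:
                found.append(child_path)
            found.extend(find_key_paths(child, key, child_path))
    elif isinstance(node, list):
        for index, child in enumerate(node):
            found.extend(find_key_paths(child, key, path + (index,)))
    return found


# Trimmed copy of the supervisor spec above; only the nesting is preserved.
spec = json.loads("""
{
  "type": "kinesis",
  "dataSchema": {
    "dataSource": "stream_filter_poc",
    "parser": {
      "type": "string",
      "parseSpec": {"format": "json"},
      "metricSpec": [{"type": "count", "name": "count"}],
      "transformSpec": {"filter": {"type": "or", "fields": []}}
    },
    "granularitySpec": {"type": "uniform"}
  }
}
""")

transform_paths = find_key_paths(spec, "transformSpec")
metrics_paths = find_key_paths(spec, "metricsSpec")

# Report where each key was found, or that it is absent under that exact name.
print("transformSpec:", transform_paths or "not present")
print("metricsSpec:", metrics_paths or "not present")
```

   Running the same helper against the full spec should report the same 
paths, since the trimmed copy preserves the nesting of every object involved.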
   
   After submitting this supervisor spec and waiting a bit for some data, I 
issue the following query:
   ```
   {
     "dataSource": "stream_filter_poc",
     "dimension": "eventType",
     "metric": "count",
     "queryType": "topN",
     "threshold": 5,
     "intervals": ["2019-11-20T00:00:00.000/2019-11-21T23:59:59.999"],
     "aggregations": [{"type": "count", "name": "count"}],
     "granularity": "DAY"
   }
   ```
   
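   and the results I get back are shown below.  Note that "baz" rows are 
present even though the filter should admit only "foo" and "bar":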
   ```
   [
     {
       "timestamp": "2019-11-21T00:00:00.000Z",
       "result": [
         {
           "count": 7,
           "eventType": "foo"
         },
         {
           "count": 5,
           "eventType": "bar"
         },
         {
           "count": 7,
           "eventType": "baz"
         }
       ]
     }
   ]
   
   ```
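
   The mismatch can also be checked mechanically.  The sketch below is an 
illustration in Python, with hypothetical helper names (`allowed_values`, 
`unexpected_values`) that are not part of Druid.  It assumes a filter built 
only from "or" and "selector" nodes on a single dimension, as in the spec 
above, and uses the filter and topN response exactly as posted.

```python
def allowed_values(filter_spec, dimension):
    """Collect the values admitted for `dimension`.

    Only handles "selector" and "or" filter nodes, which is all the spec above uses.
    """
    kind = filter_spec["type"]
    if kind == "selector":
        if filter_spec["dimension"] == dimension:
            return {filter_spec["value"]}
        return set()
    if kind == "or":
        values = set()
        for field in filter_spec["fields"]:
            values |= allowed_values(field, dimension)
        return values
    raise ValueError(f"unsupported filter type: {kind}")


def unexpected_values(topn_response, dimension, allowed):
    """Return the sorted dimension values in a topN response that the filter should have dropped."""
    seen = {row[dimension] for bucket in topn_response for row in bucket["result"]}
    return sorted(seen - allowed)


# Filter copied from the transformSpec in the ingestion spec.
transform_filter = {
    "type": "or",
    "fields": [
        {"type": "selector", "dimension": "eventType", "value": "foo"},
        {"type": "selector", "dimension": "eventType", "value": "bar"},
    ],
}

# topN response copied from the query results.
response = [
    {
        "timestamp": "2019-11-21T00:00:00.000Z",
        "result": [
            {"count": 7, "eventType": "foo"},
            {"count": 5, "eventType": "bar"},
            {"count": 7, "eventType": "baz"},
        ],
    }
]

allowed = allowed_values(transform_filter, "eventType")
leaked = unexpected_values(response, "eventType", allowed)
print(leaked)
```

   An empty list would mean the filter was applied at ingestion time; any 
value listed was ingested despite the filter.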
   
   
