[GitHub] [incubator-pinot] reallocf commented on issue #5268: Filter during ingestion

GitBox Sun, 19 Apr 2020 19:19:13 -0700


reallocf commented on issue #5268:
URL: https://github.com/apache/incubator-pinot/issues/5268#issuecomment-616272779



   Hey @npawar - I'm interested in helping out on this one. This is my first 
contribution, so bear with me if I've misunderstood anything 😄. A question on 
the requirements:
   
   The idea is that we should be able to filter out data during ingestion, so 
if we have a `pinotSchema.json` like
   ```
   {
     "schemaName": "events",
     "dimensionFieldSpecs": [
       {
         "name": "userId",
         "dataType": "LONG",
         "transformFunction": "Groovy({userID}, userID)"
       },
       {
         "name": "fullName",
         "dataType": "STRING",
         "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
       },
       {
         "name": "bids",
         "dataType": "INT",
         "singleValueField": false
       },
       {
         "name": "maxBid",
         "dataType": "INT",
         "transformFunction": "Groovy({bids.max{ it.toBigDecimal() }}, bids)"
       }
     ],
     "metricFieldSpecs": [
       {
         "name": "impressions",
         "dataType": "LONG",
         "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1: 0}, eventType)"
       },
       {
         "name": "clicks",
         "dataType": "LONG",
         "transformFunction": "Groovy({eventType == 'CLICK' ? 1: 0}, eventType)"
       },
       {
         "name": "cost",
         "dataType": "DOUBLE"
       },
       {
         "name": "daysSinceEpoch",
         "dataType": "INT",
         "transformFunction": "Groovy({timestamp/(1000*60*60*24)}, timestamp)"
       }
     ],
     "timeFieldSpec": {
       "incomingGranularitySpec": {
         "name": "hoursSinceEpoch",
         "dataType": "LONG",
         "timeFormat" : "EPOCH",
         "timeType": "HOURS",
         "transformFunction": "Groovy({timestamp/(1000*60*60)}, timestamp)"
       }
     }
   }
   ```
   we would want to add a new top-level `filter` element to that JSON, written 
in the same style as a transformFunction, like
   ```
   {
     "schemaName": "events",
     "filter": "Groovy({cost > 42}, cost)",
     "dimensionFieldSpecs": [
       {
         "name": "userId",
         "dataType": "LONG",
         "transformFunction": "Groovy({userID}, userID)"
       },
       ...
     ],
     ...
   }
   ```
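
One way to read a value like `Groovy({cost > 42}, cost)` is as a script body plus a list of input column names. The Python sketch below splits such a string into those two parts. It is only an illustration of the string format used in the examples above, not Pinot's actual parser, and `parse_groovy_expression` is a hypothetical helper name.

```python
import re

# Matches "Groovy({<script>}, <arg1>, <arg2>, ...)".
# The script part is greedy so nested braces such as "bids.max{ ... }" stay intact.
_GROOVY_PATTERN = re.compile(
    r"^Groovy\(\{(?P<script>.*)\}\s*,\s*(?P<args>[^{}]*)\)$",
    re.DOTALL,
)


def parse_groovy_expression(expression):
    """Split a Groovy(...) expression into (script, [input column names])."""
    match = _GROOVY_PATTERN.match(expression.strip())
    if match is None:
        raise ValueError(f"not a Groovy(...) expression: {expression!r}")
    script = match.group("script").strip()
    # Input columns are the comma-separated names after the closing brace.
    args = [name.strip() for name in match.group("args").split(",") if name.strip()]
    return script, args
```

For the proposed filter, this would yield the script `cost > 42` and the single input column `cost`.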
   Then we would apply that filter function either
   a) On source columns - applying the rest of the transformations AFTER 
filtering
   or
   b) On transformed columns - applying the filtering after the rest of the 
transformations.
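
To make the difference between (a) and (b) concrete, here is a minimal Python sketch of the two orderings. It is not Pinot code: `ingest`, `cost_filter` and `clicks_filter` are hypothetical names, the transforms are plain Python functions standing in for the Groovy ones, and it assumes that a row is kept when the filter evaluates to true (the opposite convention would work the same way).

```python
def ingest(records, transforms, keep, filter_on_source):
    """Apply `transforms` and the `keep` predicate to each record.

    records          -- list of dicts holding the source columns
    transforms       -- dict: output column name -> function(record) -> value
    keep             -- predicate(row) -> bool; True keeps the row (assumption)
    filter_on_source -- True for option (a), False for option (b)
    """
    rows = []
    for record in records:
        if filter_on_source and not keep(record):
            continue  # option (a): dropped before any transform runs
        row = dict(record)
        for column, fn in transforms.items():
            row[column] = fn(record)
        if not filter_on_source and not keep(row):
            continue  # option (b): dropped after all transforms ran
        rows.append(row)
    return rows


# Toy source records and one derived column, mirroring the schema above.
records = [
    {"eventType": "CLICK", "cost": 50.0},
    {"eventType": "IMPRESSION", "cost": 10.0},
    {"eventType": "CLICK", "cost": 5.0},
]
transforms = {"clicks": lambda r: 1 if r["eventType"] == "CLICK" else 0}


def cost_filter(row):
    # Filter on a source column: usable under both options.
    return row["cost"] > 42


def clicks_filter(row):
    # Filter on a derived column: only present once transforms have run.
    return row.get("clicks") == 1
```

With `cost_filter` both orderings keep the same single row, because `cost` is a source column. With `clicks_filter`, option (a) keeps nothing, since `clicks` does not exist until the transforms have run, while option (b) keeps the two CLICK rows. Option (a) also skips the transform work for rows that end up being dropped.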
   
   Is that all right? Am I on the right track?


