reallocf commented on issue #5268:
URL: https://github.com/apache/incubator-pinot/issues/5268#issuecomment-616272779


   Hey @npawar - I'm interested in helping out on this one. First initiative, 
so bear with me if my understanding is wrong about everything 😄. A question on 
the requirements:
   
   The idea is that we should be able to filter out data during ingestion, so 
if we have a `pinotSchema.json` like
   ```
   {
     "schemaName": "events",
     "dimensionFieldSpecs": [
       {
         "name": "userId",
         "dataType": "LONG",
         "transformFunction": "Groovy({userID}, userID)"
       },
       {
         "name": "fullName",
         "dataType": "STRING",
         "transformFunction": "Groovy({firstName+' '+lastName}, firstName, lastName)"
       },
       {
         "name": "bids",
         "dataType": "INT",
         "singleValueField": false
       },
       {
         "name": "maxBid",
         "dataType": "INT",
         "transformFunction": "Groovy({bids.max{ it.toBigDecimal() }}, bids)"
       }
     ],
     "metricFieldSpecs": [
       {
         "name": "impressions",
         "dataType": "LONG",
         "transformFunction": "Groovy({eventType == 'IMPRESSION' ? 1: 0}, eventType)"
       },
       {
         "name": "clicks",
         "dataType": "LONG",
         "transformFunction": "Groovy({eventType == 'CLICK' ? 1: 0}, eventType)"
       },
       {
         "name": "cost",
         "dataType": "DOUBLE"
       },
       {
         "name": "daysSinceEpoch",
         "dataType": "INT",
         "transformFunction": "Groovy({timestamp/(1000*60*60*24)}, timestamp)"
       }
     ],
     "timeFieldSpec": {
       "incomingGranularitySpec": {
         "name": "hoursSinceEpoch",
         "dataType": "LONG",
         "timeFormat" : "EPOCH",
         "timeType": "HOURS",
         "transformFunction": "Groovy({timestamp/(1000*60*60)}, timestamp)"
       }
     }
   }
   ```
   we would want to add a new top-level element to that JSON, holding a 
transform function, like
   ```
   {
     "schemaName": "events",
     "filter": "Groovy({cost > 42}, cost)",
     "dimensionFieldSpecs": [
       {
         "name": "userId",
         "dataType": "LONG",
         "transformFunction": "Groovy({userID}, userID)"
       },
       ...
     ],
     ...
   }
   ```
   Then we would apply that transformFunction either
   a) On source columns - applying the rest of the transformations AFTER 
filtering
   or
   b) On transformed columns - applying the filtering after the rest of the 
transformations.
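
To make the difference between a) and b) concrete, here is a rough Python sketch. It is not Pinot code: `transform`, `ingest_filter_first` and `ingest_transform_first` are hypothetical names, the Groovy expressions are replaced by plain Python stand-ins, and it assumes the filter expression selects the records to keep (the opposite convention would just negate the predicate).

```python
# Hypothetical stand-ins for the schema's Groovy expressions; not Pinot APIs.
MS_PER_DAY = 1000 * 60 * 60 * 24


def transform(record):
    """Apply column transforms loosely modeled on the schema above."""
    out = dict(record)
    out["fullName"] = record["firstName"] + " " + record["lastName"]
    out["daysSinceEpoch"] = record["timestamp"] // MS_PER_DAY
    return out


def ingest_filter_first(records, predicate):
    """Option a): evaluate the filter on source columns, then transform."""
    return [transform(r) for r in records if predicate(r)]


def ingest_transform_first(records, predicate):
    """Option b): transform every record, then evaluate the filter."""
    return [t for t in map(transform, records) if predicate(t)]


records = [
    {"firstName": "Ada", "lastName": "Lovelace", "cost": 50.0, "timestamp": 172_800_000},
    {"firstName": "Alan", "lastName": "Turing", "cost": 10.0, "timestamp": 86_400_000},
]

# Assumption: the predicate marks records to keep, mirroring {cost > 42}.
keep_expensive = lambda r: r["cost"] > 42
kept_a = ingest_filter_first(records, keep_expensive)
kept_b = ingest_transform_first(records, keep_expensive)

# A filter on a derived column can only be evaluated under option b).
recent = lambda r: r["daysSinceEpoch"] >= 2
kept_recent = ingest_transform_first(records, recent)
```

In this sketch both orders give the same result for a filter on a source column such as `cost`, but a) skips the transform work for dropped records, while only b) can filter on a derived column such as `daysSinceEpoch`.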
   
   Is that all right? Am I on the right track?

