[GitHub] [pinot] lakshmanan-v commented on issue #7422: Support aggregate functions to better handle timeseries data

GitBox Thu, 13 Jan 2022 10:38:14 -0800


lakshmanan-v commented on issue #7422:
URL: https://github.com/apache/pinot/issues/7422#issuecomment-1012405273



   > Here is the document for Pre-Aggregation Gapfilling function design: 
https://docs.google.com/document/d/1FQbsTVywWLlmFaeUcYqDkb-XRWCpVhZe10BZyD2VzcU/edit#heading=h.brbishq9xbz1
   > 
   > @siddharthteotia @jackjlli @amrishlal @lakshmanan-v and I had couple of 
rounds of design meetings. Thanks a lot for the effort! Here is the meeting 
minutes:
   > 
   > 1. We went through the design document and discussed the query design and 
implementation design. We reached the consensus that the implementation looks 
good.
   > 2. We debated about the query design. We came out two feasible options:
   >    flat query option:
   >    SELECT
   >    PREAGGREGATEGAPFILL(DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', 
'1:MILLISECONDS:EPOCH', '15:MINUTES'), 
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS', '2021-10-01 
09:00:00+00', '2021-10-01 12:00:00+00', '15:MINUTES', TIME_SERIES_ON(eventTime, 
lotId), FILTERING(is_occupied = 1), FILL(is_occupied,FILL_TYPE_PREVIOUS)) AS 
time_col,
   >    SUM(is_occupied) AS occupied_slots_count
   >    FROM parking_data
   >    WHERE event_time >= '10/01/2021 09:00:00+00' AND  event_time <= 
'10/01/2021 12:00:00+00'
   >    GROUP BY time_col
   >    ORDER BY time_col
   > 
   > and subquery option: SELECT time_col, SUM(is_occupied) AS 
occupied_slots_count FROM ( SELECT 
PREAGGREGATEGAPFILL(DATETIMECONVERT(event_time, '1:MILLISECONDS:EPOCH', 
'1:MILLISECONDS:EPOCH', '15:MINUTES'), 
'1:MILLISECONDS:SIMPLE_DATE_FORMAT:yyyy-MM-dd HH:mm:ss.SSS', '2021-10-01 
09:00:00+00', '2021-10-01 12:00:00+00', '15:MINUTES', TIME_SERIES_ON(eventTime, 
lotId), FILL(is_occupied,FILL_TYPE_PREVIOUS)) AS time_col, is_occupied FROM 
parking_data WHERE event_time >= '10/01/2021 09:00:00+00' AND event_time <= 
'10/01/2021 12:00:00+00') WHERE isOccupied = 1 GROUP BY time_col ORDER BY 
time_col
   > 
   > As for the first option (flat query option), the PreAggregateGapFill 
function contains the hint about the pre-aggregate gapfilling. Everything about 
pre-aggregation is contained inside this function. But its problem is that the 
first selection expression will generate more rows. The aggregation selection 
expression will reduce the number of rows. This is counterintuitive. The 
execution order is not explicit even though the document can provide more 
details. Also for first option, we need define FILTERING inside the 
PreAggregateGapFill function in order to filter out the gapfilled result. The 
second option
   > 
   > As for the second option (the subquery option), it separates the 
gapfilling step and the aggregation step. The inner query logically defines the 
gapfilling step. The aggregation step is defined by the outer query. Its 
advantage is that the execution step is more explicit than the flat query 
option. Its problem is that the subquery feature is not in place. It might 
conflict with the future implementation of subquery. The solution is to make it 
specific in order to prevent the conflict. When subquery is in place, 
pre-aggregation gapfilling can migrated to leverage the subquery feature or be 
compatible with subquery feature.
   
   cc @Jackie-Jiang @mayankshriv @kishoreg @elonazoulay 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [pinot] lakshmanan-v commented on issue #7422: Support aggregate functions to better handle timeseries data

Reply via email to