npawar commented on issue #3513: DATETIMECONVERT udf does not work for customized timezone and bucket size > 1 day URL: https://github.com/apache/incubator-pinot/issues/3513#issuecomment-440098316 I was able to reproduce this issue. The root cause for this, is this switch case for DAYS in [BaseDateTimeTransformer](https://github.com/apache/incubator-pinot/blob/master/pinot-core/src/main/java/com/linkedin/pinot/core/operator/transform/transformer/datetime/BaseDateTimeTransformer.java). This is where the _dateTimeTruncate is created, which is needed for converting millis to sdf. This only happens in cases where final output is expected in SDF ``` case DAYS: _dateTimeTruncate = (dateTime) -> _outputDateTimeFormatter.print( dateTime.withDayOfMonth( (dateTime.getDayOfMonth()/sz)*sz ).dayOfMonth().roundFloorCopy() ); break; ``` The problem is `dateTime.getDayOfMonth()` will return values in [1-31]. `sz` is the output granularity. If the granularity happens to be greater than the dayOfMonth, division will give us `0`. For example, in your case, outputGranularity=7 DAYS, and hence sz=7. So for days 1-6 this will fail. Which is also why it works when you try `1:DAYS` because there is no `0` generated in the division. The fix for this would depend on what the right output for this should be. The current logic will always create output based on the local boundary (for cases where output is in SDF) For example: Below is a list of UTC millis, starting 2018-09-19T00:00:00 to 2018-10-09T00:00:00. The second column is the output of applying `dateTimeConvert(millis, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '7:DAYS')` The third column is the output of applying `dateTimeConvert(millis, '1:MILLISECONDS:EPOCH', '1:DAYS:SIMPLE_DATE_FORMAT:yyyyMMdd', '7:DAYS')` **They are not equivalent.** ``` 1537315200000L, 20180914, 17787 1537401600000L, 20180914, 17794 1537488000000L, 20180921, 17794 1537574400000L, 20180921, 17794 1537660800000L, 20180921, 17794 1537747200000L, 20180921, 17794 1537833600000L, 20180921, 17794 1537920000000L, 20180921, 17794 1538006400000L, 20180921, 17801 1538092800000L, 20180928, 17801 1538179200000L, 20180928, 17801 1538265600000L, 20180928, 17801 ``` In the first case, bucketing is done starting from **epoch start**. In the second case, bucketing is done such that the **start of the month resets the start of the bucketing** (this is because of the use of dayOfMonth to set the day) . Not stating that this approach is right or wrong, just explaining how it is working at the moment. This is how it would work for other granularities as well (hours bucketing will reset at the day boundary, etc) The fixes for this could be: 1) Preserve this behavior. In that case for days 1-7 of the month (2018-09-01 to 2018-09-07), we would get 1 bucket (2018-09-01), for days 8-14 we would get bucket 2 (2018-09-08). This would be the easiest fix, we would only need to make some changes to the division in that code block pasted above. 2) Make the bucketing for SDF also work starting epoch. This would need some refactoring of how this bucketing is done for SDF cases. Not sure if anyone is using this UDF and depending on the particular behavior. (@subprotocol ) 3) Leave this as is for now. For your specific usecase, you can use `(millis, '1:MILLISECONDS:EPOCH', '1:DAYS:EPOCH', '7:DAYS')` and convert the millis later to what you need. @kishoreg
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
