Hi Maciej,

I believe it would be useful to either fix the documentation or fix the implementation; I'll leave it to the community to comment on. The code currently disallows intervals given in months and years, because they do not represent a fixed amount of time. A month can be 28, 29, 30, or 31 days. A year is certainly 12 months, but is it 360 days (a convention sometimes used in finance), 365 days, or 366 days?
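To illustrate the distinction (this is a hypothetical sketch, not Spark's actual implementation of TimeWindow.getIntervalInMicroseconds): every unit up to and including days has a constant length in microseconds, so an interval expressed in those units can be converted to an exact window width, while "month" and "year" cannot.

```python
# Hypothetical sketch of fixed-length interval conversion.
# Units at or below "day" have a constant microsecond length;
# "month" and "year" are deliberately absent because their length
# depends on where they fall in the calendar.
MICROS_PER_UNIT = {
    "microsecond": 1,
    "millisecond": 1_000,
    "second": 1_000_000,
    "minute": 60 * 1_000_000,
    "hour": 3_600 * 1_000_000,
    "day": 24 * 3_600 * 1_000_000,
}

def interval_in_micros(amount: int, unit: str) -> int:
    """Convert a fixed-length interval to microseconds, or raise."""
    unit = unit.rstrip("s")  # accept "days" as well as "day"
    if unit not in MICROS_PER_UNIT:
        raise ValueError(
            f"Intervals in {unit}s are not supported: "
            f"a {unit} is not a fixed amount of time."
        )
    return amount * MICROS_PER_UNIT[unit]
```

Under this scheme "999 days" converts cleanly even though it is longer than a month, while "1 month" is rejected, which matches the behavior you observed.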
Therefore we could either:

1) Allow windowing when intervals are given in days or smaller units, even though the interval could be as long as 365 days, and fix the documentation.
2) Explicitly disallow such intervals, since there may be a lot of data in a given window, although partial aggregations should help with that.

My thoughts are to go with 1. What do you think?

Best,
Burak

On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote:

> Hi,
>
> Can I ask for some clarification regarding the intended behavior of
> window / TimeWindow?
>
> The PySpark documentation states that "Windows in the order of months are
> not supported". This is further confirmed by the checks in
> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l).
>
> Surprisingly enough, we can pass an interval much larger than a month by
> expressing it in days or another unit of higher precision. So this fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while the following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results that look sensible at first glance.
>
> Is this a matter of faulty validation logic (months are assigned only if
> there is a match against years or months, https://git.io/vMPdi), or is it
> expected behavior that I have simply misunderstood?
>
> --
> Best,
> Maciej