+1, we should just fix the error to explain why months aren't allowed and suggest that you manually specify some number of days.
On Wed, Jan 18, 2017 at 9:52 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com> wrote: > Thanks for the response Burak, > > As any sane person I try to steer away from the objects which have both > calendar and unsafe in their fully qualified names but if there is no > bigger picture I missed here I would go with 1 as well. And of course fix > the error message. I understand this has been introduced with structured > streaming in mind, but it is an useful feature in general, not only in high > precision scale. To be honest I would love to see some generalized version > which could be used (I mean without hacking) with arbitrary numeric > sequence. It could address at least some scenarios in which people try to > use window functions without PARTITION BY clause and fail miserably. > > Regarding ambiguity... Sticking with days doesn't really resolve the > problem, does it? If one were to nitpick it doesn't look like this > implementation even touches all the subtleties of DST or leap second. > > > > > On 01/18/2017 05:52 PM, Burak Yavuz wrote: > > Hi Maciej, > > I believe it would be useful to either fix the documentation or fix the > implementation. I'll leave it to the community to comment on. The code > right now disallows intervals provided in months and years, because they > are not a "consistently" fixed amount of time. A month can be 28, 29, 30, > or 31 days. A year is 12 months for sure, but is it 360 days (sometimes > used in finance), 365 days or 366 days? > > Therefore we could either: > 1) Allow windowing when intervals are given in days and less, even > though it could be 365 days, and fix the documentation. > 2) Explicitly disallow it as there may be a lot of data for a given > window, but partial aggregations should help with that. > > My thoughts are to go with 1. What do you think? > > Best, > Burak > > On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz < > mszymkiew...@gmail.com> wrote: > >> Hi, >> >> Can I ask for some clarifications regarding intended behavior of window / >> TimeWindow? >> >> PySpark documentation states that "Windows in the order of months are not >> supported". This is further confirmed by the checks in >> TimeWindow.getIntervalInMicroseconds (https://git.io/vMP5l). >> >> Surprisingly enough we can pass interval much larger than a month by >> expressing interval in days or another unit of a higher precision. So this >> fails: >> >> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month")) >> >> while following is accepted: >> >> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days")) >> >> with results which look sensible at first glance. >> >> Is it a matter of a faulty validation logic (months will be assigned only >> if there is a match against years or months https://git.io/vMPdi) or >> expected behavior and I simply misunderstood the intentions? >> >> -- >> Best, >> Maciej >> >> > >