Hi Maciej,

I believe it would be useful to either fix the documentation or fix the
implementation; I'll leave it to the community to comment on which. The code
right now disallows intervals provided in months and years, because they
don't correspond to a fixed amount of time. A month can be 28, 29, 30, or
31 days. A year is certainly 12 months, but is it 360 days (as sometimes
used in finance), 365 days, or 366 days?
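
As a quick illustration of why month-based intervals are ambiguous (plain
java.time, nothing Spark-specific):

  import java.time.LocalDate
  import java.time.temporal.ChronoUnit.DAYS

  // "one month later" spans a different number of days depending on the start
  val jan = LocalDate.of(2017, 1, 1)
  val feb = LocalDate.of(2017, 2, 1)
  DAYS.between(jan, jan.plusMonths(1))  // 31
  DAYS.between(feb, feb.plusMonths(1))  // 28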

Therefore we could either:
  1) Allow windowing when intervals are given in days or smaller units, even
though such an interval could effectively span a year (see the sketch below),
and fix the documentation to match.
  2) Explicitly disallow it, since a single large window may accumulate a lot
of data, though partial aggregations should help with that.
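
For reference, here is the kind of query option 1 keeps valid; a sketch
assuming spark.implicits._ is in scope (window lives in
org.apache.spark.sql.functions):

  import org.apache.spark.sql.functions.window

  // a "year" expressed as a fixed 365-day duration, which the current
  // validation already accepts
  Seq("2017-01-01").toDF("date")
    .groupBy(window($"date", "365 days"))
    .count()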

My inclination is to go with 1. What do you think?

Best,
Burak

On Wed, Jan 18, 2017 at 10:18 AM, Maciej Szymkiewicz <mszymkiew...@gmail.com
> wrote:

> Hi,
>
> Can I ask for some clarification regarding the intended behavior of window /
> TimeWindow?
>
> PySpark documentation states that "Windows in the order of months are not
> supported". This is further confirmed by the checks in 
> TimeWindow.getIntervalInMicroseconds
> (https://git.io/vMP5l).
>
> Surprisingly enough, we can pass an interval much larger than a month by
> expressing it in days or another unit of higher precision. So this
> fails:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "1 month"))
>
> while the following is accepted:
>
> Seq("2017-01-01").toDF("date").groupBy(window($"date", "999 days"))
>
> with results which look sensible at first glance.
>
> Is this a matter of faulty validation logic (months are assigned only when
> there is a match against years or months, https://git.io/vMPdi), or is this
> the expected behavior and I simply misunderstood the intentions?
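>
> A minimal sketch of what the check effectively amounts to, paraphrasing the
> linked code (the parsed interval stores months and microseconds separately):
>
>   // "1 month" sets interval.months, so this guard fires; "999 days" only
>   // sets interval.microseconds, so it slips through
>   if (interval.months > 0) {
>     throw new IllegalArgumentException(
>       "Intervals greater than a month are not supported")
>   }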
>
> --
> Best,
> Maciej
>
>
