Since sessions are built per key, they hold groups of elements (per key)
that are close enough together in time. They will, however, treat the
closeness transitively...
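
To make the transitivity concrete, here is a minimal sketch in plain Python (not Flink API; `sessionize` and the `gap` parameter are hypothetical names used only for illustration). It assumes a session for one key is extended as long as the next element is at most `gap` away from the previous one:

```python
def sessionize(timestamps, gap):
    """Group one key's timestamps into sessions.

    Assumption for illustration: a new session starts whenever the
    distance to the previous element exceeds `gap`.
    """
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            # Close enough to the previous element: same session.
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


example_sessions = sessionize([0, 4, 8, 12, 30], gap=5)
print(example_sessions)
```

With these inputs, 0 and 12 end up in the same session even though they are 12 apart, because each neighbouring pair is within the gap. A join over such a session would therefore pair elements that are further apart than the gap itself.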

On Tue, Nov 24, 2015 at 11:33 AM, Matthias J. Sax <mj...@apache.org> wrote:

> Stephan is right. A tumbling window does not help. For example, the last
> tuple of window n and the first tuple of window n+1 are "close" to each
> other and should be joined.
>
> From a SQL-like point of view this is a very common case expressed as:
>
> SELECT * FROM s1,s2 WHERE s1.key = s2.key AND |s1.ts - s2.ts| < window-size
>
> I would not expect to get any duplicates here.
>
> Basically, the window should move by one tuple (for each stream) and
> join with all tuples from the other stream that are within the time
> range (window size), where the ts of this new tuple defines the boundaries
> of the window (i.e., there are no "fixed" window boundaries as defined by
> a time-slide).
>
> Not sure how a "session window" can help here... I guess the most
> generic window API allows defining a slide of one tuple and a window
> size of X seconds. But I don't know how duplicates could be avoided...
>
> -Matthias
>
> On 11/24/2015 11:04 AM, Stephan Ewen wrote:
> > I understand Matthias' point. You want to join elements that occur
> > within a time range of each other.
> >
> > In a tumbling window, you have strict boundaries. If a pair of elements
> > arrives such that one element is before the boundary and one after,
> > they will not be joined. Hence the sliding windows.
> >
> > What may be a solution here is a "session window" join...
> >
> > On Tue, Nov 24, 2015 at 10:33 AM, Aljoscha Krettek <aljos...@apache.org>
> > wrote:
> >
> >> Hi,
> >> I’m not sure this is a problem. If a user specifies sliding windows then
> >> one element can (and will) end up in several windows. If these are joined
> >> then there will be multiple results. If the user does not want multiple
> >> windows then tumbling windows should be used.
> >>
> >> IMHO, this is quite straightforward. But let’s see what others have to say.
> >>
> >> Cheers,
> >> Aljoscha
> >>> On 23 Nov 2015, at 20:36, Matthias J. Sax <mj...@apache.org> wrote:
> >>>
> >>> Hi,
> >>>
> >>> it seems that a join on data streams with overlapping sliding
> >>> windows produces duplicates in the output. The default implementation
> >>> internally just uses two nested loops over both windows to compute the
> >>> result.
> >>>
> >>> How can duplicates be avoided? Is there any way to do this right now?
> >>> If not, should we add this?
> >>>
> >>>
> >>> -Matthias
> >>>
> >>
> >>
> >
>
>
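
The three behaviours discussed in the quoted messages can be compared side by side with a small sketch in plain Python. This is not Flink code: the function names, the `(key, ts)` tuple layout, the integer timestamps, and the window alignment to multiples of `slide` are all assumptions made for illustration. It contrasts the time-bounded join from the SQL-like query (`|s1.ts - s2.ts| < window-size`) with a nested-loop join evaluated once per sliding window, and with the tumbling case where `slide == size`:

```python
from collections import Counter


def interval_join(s1, s2, size):
    """Pair tuples with equal key whose timestamps differ by less than `size`."""
    return [(a, b) for a in s1 for b in s2
            if a[0] == b[0] and abs(a[1] - b[1]) < size]


def sliding_window_join(s1, s2, size, slide):
    """Nested-loop join evaluated once per window [start, start + size).

    Assumes integer timestamps and window starts at multiples of `slide`.
    """
    all_ts = [ts for _, ts in s1 + s2]
    # Smallest multiple of `slide` whose window still covers the first element.
    start = ((min(all_ts) - size) // slide + 1) * slide
    out = []
    while start <= max(all_ts):
        w1 = [a for a in s1 if start <= a[1] < start + size]
        w2 = [b for b in s2 if start <= b[1] < start + size]
        out.extend((a, b) for a in w1 for b in w2 if a[0] == b[0])
        start += slide
    return out


s1 = [("k", 2), ("k", 7)]
s2 = [("k", 3), ("k", 9)]

bounded = interval_join(s1, s2, size=4)
sliding = sliding_window_join(s1, s2, size=4, slide=2)
tumbling = sliding_window_join(s1, s2, size=4, slide=4)

print(bounded)           # each close pair once
print(Counter(sliding))  # the (2, 3) pair shows up in two overlapping windows
print(tumbling)          # the (7, 9) pair straddles a boundary and is lost
```

On this sample data the time-bounded join emits both close pairs exactly once, the sliding-window join emits the pair at timestamps 2 and 3 twice because two overlapping windows contain it, and the tumbling join misses the pair at 7 and 9 because the two elements fall on different sides of the boundary at 8.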
