Re: Flink Runtime Exception

2015-06-19 Thread Asterios Katsifodimos
Hi Andra,

I would try increasing the memory per task manager, i.e. on a machine with
8 CPUs and 16GBs of memory, instead of spawning 8 TMs with 2GB each, I
would try to spawn 2 TMs of 8GBs each.
This might help with the spilling problem (in case that the CPU is not your
bottleneck, this might even speed up the computations by avoiding spilling)
and get you unstuck.

Cheers,
Asterios


On Fri, Jun 19, 2015 at 4:16 PM, Ufuk Celebi u...@apache.org wrote:

 On 19 Jun 2015, at 14:53, Andra Lungu lungu.an...@gmail.com wrote:

  Another problem that I encountered during the same set of
 experiments(sorry
  if I am asking too many questions, I am eager to get things fixed):
  - for the same configuration, a piece of code runs perfectly on 10GB of
  input, then for 38GB it runs forever (no deadlock).
 
  I believe that may occur because Flink spills information to disk every
  time it runs out of memory... Is this fixable by increasing the number of
  buffers?

 If you are referring to the number of network buffers configuration key,
 that should be unrelated. If this really is the issue, you can increase the
 heap size for the task managers.

 Have you confirmed your suspicion as Till suggested via iotop? :)

 – Ufuk


Re: Rework of the window-join semantics

2015-04-16 Thread Asterios Katsifodimos
As far as I see in [1], Peter's/Gyula's suggestion is what Infosphere
Streams does: symmetric hash join.

From [1]:
When a tuple is received on an input port, it is inserted into the window
corresponding to the input port, which causes the window to trigger. As
part of the trigger processing, the tuple is compared against all tuples
inside the window of the opposing input port. If the tuples match, then an
output tuple will be produced for each match. If at least one output was
generated, a window punctuation will be generated after all the outputs.

Cheers,
Asterios

[1]
http://www-01.ibm.com/support/knowledgecenter/#!/SSCRJU_3.2.1/com.ibm.swg.im.infosphere.streams.spl-standard-toolkit-reference.doc/doc/join.html



On Thu, Apr 9, 2015 at 1:30 PM, Matthias J. Sax 
mj...@informatik.hu-berlin.de wrote:

 Hi Paris,

 thanks for the pointer to the Naiad paper. That is quite interesting.

 The paper I mentioned [1], does not describe the semantics in detail; it
 is more about the implementation for the stream-joins. However, it uses
 the same semantics (from my understanding) as proposed by Gyula.

 -Matthias

 [1] Kang, Naughton, Viglas. Evaluationg Window Joins over Unbounded
 Streams. VLDB 2002.



 On 04/07/2015 12:38 PM, Paris Carbone wrote:
  Hello Matthias,
 
  Sure, ordering guarantees are indeed a tricky thing, I recall having
 that discussion back in TU Berlin. Bear in mind thought that DataStream,
 our abstract data type, represents a *partitioned* unbounded sequence of
 events. There are no *global* ordering guarantees made whatsoever in that
 model across partitions. If you see it more generally there are many “race
 conditions” in a distributed execution graph of vertices that process
 multiple inputs asynchronously, especially when you add joins and
 iterations into the mix (how do you deal with reprocessing “old” tuples
 that iterate in the graph). Btw have you checked the Naiad paper [1]?
 Stephan cited a while ago and it is quite relevant to that discussion.
 
  Also, can you cite the paper with the joining semantics you are
 referring to? That would be of good help I think.
 
  Paris
 
  [1] https://users.soe.ucsc.edu/~abadi/Papers/naiad_final.pdf
 
  https://users.soe.ucsc.edu/~abadi/Papers/naiad_final.pdf
 
  https://users.soe.ucsc.edu/~abadi/Papers/naiad_final.pdf
  On 07 Apr 2015, at 11:50, Matthias J. Sax mj...@informatik.hu-berlin.de
 mailto:mj...@informatik.hu-berlin.de wrote:
 
  Hi @all,
 
  please keep me in the loop for this work. I am highly interested and I
  want to help on it.
 
  My initial thoughts are as follows:
 
  1) Currently, system timestamps are used and the suggested approach can
  be seen as state-of-the-art (there is actually a research paper using
  the exact same join semantic). Of course, the current approach is
  inherently non-deterministic. The advantage is, that there is no
  overhead in keeping track of the order of records and the latency should
  be very low. (Additionally, state-recovery is simplified. Because, the
  processing in inherently non-deterministic, recovery can be done with
  relaxed guarantees).
 
   2) The user should be able to switch on deterministic processing,
  ie, records are timestamped (either externally when generated, or
  timestamped at the sources). Because deterministic processing adds some
  overhead, the user should decide for it actively.
  In this case, the order must be preserved in each re-distribution step
  (merging is sufficient, if order is preserved within each incoming
  channel). Furthermore, deterministic processing can be achieved by sound
  window semantics (and there is a bunch of them). Even for
  single-stream-windows it's a tricky problem; for join-windows it's even
  harder. From my point of view, it is less important which semantics are
  chosen; however, the user must be aware how it works. The most tricky
  part for deterministic processing, is to deal with duplicate timestamps
  (which cannot be avoided). The timestamping for (intermediate) result
  tuples, is also an important question to be answered.
 
 
  -Matthias
 
 
  On 04/07/2015 11:37 AM, Gyula Fóra wrote:
  Hey,
 
  I agree with Kostas, if we define the exact semantics how this works,
 this
  is not more ad-hoc than any other stateful operator with multiple inputs.
  (And I don't think any other system support something similar)
 
  We need to make some design choices that are similar to the issues we had
  for windowing. We need to chose how we want to evaluate the windowing
  policies (global or local) because that affects what kind of policies can
  be parallel, but I can work on these things.
 
  I think this is an amazing feature, so I wouldn't necessarily rush the
  implementation for 0.9 though.
 
  And thanks for helping writing these down.
 
  Gyula
 
  On Tue, Apr 7, 2015 at 11:11 AM, Kostas Tzoumas ktzou...@apache.org
 mailto:ktzou...@apache.org wrote:
 
  Yes, we should write these semantics down. I volunteer to help.
 
  I don't