CPU load: Tested on my 4-core machine. With version 0.5 the CPU load spikes at the beginning of the job and stays relatively high for the whole job, which then finishes gracefully. With version 0.6 it seemingly works well until the hangup. Interestingly enough, even when no more log messages appear, my CPU utilization stays 10-15% higher per core than without running the job.

logs: For both implementations it starts like this:

09/04/2014 17:05:51: Job execution switched to status SCHEDULED
09/04/2014 17:05:51: DataSource (CSV Input (|) /home/mbalassi/git/als-comparison/data/sampledb2b.csv.txt) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: Reduce(Create q as a random matrix) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: PartialSolution (BulkIteration (Bulk Iteration)) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: Join(Sends the columns of q with multiple keys) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: CoGroup (For fixed q calculates optimal p) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: Join(Sends the rows of p with multiple keys)) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: CoGroup (For fixed p calculates optimal q) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: Fake Tail (1/1) switched to SCHEDULED
09/04/2014 17:05:51: Join(Sends the columns of q with multiple keys) (1/1) switched to SCHEDULED
09/04/2014 17:05:51: CoGroup (For fixed q calculates optimal p) (1/1) switched to SCHEDULED
[Omitted quite some healthy messages...]
09/04/2014 17:05:53: Join(Sends the rows of p with multiple keys)) (1/1) switched to READY
09/04/2014 17:05:53: Join(Sends the rows of p with multiple keys)) (1/1) switched to STARTING
09/04/2014 17:05:53: Join(Sends the rows of p with multiple keys)) (1/1) switched to RUNNING
09/04/2014 17:05:53: CoGroup (For fixed p calculates optimal q) (1/1) switched to READY
09/04/2014 17:05:53: Fake Tail (1/1) switched to READY
09/04/2014 17:05:53: CoGroup (For fixed p calculates optimal q) (1/1) switched to STARTING
09/04/2014 17:05:53: Fake Tail (1/1) switched to STARTING
09/04/2014 17:05:54: CoGroup (For fixed p calculates optimal q) (1/1) switched to RUNNING
09/04/2014 17:05:54: Fake Tail (1/1) switched to RUNNING
09/04/2014 17:05:54: Join(Sends the columns of q with multiple keys) (1/1) switched to READY
09/04/2014 17:05:54: Join(Sends the columns of q with multiple keys) (1/1) switched to STARTING
09/04/2014 17:05:54: Join(Sends the columns of q with multiple keys) (1/1) switched to RUNNING
09/04/2014 17:05:54: CoGroup (For fixed q calculates optimal p) (1/1) switched to READY
09/04/2014 17:05:54: CoGroup (For fixed q calculates optimal p) (1/1) switched to STARTING
09/04/2014 17:05:55: CoGroup (For fixed q calculates optimal p) (1/1) switched to RUNNING

Flink stops here, Strato continues:

09/04/2014 17:09:01: DataSource(CSV Input (|)) (1/1) switched to FINISHING
09/04/2014 17:09:02: PartialSolution (BulkIteration (Bulk Iteration)) (1/1) switched to READY
09/04/2014 17:09:02: PartialSolution (BulkIteration (Bulk Iteration)) (1/1) switched to STARTING
09/04/2014 17:09:02: PartialSolution (BulkIteration (Bulk Iteration)) (1/1) switched to RUNNING
09/04/2014 17:09:03: Reduce(Create q as a random matrix) (1/1) switched to FINISHING
09/04/2014 17:09:05: Sync(BulkIteration (Bulk Iteration)) (1/1) switched to READY
09/04/2014 17:09:05: Sync(BulkIteration (Bulk Iteration)) (1/1) switched to STARTING
09/04/2014 17:09:05: Sync(BulkIteration (Bulk Iteration)) (1/1) switched to RUNNING
09/04/2014 17:09:09: Sync(BulkIteration (Bulk Iteration)) (1/1) switched to FINISHING
09/04/2014 17:09:09: DataSink(hu.sztaki.ilab.cumulonimbus.als_comparison.strato.ColumnOutputFormatStrato@7ea742a1) (1/1) switched to READY
09/04/2014 17:09:09: DataSink(hu.sztaki.ilab.cumulonimbus.als_comparison.strato.ColumnOutputFormatStrato@7ea742a1) (1/1) switched to STARTING
[Omitted quite some healthy messages...]
09/04/2014 17:09:10: PartialSolution (BulkIteration (Bulk Iteration)) (1/1) switched to FINISHED
09/04/2014 17:09:10: CoGroup(For fixed p calculates optimal q) (1/1) switched to FINISHED
09/04/2014 17:09:10: DataSink(hu.sztaki.ilab.cumulonimbus.als_comparison.strato.ColumnOutputFormatStrato@5dcde3f3) (1/1) switched to RUNNING
09/04/2014 17:09:10: CoGroup(For fixed q calculates optimal p) (1/1) switched to FINISHING
09/04/2014 17:09:10: DataSink(hu.sztaki.ilab.cumulonimbus.als_comparison.strato.ColumnOutputFormatStrato@5dcde3f3) (1/1) switched to FINISHING
09/04/2014 17:09:11: Join(Sends the columns of q with multiple keys) (1/1) switched to FINISHED
09/04/2014 17:09:11: CoGroup(For fixed q calculates optimal p) (1/1) switched to FINISHED
09/04/2014 17:09:11: DataSink(hu.sztaki.ilab.cumulonimbus.als_comparison.strato.ColumnOutputFormatStrato@5dcde3f3) (1/1) switched to FINISHED
09/04/2014 17:09:11: Job execution switched to status FINISHED

On Thu, Sep 4, 2014 at 3:33 PM, Ufuk Celebi <[email protected]> wrote:

> Hey Marton,
>
> thanks for reporting the issue and the link to the repo to reproduce the
> problem. I will look into it later today.
>
> If you like, you could provide some more information in the meantime:
>
> - How is the CPU load?
> - What are TM logs saying?
> - Can you give a stack trace? Where is it hanging?
>
> On Thu, Sep 4, 2014 at 3:14 PM, Márton Balassi <[email protected]>
> wrote:
>
> > Hey,
> >
> > We managed to produce a code, for which the legacy Stratosphere 0.5 release
> > implementation works nicely, however the updated Flink 0.6 release
> > implementation hangs up for slightly larger inputs.
> >
> > Please check out the issue here:
> > https://github.com/mbalassi/als-comparison
> >
> > Any suggestions are welcome.
> >
> > Cheers,
> >
> > Marton
