Hi Erick,

So all logs from node3 are entirely missing from the driver view?
That sounds wrong to me.

In Twill, a container always tries to flush the last bit of logs when
it shuts down. However, if that flush fails, there is no retry, since
we don't want to hang the stopping of the application.

However, in your case it seems you get no logs at all from one
particular node. Have you tried adding a couple of seconds of sleep in
your Runnable.run() before it returns, to see whether this is a
shutdown issue or something else?
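
Something along these lines is what I have in mind. This is only a
rough sketch using a plain java.lang.Runnable so that it stands on its
own; the class name, the delay parameter, and the isCompleted() helper
are made up for illustration. In your application the sleep would go at
the end of your own runnable's run() method.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the application's runnable; not a Twill class.
public class DelayedExitRunnable implements Runnable {
    private final long delayMillis;
    private volatile boolean completed = false;

    public DelayedExitRunnable(long delayMillis) {
        this.delayMillis = delayMillis;
    }

    @Override
    public void run() {
        // ... the real work and logging would happen here ...

        // Diagnostic only: pause before returning so that any pending
        // log messages have a chance to be sent before shutdown begins.
        try {
            TimeUnit.MILLISECONDS.sleep(delayMillis);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and fall through to return.
            Thread.currentThread().interrupt();
        }
        completed = true;
    }

    public boolean isCompleted() {
        return completed;
    }
}
```

If the node3 logs show up in the driver with a delay of a few seconds
(e.g. new DelayedExitRunnable(3000)), that points at the shutdown
flush; if they are still missing, the cause is probably elsewhere.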

Terence

On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
<[email protected]> wrote:
> Good evening,
>
> I'm not quite sure if this is a bug in my code or Twill. I've been working
> away on TWILL-78, but I'm running into some basic issues with me not
> seeming to receive all the log messages from my containers. You can find a
> copy of my toy application here:
> https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
> twill with 3 containers. Occasionally I get everything I expect, but
> sometimes I only seem to get a subset of the log messages from the worker
> nodes. Here's an example. The files in /var/log/hadoop-yarn/container/... on
> the nodes have:
>
> node1:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:09,334 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:09,352 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:11,187 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:11,188 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:11,830 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:11,831 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681206035-0
> ----
>
> node2:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:00,133 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:00,158 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:02,161 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:02,161 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:02,979 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:02,979 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681196232-0
> ----
>
> node3:
> ...
> Launching main: public static void
> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
> throws java.lang.Exception []
> 2014-05-10 00:20:01,191 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
> entering barrier
> 2014-05-10 00:20:01,197 - WARN
> [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There are
> no ConnectionStateListeners registered.
> 2014-05-10 00:20:02,768 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
> barrier
> 2014-05-10 00:20:02,769 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] - woken up
> 2014-05-10 00:20:03,587 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out of
> barrier
> 2014-05-10 00:20:03,588 - ERROR
> [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
> Main class completed.
> Launcher completed
> Cleanup directory tmp/twill.launcher-1399681197698-0
> ----
>
> But the driver script in this case only shows the output from 2 nodes:
>
> ----
> 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] - Unable
> to load native-hadoop library for your platform... using builtin-java
> classes where applicable
> 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
> before getting completion
> 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> entering barrier
> 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
> entering barrier
> 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
> [ConnectionStateManager-0]
> ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> There are no ConnectionStateListeners registered.
> 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
> barrier
> 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> woken up
> 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
> of barrier
> 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
> [ConnectionStateManager-0]
> ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
> There are no ConnectionStateListeners registered.
> 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) - in
> barrier
> 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
> woken up
> 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) - out
> of barrier
> 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
> [ServiceDelegate]
> GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) - done
> 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
> error: Connection reset by peer
> 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
> after shutting down
> 2014-05-09 17:20:13,312 - ERROR [Thread-3:o.l.g.t.GraphlabApplication$1@130]
> - shutting down
> ---
>
> So is this something I'm doing wrong? Or is Twill or Kafka somehow shutting
> down before all the messages have been sent?
>
> Thanks for any help,
> -Erick
