Hi Erick,

That sounds like bug to me that the Twill log appender is not doing
proper flush when the Runnable terminates. I've created a ticket for
the investigation and fix.

https://issues.apache.org/jira/browse/TWILL-80

Terence

On Thu, May 15, 2014 at 3:58 PM, Erick Tryzelaar
<[email protected]> wrote:
> Hello Terence,
>
> Thanks for getting back to me. So it does seem to be some timing issue.
> I've updated my gist (https://gist.github.com/erickt/7b16d695b64384015b41)
> to add some extra controls and a driver script. If I do a sleep of `0`, it
> will sometimes work, and sometimes exit before all the `"done"` messages
> have been sent. However, If I do a sleep of `10`, it seems like I get all
> the messages.
>
>
> On Mon, May 12, 2014 at 10:39 AM, Terence Yim <[email protected]> wrote:
>
>> Hi Erick,
>>
>> So all logs from node3 are entirely missing from the driver view?
>> That's sounds something wrong to me.
>>
>> In Twill, it'll always try to flush the last bit of logs when a
>> container shutdown, however if that flush failed, there would be no
>> retry, since we don't want to hang the stopping of application.
>>
>> However in your case, it seems like you get no log from a particular
>> node. Have you try to add couple seconds sleep in your Runnable.run()
>> before return and see if it is due to shutdown issue or something
>> else?
>>
>> Terence
>>
>> On Fri, May 9, 2014 at 5:44 PM, Erick Tryzelaar
>> <[email protected]> wrote:
>> > Good evening,
>> >
>> > I'm not quite sure if this is a bug in my code or Twill. I've been
>> working
>> > away on TWILL-78, but I'm running into some basic issues with me not
>> > seeming to receive all the log messages from my containers. You can find
>> a
>> > copy of my toy application here:
>> > https://gist.github.com/erickt/7b16d695b64384015b41. I've been testing
>> > twill with 3 containers. Occasionally I get everything I expect, but
>> > sometimes I only seem to get a subset of the log message on the worker
>> > nodes. Here's an example. While in /var/log/hadoop-yarn/container/... on
>> > the nodes have:
>> >
>> > node1:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:09,334 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:09,352 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:11,187 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:11,188 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:11,830 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:11,831 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681206035-0
>> > ----
>> >
>> > node2:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:00,133 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:00,158 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:02,161 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:02,161 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:02,979 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:02,979 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681196232-0
>> > ----
>> >
>> > node3:
>> > ...
>> > Launching main: public static void
>> >
>> org.apache.twill.internal.container.TwillContainerMain.main(java.lang.String[])
>> > throws java.lang.Exception []
>> > 2014-05-10 00:20:01,191 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@57] -
>> > entering barrier
>> > 2014-05-10 00:20:01,197 - WARN
>> > [ConnectionStateManager-0:o.a.c.f.s.ConnectionStateManager@212] - There
>> are
>> > no ConnectionStateListeners registered.
>> > 2014-05-10 00:20:02,768 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@66] - in
>> > barrier
>> > 2014-05-10 00:20:02,769 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@76] -
>> woken up
>> > 2014-05-10 00:20:03,587 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@83] - out
>> of
>> > barrier
>> > 2014-05-10 00:20:03,588 - ERROR
>> > [ServiceDelegate:o.l.g.t.GraphlabApplication$GraphlabRunnable@93] - done
>> > Main class completed.
>> > Launcher completed
>> > Cleanup directory tmp/twill.launcher-1399681197698-0
>> > ----
>> >
>> > But the driver script in this case only shows the output from 2 nodes:
>> >
>> > ----
>> > 2014-05-09 17:19:15,137 - WARN  [main:o.a.h.u.NativeCodeLoader@62] -
>> Unable
>> > to load native-hadoop library for your platform... using builtin-java
>> > classes where applicable
>> > 2014-05-09 17:19:16,297 - ERROR [main:o.l.g.t.GraphlabApplication@136] -
>> > before getting completion
>> > 2014-05-10T00:20:00,133Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
>> > entering barrier
>> > 2014-05-10T00:20:09,334Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:57) -
>> > entering barrier
>> > 2014-05-10T00:20:09,352Z WARN  o.a.c.f.s.ConnectionStateManager [node2]
>> > [ConnectionStateManager-0]
>> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
>> > There are no ConnectionStateListeners registered.
>> > 2014-05-10T00:20:11,187Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
>> in
>> > barrier
>> > 2014-05-10T00:20:11,188Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
>> > woken up
>> > 2014-05-10T00:20:11,830Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
>> out
>> > of barrier
>> > 2014-05-10T00:20:11,831Z ERROR o.l.g.t.GraphlabApplication [node2]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
>> done
>> > 2014-05-10T00:20:00,158Z WARN  o.a.c.f.s.ConnectionStateManager [node1]
>> > [ConnectionStateManager-0]
>> > ConnectionStateManager:processEvents(ConnectionStateManager.java:212) -
>> > There are no ConnectionStateListeners registered.
>> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:66) -
>> in
>> > barrier
>> > 2014-05-10T00:20:02,161Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:76) -
>> > woken up
>> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:83) -
>> out
>> > of barrier
>> > 2014-05-10T00:20:02,979Z ERROR o.l.g.t.GraphlabApplication [node1]
>> > [ServiceDelegate]
>> > GraphlabApplication$GraphlabRunnable:run(GraphlabApplication.java:93) -
>> done
>> > 14/05/09 17:20:13 INFO consumer.SimpleConsumer: Reconnect due to socket
>> > error: Connection reset by peer
>> > 2014-05-09 17:20:13,310 - ERROR [main:o.l.g.t.GraphlabApplication@144] -
>> > after shutting down
>> > 2014-05-09 17:20:13,312 - ERROR
>> [Thread-3:o.l.g.t.GraphlabApplication$1@130]
>> > - shutting down
>> > ---
>> >
>> > So is this something I'm doing wrong? Or is Twill or Kafka somehow
>> shutting
>> > down before all the messages have been sent?
>> >
>> > Thanks for any help,
>> > -Erick
>>

Reply via email to