Sounds great! And thanks again, I'm excited about the job, although I very
much miss working with Jakob at LinkedIn and Arun at Hortonworks. Yes, even
if we hit YARN problems that I can't figure out after some looking, I know
who to ask among the YARN folks now, so we will track down all the problems
for sure!

The other issue is that the main Hadoop stack guy here at Etsy is quitting
to work at Apple next week, so instead of a month to learn about the
system, I have a week to take over his job! So I will be super busy for the
next few weeks just getting myself up to speed in case their workflows
start crashing or other non-Hadoop plumbing starts to break.

Good news: they are very curious about Giraph, Tajo, Tez, and Stinger. They
have a dependency issue that is temporarily preventing them from updating
to good versions of YARN like hadoop-2.0.3-alpha, but that will be one of
the first things I try to fix as we go along. The dependency problem is
owned by another team but I am hoping I can come to an agreement with them
soon!

OK, have fun at the conference, and don't worry...as things calm down for
me I will get more and more involved code-wise. For now, I want to review
patches and provide advice on YARN or anything else I can on the message
boards and mailing list.

Thanks again for including me, it's been really fun so far! One useful
thing I will try to do in the coming weeks is look at the YARN code in
detail and put up some JIRAs where I see problems, so that everyone will
know about them.

Eli



On Fri, Apr 5, 2013 at 3:16 PM, Hyunsik Choi <[email protected]> wrote:

> Hi Eli,
>
> Congratulations on your new job! I learned about your change of
> position from LinkedIn.
>
> Anyway, I expect there are many issues here that will intrigue you. In
> particular, there are some interesting problems around the App master.
> The App master has to handle a DAG of subqueries, and subqueries that
> are independent of each other can be executed at the same time if the
> cluster has available resources. In addition, we will need a scheduler
> for the tasks of multiple running subqueries. Since you are very
> experienced with YARN, you may be interested in this problem. Tomorrow
> I'll be attending a conference for a week; after the conference, I would
> like to discuss this problem with you and the other members.
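>
> To make the idea concrete, here is a rough sketch in plain Java of the
> kind of readiness check such a scheduler needs. The SubQueryDag class
> and its method names are just illustrations made up for this mail, not
> actual Tajo code: the point is only to pick the subqueries whose
> upstream subqueries have all finished, so they can all be launched
> together while the cluster still has resources.
>
> {noformat}
> // Hypothetical sketch, not actual Tajo classes: tracks dependencies
> // between subqueries and reports which ones are ready to launch.
> import java.util.*;
>
> public class SubQueryDag {
>   // subquery id -> ids of the subqueries it depends on
>   private final Map<String, Set<String>> deps =
>       new HashMap<String, Set<String>>();
>   private final Set<String> running = new HashSet<String>();
>   private final Set<String> completed = new HashSet<String>();
>
>   public void addSubQuery(String id, String... dependsOn) {
>     deps.put(id, new HashSet<String>(Arrays.asList(dependsOn)));
>   }
>
>   public void markRunning(String id)   { running.add(id); }
>
>   public void markCompleted(String id) {
>     running.remove(id);
>     completed.add(id);
>   }
>
>   // Subqueries whose dependencies have all completed, and which are not
>   // already running or done, do not depend on each other and can be
>   // launched concurrently if containers are available.
>   public List<String> runnableSubQueries() {
>     List<String> ready = new ArrayList<String>();
>     for (Map.Entry<String, Set<String>> e : deps.entrySet()) {
>       String id = e.getKey();
>       if (!running.contains(id) && !completed.contains(id)
>           && completed.containsAll(e.getValue())) {
>         ready.add(id);
>       }
>     }
>     return ready;
>   }
> }
> {noformat}
>
> Of course the real scheduler would also have to weigh data locality and
> the resources each subquery needs, but the DAG bookkeeping itself is
> roughly this shape.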
>
> Best regards,
> Hyunsik
>
>
>
>
> On Fri, Apr 5, 2013 at 2:55 PM, Eli Reisman <[email protected]
> >wrote:
>
> > Hey guys, fascinating discussion, and thanks for the explanation,
> > Hyunsik. I am training at a new job right now and swamped, but I'm
> > excited to get a breather and take a look at this interesting problem.
> >
> > As for the code: it's really clean and great, nice work everyone! I'm
> > really impressed by the clarity; it's easy to read! There are little
> > things to fix in any codebase, and we'll get it done!
> >
> > So this is interesting to me, especially since the App master is
> > launching different task runners for the subqueries and managing their
> > lifecycles. That definitely adds a layer of interesting complexity to
> > the App master's job. I will take a deeper look and see if I notice
> > anything relating to TAJO-26 that might be helpful, when I get the
> > chance.
> >
> >
> >
> >
> > On Wed, Apr 3, 2013 at 9:54 PM, Tanujit Ghosh (JIRA) <[email protected]
> > >wrote:
> >
> > >
> > >     [ https://issues.apache.org/jira/browse/TAJO-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621648#comment-13621648 ]
> > >
> > > Tanujit Ghosh commented on TAJO-15:
> > > -----------------------------------
> > >
> > > Hi,
> > >
> > > I have raised the TAJO-26 issue. The environment I'm on is Fedora 17
> > > (Linux kernel 3.8.4), Sun Java 1.6.0_41.
> > >
> > > Yes, I'm running mvn verify from the shell.
> > >
> > > From what I see in the log, there is an error about the data file not
> > > being found; maybe I have missed some setting which needs to be done.
> > >
> > >
> > > --
> > > Regards,
> > > Tanujit
> > >
> > >
> > > > The Integration test is getting hanged on Mac OS X.
> > > > ---------------------------------------------------
> > > >
> > > >                 Key: TAJO-15
> > > >                 URL: https://issues.apache.org/jira/browse/TAJO-15
> > > >             Project: Tajo
> > > >          Issue Type: Bug
> > > >         Environment: OS: Mac 10.8.3
> > > > Both JVMs:
> > > > {noformat}
> > > > java version "1.6.0_43"
> > > > Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
> > > > Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
> > > > {noformat}
> > > > {noformat}
> > > > java version "1.7.0_10"
> > > > Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
> > > > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
> > > > {noformat}
> > > >            Reporter: Hyunsik Choi
> > > >            Assignee: Hyunsik Choi
> > > >             Fix For: 0.2-incubating
> > > >
> > > >         Attachments: TAJO-15.patch
> > > >
> > > >
> > > > The integration test hangs on Mac OS X. Below are the unit test logs
> > > > reported by Ashish.
> > > > http://markmail.org/message/lknrqecc27v4thbb
> > > > {noformat}
> > > > 2013-03-28 16:42:39,039 INFO  capacity.CapacityScheduler
> > > > (CapacityScheduler.java:completedContainer(776)) - Application
> > > > appattempt_1364469093530_0002_000001 released container
> > > > container_1364469093530_0002_01_000007 on node: host: a.b.c.d:60941
> > > > #containers=0 available=4096 used=0 with event: FINISHED
> > > > 2013-03-28 16:42:39,235 INFO  rmcontainer.RMContainerImpl
> > > > (RMContainerImpl.java:handle(220)) -
> > > container_1364469093530_0002_01_000008
> > > > Container Transitioned from ALLOCATED to ACQUIRED
> > > > 2013-03-28 16:42:39,236 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 1
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(175)) -
> > > > ================================================================
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(177)) - > Container Id:
> > > > container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(178)) - > Node Id:
> > > > a.b.c.d:60945
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(179)) - > Resource
> (Mem):
> > > 3072
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(180)) - > State : NEW
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(181)) - > Priority: 92
> > > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(183)) -
> > > > ================================================================
> > > > 2013-03-28 16:42:39,238 INFO  master.SubQuery
> > > > (SubQuery.java:transition(713)) - SubQuery
> > > > (sq_1364469093530_0002_000001_27) has 1 containers!
> > > > 2013-03-28 16:42:39,238 INFO  master.TaskRunnerLauncherImpl
> > > > (TaskRunnerLauncherImpl.java:launch(393)) - Launching Container with
> > Id:
> > > > container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:39,239 INFO  master.TaskRunnerLauncherImpl
> > > > (TaskRunnerLauncherImpl.java:createContainerLaunchContext(301)) -
> > > Completed
> > > > setting up taskrunner command ${JAVA_HOME}/bin/java -Xmx2000m
> > > > tajo.worker.TaskRunner a.b.c.d 58243 sq_1364469093530_0002_000001_27
> > > > a.b.c.d:60945 container_1364469093530_0002_01_000008
> 1><LOG_DIR>/stdout
> > > > 2><LOG_DIR>/stderr
> > > > 2013-03-28 16:42:39,244 INFO  containermanager.ContainerManagerImpl
> > > > (ContainerManagerImpl.java:startContainer(402)) - Start request for
> > > > container_1364469093530_0002_01_000008 by user xxxxxxx
> > > > 2013-03-28 16:42:39,245 INFO  nodemanager.NMAuditLogger
> > > > (NMAuditLogger.java:logSuccess(89)) - USER=xxxxxxx IP=a.b.c.d
> > > OPERATION=Start
> > > > Container Request TARGET=ContainerManageImpl RESULT=SUCCESS
> > > > APPID=application_1364469093530_0002
> > > > CONTAINERID=container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:39,245 INFO  application.Application
> > > > (ApplicationImpl.java:transition(255)) - Adding
> > > > container_1364469093530_0002_01_000008 to application
> > > > application_1364469093530_0002
> > > > 2013-03-28 16:42:39,246 INFO  container.Container
> > > > (ContainerImpl.java:handle(835)) - Container
> > > > container_1364469093530_0002_01_000008 transitioned from NEW to
> > > LOCALIZING
> > > > 2013-03-28 16:42:39,246 INFO  master.TaskRunnerLauncherImpl
> > > > (TaskRunnerLauncherImpl.java:launch(424)) - PullServer port returned
> by
> > > > ContainerManager for container_1364469093530_0002_01_000008 : 60947
> > > > 2013-03-28 16:42:39,246 INFO  containermanager.AuxServices
> > > > (AuxServices.java:handle(160)) - Got event APPLICATION_INIT for appId
> > > > application_1364469093530_0002
> > > > 2013-03-28 16:42:39,246 INFO  containermanager.AuxServices
> > > > (AuxServices.java:handle(164)) - Got APPLICATION_INIT for service
> > > > tajo.pullserver
> > > > 2013-03-28 16:42:39,246 INFO  master.Query (Query.java:handle(514)) -
> > > > Processing q_1364469093530_0002_000001 of type INIT_COMPLETED
> > > > 2013-03-28 16:42:39,246 INFO  container.Container
> > > > (ContainerImpl.java:handle(835)) - Container
> > > > container_1364469093530_0002_01_000008 transitioned from LOCALIZING
> to
> > > > LOCALIZED
> > > > 2013-03-28 16:42:39,247 INFO  util.RackResolver
> > > > (RackResolver.java:coreResolve(100)) - Resolved L-IDC77TDV7M-M.local
> to
> > > > /default-rack
> > > > 2013-03-28 16:42:39,339 INFO  container.Container
> > > > (ContainerImpl.java:handle(835)) - Container
> > > > container_1364469093530_0002_01_000008 transitioned from LOCALIZED to
> > > > RUNNING
> > > > 2013-03-28 16:42:39,340 INFO  monitor.ContainersMonitorImpl
> > > > (ContainersMonitorImpl.java:isEnabled(168)) -
> ResourceCalculatorPlugin
> > is
> > > > unavailable on this system.
> > > >
> > >
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> > > > is disabled.
> > > > 2013-03-28 16:42:39,535 INFO  nodemanager.DefaultContainerExecutor
> > > > (DefaultContainerExecutor.java:launchContainer(175)) -
> launchContainer:
> > > > [bash,
> > > >
> > >
> >
> /Users/xxxxxxx/opensource/tajo/incubator-tajo/tajo-core/tajo-core-backend/target/tajo.TajoTestingCluster/tajo.TajoTestingCluster-localDir-nm-1_0/usercache/xxxxxxx/appcache/application_1364469093530_0002/container_1364469093530_0002_01_000008/default_container_executor.sh]
> > > > 2013-03-28 16:42:39,903 INFO  nodemanager.NodeStatusUpdaterImpl
> > > > (NodeStatusUpdaterImpl.java:getNodeStatus(265)) - Sending out status
> > for
> > > > container: container_id {, app_attempt_id {, application_id {, id: 2,
> > > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, },
> state:
> > > > C_RUNNING, diagnostics: "", exit_status: -1000,
> > > > 2013-03-28 16:42:39,904 INFO  rmcontainer.RMContainerImpl
> > > > (RMContainerImpl.java:handle(220)) -
> > > container_1364469093530_0002_01_000008
> > > > Container Transitioned from ACQUIRED to RUNNING
> > > > 2013-03-28 16:42:40,020 WARN  nodemanager.DefaultContainerExecutor
> > > > (DefaultContainerExecutor.java:launchContainer(193)) - Exit code from
> > > task
> > > > is : 1
> > > > 2013-03-28 16:42:40,021 INFO  nodemanager.ContainerExecutor
> > > > (ContainerExecutor.java:logOutput(167)) -
> > > > 2013-03-28 16:42:40,021 WARN  launcher.ContainerLaunch
> > > > (ContainerLaunch.java:call(274)) - Container exited with a non-zero
> > exit
> > > > code 1
> > > > 2013-03-28 16:42:40,021 INFO  container.Container
> > > > (ContainerImpl.java:handle(835)) - Container
> > > > container_1364469093530_0002_01_000008 transitioned from RUNNING to
> > > > EXITED_WITH_FAILURE
> > > > 2013-03-28 16:42:40,021 INFO  launcher.ContainerLaunch
> > > > (ContainerLaunch.java:cleanupContainer(300)) - Cleaning up container
> > > > container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:40,040 INFO  nodemanager.DefaultContainerExecutor
> > > > (DefaultContainerExecutor.java:deleteAsUser(273)) - Deleting absolute
> > > path
> > > > :
> > > >
> > >
> >
> /Users/xxxxxxx/opensource/tajo/incubator-tajo/tajo-core/tajo-core-backend/target/tajo.TajoTestingCluster/tajo.TajoTestingCluster-localDir-nm-1_0/usercache/xxxxxxx/appcache/application_1364469093530_0002/container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:40,040 WARN  nodemanager.NMAuditLogger
> > > > (NMAuditLogger.java:logFailure(150)) - USER=xxxxxxx
> OPERATION=Container
> > > > Finished - Failed TARGET=ContainerImpl RESULT=FAILURE
> > > DESCRIPTION=Container
> > > > failed with state: EXITED_WITH_FAILURE
> > > APPID=application_1364469093530_0002
> > > > CONTAINERID=container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:40,041 INFO  container.Container
> > > > (ContainerImpl.java:handle(835)) - Container
> > > > container_1364469093530_0002_01_000008 transitioned from
> > > > EXITED_WITH_FAILURE to DONE
> > > > 2013-03-28 16:42:40,041 INFO  application.Application
> > > > (ApplicationImpl.java:transition(298)) - Removing
> > > > container_1364469093530_0002_01_000008 from application
> > > > application_1364469093530_0002
> > > > 2013-03-28 16:42:40,041 INFO  monitor.ContainersMonitorImpl
> > > > (ContainersMonitorImpl.java:isEnabled(168)) -
> ResourceCalculatorPlugin
> > is
> > > > unavailable on this system.
> > > >
> > >
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> > > > is disabled.
> > > > 2013-03-28 16:42:40,241 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:40,241 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:40,905 INFO  nodemanager.NodeStatusUpdaterImpl
> > > > (NodeStatusUpdaterImpl.java:getNodeStatus(265)) - Sending out status
> > for
> > > > container: container_id {, app_attempt_id {, application_id {, id: 2,
> > > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, },
> state:
> > > > C_COMPLETE, diagnostics: "\n", exit_status: 1,
> > > > 2013-03-28 16:42:40,905 INFO  nodemanager.NodeStatusUpdaterImpl
> > > > (NodeStatusUpdaterImpl.java:getNodeStatus(271)) - Removed completed
> > > > container container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:40,906 INFO  rmcontainer.RMContainerImpl
> > > > (RMContainerImpl.java:handle(220)) -
> > > container_1364469093530_0002_01_000008
> > > > Container Transitioned from RUNNING to COMPLETED
> > > > 2013-03-28 16:42:40,906 INFO  fica.FiCaSchedulerApp
> > > > (FiCaSchedulerApp.java:containerCompleted(219)) - Completed
> container:
> > > > container_1364469093530_0002_01_000008 in state: COMPLETED
> > event:FINISHED
> > > > 2013-03-28 16:42:40,906 INFO  resourcemanager.RMAuditLogger
> > > > (RMAuditLogger.java:logSuccess(98)) - USER=xxxxxxx OPERATION=AM
> > Released
> > > > Container TARGET=SchedulerApp RESULT=SUCCESS
> > > > APPID=application_1364469093530_0002
> > > > CONTAINERID=container_1364469093530_0002_01_000008
> > > > 2013-03-28 16:42:40,906 INFO  fica.FiCaSchedulerNode
> > > > (FiCaSchedulerNode.java:releaseContainer(150)) - Released container
> > > > container_1364469093530_0002_01_000008 of capacity <memory:3072,
> > > vCores:1>
> > > > on host a.b.c.d:60945, which currently has 0 containers, <memory:0,
> > > > vCores:0> used and <memory:4096, vCores:16> available, release
> > > > resources=true
> > > > 2013-03-28 16:42:40,906 INFO  capacity.LeafQueue
> > > > (LeafQueue.java:releaseResource(1441)) - default used=<memory:0,
> > > vCores:0>
> > > > numContainers=0 user=xxxxxxx user-resources=<memory:0, vCores:0>
> > > > 2013-03-28 16:42:40,907 INFO  capacity.LeafQueue
> > > > (LeafQueue.java:completedContainer(1385)) - completedContainer
> > > > container=Container: [ContainerId:
> > > container_1364469093530_0002_01_000008,
> > > > NodeId: a.b.c.d:60945, NodeHttpAddress: a.b.c.d:60948, Resource:
> > > > <memory:3072, vCores:1>, Priority: 92, State: NEW, Token: null,
> Status:
> > > > container_id {, app_attempt_id {, application_id {, id: 2,
> > > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, },
> state:
> > > > C_COMPLETE, diagnostics: "\n", exit_status: 1, ]
> resource=<memory:3072,
> > > > vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0,
> > > > usedResources=<memory:0, vCores:0>usedCapacity=0.0,
> > > > absoluteUsedCapacity=0.0, numApps=1, numContainers=0 usedCapacity=0.0
> > > > absoluteUsedCapacity=0.0 used=<memory:0, vCores:0>
> > cluster=<memory:12288,
> > > > vCores:48>
> > > > 2013-03-28 16:42:40,907 INFO  capacity.ParentQueue
> > > > (ParentQueue.java:completedContainer(696)) - completedContainer
> > > queue=root
> > > > usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0>
> > > > cluster=<memory:12288, vCores:48>
> > > > 2013-03-28 16:42:40,907 INFO  capacity.CapacityScheduler
> > > > (CapacityScheduler.java:completedContainer(776)) - Application
> > > > appattempt_1364469093530_0002_000001 released container
> > > > container_1364469093530_0002_01_000008 on node: host: a.b.c.d:60945
> > > > #containers=0 available=4096 used=0 with event: FINISHED
> > > > 2013-03-28 16:42:41,242 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:41,242 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:42,245 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:42,246 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:43,248 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:43,249 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:44,251 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:44,252 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:45,255 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:45,256 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:46,259 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:46,260 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:47,263 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > 2013-03-28 16:42:47,264 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > > Containers: 0
> > > > 2013-03-28 16:42:48,267 INFO  rm.RMContainerAllocator
> > > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> > Resource:
> > > > <memory:6144, vCores:-1>
> > > > {noformat}
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > If you think it was sent incorrectly, please contact your JIRA
> > > administrators
> > > For more information on JIRA, see:
> > http://www.atlassian.com/software/jira
> > >
> >
>
