Hi Eli,

Congratulations on your new job! I learned about your change of
occupation from LinkedIn.

Anyway, I expect that many issues here will intrigue you. In particular,
there are some interesting problems around the App master. The App master
has to handle a DAG of subqueries. Subqueries that are independent of each
other can be executed at the same time if the cluster has available
resources. In addition, we will need a scheduler for the tasks of multiple
running subqueries. Since you are very experienced with YARN, you may be
interested in this problem. Starting tomorrow, I'll be attending a
conference for a week. After the conference, I would like to discuss this
problem with you and the other members.
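
As a rough illustration of the scheduling problem (not Tajo's actual code),
running independent subqueries together when resources allow could be
sketched in Python as below. The function name `schedule_subqueries`, the
`slots` parameter (how many subqueries the cluster can run at once), and the
subquery ids are all hypothetical, and the sketch ignores per-task
scheduling entirely.

```python
def schedule_subqueries(deps, slots):
    """Group subqueries of a DAG into waves that can run concurrently.

    deps  -- maps each subquery id to the set of subquery ids it depends on
    slots -- assumed number of subqueries the cluster can run at once
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")

    remaining = {sq: set(parents) for sq, parents in deps.items()}
    done = set()
    waves = []

    while remaining:
        # A subquery is ready once every subquery it depends on has finished.
        ready = sorted(sq for sq, parents in remaining.items() if parents <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency in the subquery DAG")

        # Run as many ready subqueries as the available resources allow.
        wave = ready[:slots]
        waves.append(wave)
        for sq in wave:
            del remaining[sq]
        done.update(wave)

    return waves


if __name__ == "__main__":
    dag = {
        "sq1": set(),
        "sq2": set(),
        "sq3": {"sq1", "sq2"},
        "sq4": {"sq3"},
    }
    for i, wave in enumerate(schedule_subqueries(dag, slots=2)):
        print(i, wave)
```

Here sq1 and sq2 do not depend on each other, so with two slots they share
the first wave; with a single slot they would run one after the other.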

Best regards,
Hyunsik




On Fri, Apr 5, 2013 at 2:55 PM, Eli Reisman <[email protected]> wrote:

> Hey guys, fascinating discussion, and thanks for the explanation, Hyunsik. I
> am training on a new job right now and swamped, but I'm excited to get a
> breather and take a look at this interesting problem.
>
> As for the code: it's really clean and great, nice work everyone! I'm really
> impressed at the clarity, it's easy to read! There are little things to fix
> in any codebase, we'll get it done!
>
> So this is interesting to me especially since the App master is launching
> different task runners for the subqueries and managing their lifecycles.
> That definitely adds a layer of interesting complexity to the app master's
> job. I will take a deeper look and see if I notice anything relating to
> TAJO-26 that might be helpful, when I get the chance.
>
>
>
>
> On Wed, Apr 3, 2013 at 9:54 PM, Tanujit Ghosh (JIRA) <[email protected]
> >wrote:
>
> >
> >     [
> >
> https://issues.apache.org/jira/browse/TAJO-15?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621648#comment-13621648
> ]
> >
> > Tanujit Ghosh commented on TAJO-15:
> > -----------------------------------
> >
> > Hi,
> >
> > I have raised TAJO-26 issue, the environment i'm on is fedora 17 (linux
> > kernel 3.8.4), sun java 1.6.0_41.
> >
> > Yes i'm running mvn verify from the shell.
> >
> > From what i see in the log, there is an error with the data file not
> being
> > found, maybe i have missed some setting which needs to be done.
> >
> >
> >
> >
> >
> >
> > --
> > Regards,
> > Tanujit
> >
> >
> > > The Integration test is getting hanged on Mac OS X.
> > > ---------------------------------------------------
> > >
> > >                 Key: TAJO-15
> > >                 URL: https://issues.apache.org/jira/browse/TAJO-15
> > >             Project: Tajo
> > >          Issue Type: Bug
> > >         Environment: OS: Mac 10.8.3
> > > Both JVMs:
> > > {noformat}
> > > java version "1.6.0_43"
> > > Java(TM) SE Runtime Environment (build 1.6.0_43-b01-447-11M4203)
> > > Java HotSpot(TM) 64-Bit Server VM (build 20.14-b01-447, mixed mode)
> > > {noformat}
> > > {noformat}
> > > java version "1.7.0_10"
> > > Java(TM) SE Runtime Environment (build 1.7.0_10-b18)
> > > Java HotSpot(TM) 64-Bit Server VM (build 23.6-b04, mixed mode)
> > > {noformat}
> > >            Reporter: Hyunsik Choi
> > >            Assignee: Hyunsik Choi
> > >             Fix For: 0.2-incubating
> > >
> > >         Attachments: TAJO-15.patch
> > >
> > >
> > > The Integration test is getting hanged on Mac OS X. The below is the
> > unit test logs reported by Ashish.
> > > http://markmail.org/message/lknrqecc27v4thbb
> > > {noformat}
> > > 2013-03-28 16:42:39,039 INFO  capacity.CapacityScheduler
> > > (CapacityScheduler.java:completedContainer(776)) - Application
> > > appattempt_1364469093530_0002_000001 released container
> > > container_1364469093530_0002_01_000007 on node: host: a.b.c.d:60941
> > > #containers=0 available=4096 used=0 with event: FINISHED
> > > 2013-03-28 16:42:39,235 INFO  rmcontainer.RMContainerImpl
> > > (RMContainerImpl.java:handle(220)) -
> > container_1364469093530_0002_01_000008
> > > Container Transitioned from ALLOCATED to ACQUIRED
> > > 2013-03-28 16:42:39,236 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 1
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(175)) -
> > > ================================================================
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(177)) - > Container Id:
> > > container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(178)) - > Node Id:
> > > a.b.c.d:60945
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(179)) - > Resource (Mem):
> > 3072
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(180)) - > State : NEW
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(181)) - > Priority: 92
> > > 2013-03-28 16:42:39,237 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(183)) -
> > > ================================================================
> > > 2013-03-28 16:42:39,238 INFO  master.SubQuery
> > > (SubQuery.java:transition(713)) - SubQuery
> > > (sq_1364469093530_0002_000001_27) has 1 containers!
> > > 2013-03-28 16:42:39,238 INFO  master.TaskRunnerLauncherImpl
> > > (TaskRunnerLauncherImpl.java:launch(393)) - Launching Container with
> Id:
> > > container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:39,239 INFO  master.TaskRunnerLauncherImpl
> > > (TaskRunnerLauncherImpl.java:createContainerLaunchContext(301)) -
> > Completed
> > > setting up taskrunner command ${JAVA_HOME}/bin/java -Xmx2000m
> > > tajo.worker.TaskRunner a.b.c.d 58243 sq_1364469093530_0002_000001_27
> > > a.b.c.d:60945 container_1364469093530_0002_01_000008 1><LOG_DIR>/stdout
> > > 2><LOG_DIR>/stderr
> > > 2013-03-28 16:42:39,244 INFO  containermanager.ContainerManagerImpl
> > > (ContainerManagerImpl.java:startContainer(402)) - Start request for
> > > container_1364469093530_0002_01_000008 by user xxxxxxx
> > > 2013-03-28 16:42:39,245 INFO  nodemanager.NMAuditLogger
> > > (NMAuditLogger.java:logSuccess(89)) - USER=xxxxxxx IP=a.b.c.d
> > OPERATION=Start
> > > Container Request TARGET=ContainerManageImpl RESULT=SUCCESS
> > > APPID=application_1364469093530_0002
> > > CONTAINERID=container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:39,245 INFO  application.Application
> > > (ApplicationImpl.java:transition(255)) - Adding
> > > container_1364469093530_0002_01_000008 to application
> > > application_1364469093530_0002
> > > 2013-03-28 16:42:39,246 INFO  container.Container
> > > (ContainerImpl.java:handle(835)) - Container
> > > container_1364469093530_0002_01_000008 transitioned from NEW to
> > LOCALIZING
> > > 2013-03-28 16:42:39,246 INFO  master.TaskRunnerLauncherImpl
> > > (TaskRunnerLauncherImpl.java:launch(424)) - PullServer port returned by
> > > ContainerManager for container_1364469093530_0002_01_000008 : 60947
> > > 2013-03-28 16:42:39,246 INFO  containermanager.AuxServices
> > > (AuxServices.java:handle(160)) - Got event APPLICATION_INIT for appId
> > > application_1364469093530_0002
> > > 2013-03-28 16:42:39,246 INFO  containermanager.AuxServices
> > > (AuxServices.java:handle(164)) - Got APPLICATION_INIT for service
> > > tajo.pullserver
> > > 2013-03-28 16:42:39,246 INFO  master.Query (Query.java:handle(514)) -
> > > Processing q_1364469093530_0002_000001 of type INIT_COMPLETED
> > > 2013-03-28 16:42:39,246 INFO  container.Container
> > > (ContainerImpl.java:handle(835)) - Container
> > > container_1364469093530_0002_01_000008 transitioned from LOCALIZING to
> > > LOCALIZED
> > > 2013-03-28 16:42:39,247 INFO  util.RackResolver
> > > (RackResolver.java:coreResolve(100)) - Resolved L-IDC77TDV7M-M.local to
> > > /default-rack
> > > 2013-03-28 16:42:39,339 INFO  container.Container
> > > (ContainerImpl.java:handle(835)) - Container
> > > container_1364469093530_0002_01_000008 transitioned from LOCALIZED to
> > > RUNNING
> > > 2013-03-28 16:42:39,340 INFO  monitor.ContainersMonitorImpl
> > > (ContainersMonitorImpl.java:isEnabled(168)) - ResourceCalculatorPlugin
> is
> > > unavailable on this system.
> > >
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> > > is disabled.
> > > 2013-03-28 16:42:39,535 INFO  nodemanager.DefaultContainerExecutor
> > > (DefaultContainerExecutor.java:launchContainer(175)) - launchContainer:
> > > [bash,
> > >
> >
> /Users/xxxxxxx/opensource/tajo/incubator-tajo/tajo-core/tajo-core-backend/target/tajo.TajoTestingCluster/tajo.TajoTestingCluster-localDir-nm-1_0/usercache/xxxxxxx/appcache/application_1364469093530_0002/container_1364469093530_0002_01_000008/default_container_executor.sh]
> > > 2013-03-28 16:42:39,903 INFO  nodemanager.NodeStatusUpdaterImpl
> > > (NodeStatusUpdaterImpl.java:getNodeStatus(265)) - Sending out status
> for
> > > container: container_id {, app_attempt_id {, application_id {, id: 2,
> > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, }, state:
> > > C_RUNNING, diagnostics: "", exit_status: -1000,
> > > 2013-03-28 16:42:39,904 INFO  rmcontainer.RMContainerImpl
> > > (RMContainerImpl.java:handle(220)) -
> > container_1364469093530_0002_01_000008
> > > Container Transitioned from ACQUIRED to RUNNING
> > > 2013-03-28 16:42:40,020 WARN  nodemanager.DefaultContainerExecutor
> > > (DefaultContainerExecutor.java:launchContainer(193)) - Exit code from
> > task
> > > is : 1
> > > 2013-03-28 16:42:40,021 INFO  nodemanager.ContainerExecutor
> > > (ContainerExecutor.java:logOutput(167)) -
> > > 2013-03-28 16:42:40,021 WARN  launcher.ContainerLaunch
> > > (ContainerLaunch.java:call(274)) - Container exited with a non-zero
> exit
> > > code 1
> > > 2013-03-28 16:42:40,021 INFO  container.Container
> > > (ContainerImpl.java:handle(835)) - Container
> > > container_1364469093530_0002_01_000008 transitioned from RUNNING to
> > > EXITED_WITH_FAILURE
> > > 2013-03-28 16:42:40,021 INFO  launcher.ContainerLaunch
> > > (ContainerLaunch.java:cleanupContainer(300)) - Cleaning up container
> > > container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:40,040 INFO  nodemanager.DefaultContainerExecutor
> > > (DefaultContainerExecutor.java:deleteAsUser(273)) - Deleting absolute
> > path
> > > :
> > >
> >
> /Users/xxxxxxx/opensource/tajo/incubator-tajo/tajo-core/tajo-core-backend/target/tajo.TajoTestingCluster/tajo.TajoTestingCluster-localDir-nm-1_0/usercache/xxxxxxx/appcache/application_1364469093530_0002/container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:40,040 WARN  nodemanager.NMAuditLogger
> > > (NMAuditLogger.java:logFailure(150)) - USER=xxxxxxx OPERATION=Container
> > > Finished - Failed TARGET=ContainerImpl RESULT=FAILURE
> > DESCRIPTION=Container
> > > failed with state: EXITED_WITH_FAILURE
> > APPID=application_1364469093530_0002
> > > CONTAINERID=container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:40,041 INFO  container.Container
> > > (ContainerImpl.java:handle(835)) - Container
> > > container_1364469093530_0002_01_000008 transitioned from
> > > EXITED_WITH_FAILURE to DONE
> > > 2013-03-28 16:42:40,041 INFO  application.Application
> > > (ApplicationImpl.java:transition(298)) - Removing
> > > container_1364469093530_0002_01_000008 from application
> > > application_1364469093530_0002
> > > 2013-03-28 16:42:40,041 INFO  monitor.ContainersMonitorImpl
> > > (ContainersMonitorImpl.java:isEnabled(168)) - ResourceCalculatorPlugin
> is
> > > unavailable on this system.
> > >
> >
> org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
> > > is disabled.
> > > 2013-03-28 16:42:40,241 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:40,241 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:40,905 INFO  nodemanager.NodeStatusUpdaterImpl
> > > (NodeStatusUpdaterImpl.java:getNodeStatus(265)) - Sending out status
> for
> > > container: container_id {, app_attempt_id {, application_id {, id: 2,
> > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, }, state:
> > > C_COMPLETE, diagnostics: "\n", exit_status: 1,
> > > 2013-03-28 16:42:40,905 INFO  nodemanager.NodeStatusUpdaterImpl
> > > (NodeStatusUpdaterImpl.java:getNodeStatus(271)) - Removed completed
> > > container container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:40,906 INFO  rmcontainer.RMContainerImpl
> > > (RMContainerImpl.java:handle(220)) -
> > container_1364469093530_0002_01_000008
> > > Container Transitioned from RUNNING to COMPLETED
> > > 2013-03-28 16:42:40,906 INFO  fica.FiCaSchedulerApp
> > > (FiCaSchedulerApp.java:containerCompleted(219)) - Completed container:
> > > container_1364469093530_0002_01_000008 in state: COMPLETED
> event:FINISHED
> > > 2013-03-28 16:42:40,906 INFO  resourcemanager.RMAuditLogger
> > > (RMAuditLogger.java:logSuccess(98)) - USER=xxxxxxx OPERATION=AM
> Released
> > > Container TARGET=SchedulerApp RESULT=SUCCESS
> > > APPID=application_1364469093530_0002
> > > CONTAINERID=container_1364469093530_0002_01_000008
> > > 2013-03-28 16:42:40,906 INFO  fica.FiCaSchedulerNode
> > > (FiCaSchedulerNode.java:releaseContainer(150)) - Released container
> > > container_1364469093530_0002_01_000008 of capacity <memory:3072,
> > vCores:1>
> > > on host a.b.c.d:60945, which currently has 0 containers, <memory:0,
> > > vCores:0> used and <memory:4096, vCores:16> available, release
> > > resources=true
> > > 2013-03-28 16:42:40,906 INFO  capacity.LeafQueue
> > > (LeafQueue.java:releaseResource(1441)) - default used=<memory:0,
> > vCores:0>
> > > numContainers=0 user=xxxxxxx user-resources=<memory:0, vCores:0>
> > > 2013-03-28 16:42:40,907 INFO  capacity.LeafQueue
> > > (LeafQueue.java:completedContainer(1385)) - completedContainer
> > > container=Container: [ContainerId:
> > container_1364469093530_0002_01_000008,
> > > NodeId: a.b.c.d:60945, NodeHttpAddress: a.b.c.d:60948, Resource:
> > > <memory:3072, vCores:1>, Priority: 92, State: NEW, Token: null, Status:
> > > container_id {, app_attempt_id {, application_id {, id: 2,
> > > cluster_timestamp: 1364469093530, }, attemptId: 1, }, id: 8, }, state:
> > > C_COMPLETE, diagnostics: "\n", exit_status: 1, ] resource=<memory:3072,
> > > vCores:1> queue=default: capacity=1.0, absoluteCapacity=1.0,
> > > usedResources=<memory:0, vCores:0>usedCapacity=0.0,
> > > absoluteUsedCapacity=0.0, numApps=1, numContainers=0 usedCapacity=0.0
> > > absoluteUsedCapacity=0.0 used=<memory:0, vCores:0>
> cluster=<memory:12288,
> > > vCores:48>
> > > 2013-03-28 16:42:40,907 INFO  capacity.ParentQueue
> > > (ParentQueue.java:completedContainer(696)) - completedContainer
> > queue=root
> > > usedCapacity=0.0 absoluteUsedCapacity=0.0 used=<memory:0, vCores:0>
> > > cluster=<memory:12288, vCores:48>
> > > 2013-03-28 16:42:40,907 INFO  capacity.CapacityScheduler
> > > (CapacityScheduler.java:completedContainer(776)) - Application
> > > appattempt_1364469093530_0002_000001 released container
> > > container_1364469093530_0002_01_000008 on node: host: a.b.c.d:60945
> > > #containers=0 available=4096 used=0 with event: FINISHED
> > > 2013-03-28 16:42:41,242 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:41,242 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:42,245 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:42,246 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:43,248 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:43,249 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:44,251 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:44,252 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:45,255 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:45,256 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:46,259 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:46,260 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:47,263 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > 2013-03-28 16:42:47,264 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(173)) - Num of Allocated
> > > Containers: 0
> > > 2013-03-28 16:42:48,267 INFO  rm.RMContainerAllocator
> > > (RMContainerAllocator.java:makeRemoteRequest(172)) - Available
> Resource:
> > > <memory:6144, vCores:-1>
> > > {noformat}
> >
> > --
> > This message is automatically generated by JIRA.
> > If you think it was sent incorrectly, please contact your JIRA
> > administrators
> > For more information on JIRA, see:
> http://www.atlassian.com/software/jira
> >
>
