Do your nodes have enough resources for all of the requested components to
start?
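
One rough way to check this from the AM log is to compare `desired` and `actual` in the `RoleStatus{...}` entries. The sketch below is only an illustration, not part of Slider: the function names are made up, and it assumes the entries look like the ones in the quoted log (including the "> " quoting and line wraps).

```python
import re

# RoleStatus{...} entries as they appear in the quoted AM log. The field
# names come from that log; the parsing approach itself is an assumption.
ROLE_STATUS = re.compile(r"RoleStatus\{(.*?)\}", re.DOTALL)
FIELD = re.compile(r"(\w+)=('[^']*'|\d+)")


def parse_role_status(log_text):
    """Return {role_name: {field: value}}, keeping the last entry per role."""
    roles = {}
    for body in ROLE_STATUS.findall(log_text):
        fields = {}
        for key, raw in FIELD.findall(body):
            # Quoted values stay strings, bare digits become integers.
            fields[key] = raw.strip("'") if raw.startswith("'") else int(raw)
        roles[fields["name"]] = fields
    return roles


def unsatisfied_roles(roles):
    """Return {role_name: missing_count} for roles with actual < desired."""
    return {
        name: f["desired"] - f["actual"]
        for name, f in roles.items()
        if f["actual"] < f["desired"]
    }
```

Calling `unsatisfied_roles(parse_role_status(open(path).read()))` on a saved copy of the log (where `path` is wherever you saved it) lists the roles that are still short of containers. A role that stays at `requested=1, actual=0` has a container request that YARN has not yet satisfied.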

On Tue, Oct 28, 2014 at 11:40 AM, Rui Zhang <[email protected]> wrote:

> I applied the fix, but it still does not work.
> Actually, the steps to reproduce in SLIDER-439 are different from mine.
> What I do is first use the "freeze" command and then kill one node manager.
> I wait long enough for the node manager to leave the YARN cluster, and then
> use the "thaw" command to restart.
> However, the instance that was running on the killed node is not able to
> restart.
>
> Here is part of the log.
>
> 14/10/28 18:25:42 INFO mortbay.log: Logging to org.slf4j.impl.
> Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
> 14/10/28 18:25:42 INFO zookeeper.ClientCnxn: Session establishment
> complete on server vertica1/172.17.42.1:16433, sessionid =
> 0x14957f07d6f011f, negotiated timeout = 40000
> 14/10/28 18:25:42 INFO state.ConnectionStateManager: State change:
> CONNECTED
> 14/10/28 18:25:42 INFO mortbay.log: jetty-6.1.26
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.PackagesResourceConfig
> init
> INFO: Scanning for root resource and provider classes in the packages:
>   org.apache.slider.server.appmaster.web.rest.agent
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
> logClasses
> INFO: Root resource classes found:
>   class org.apache.slider.server.appmaster.web.rest.agent.AgentWebServices
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
> init
> INFO: No provider classes found.
> Oct 28, 2014 6:25:42 PM 
> com.sun.jersey.server.impl.application.WebApplicationImpl
> _initiate
> INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17
> AM'
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0.0:46561
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0.0:36451
> 14/10/28 18:25:43 INFO http.HttpRequestLog: Http request log for
> http.requests.slideram is not defined
> 14/10/28 18:25:43 INFO http.HttpServer2: Added global filter 'safety'
> (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
> (class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
> context slideram
> 14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
> (class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
> context static
> 14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /slideram/*
> 14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /ws/*
> 14/10/28 18:25:43 INFO http.HttpServer2: Jetty bound to port 47481
> 14/10/28 18:25:43 INFO mortbay.log: jetty-6.1.26
> 14/10/28 18:25:43 INFO mortbay.log: Extract jar:file:/home/rzhang/Slider_
> Vertica/Linux64/Test/verticadb1000/HDP2_1/hadoop/local/usercache/rzhang/
> appcache/application_1414519516219_0002/filecache/18/slider.jar!/webapps/slideram
> to /tmp/Jetty_0_0_0_0_47481_slideram____.7p4s4g/webapp
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0:47481
> 14/10/28 18:25:43 INFO webapp.WebApps: Web app /slideram started at 47481
> 14/10/28 18:25:43 INFO webapp.WebApps: Registered webapp guice modules
> 14/10/28 18:25:43 INFO appmaster.SliderAppMaster: Connecting to RM at
> 46522,address tracking URL=http://vertica2.rzhang.com:47481
> 14/10/28 18:25:43 INFO agent.AgentUtils: Reading metainfo at
> hdfs://rzhang-HP-ZBook-15:10433/slider/slider_test.zip
> 14/10/28 18:25:44 INFO tools.SliderUtils: Reading metainfo.xml of size 3193
> 14/10/28 18:25:44 INFO agent.HeartbeatMonitor: Starting heartbeat monitor
> with interval 60000
> 14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_MASTER
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER assigned
> priority 1
> 14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_SLAVE
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE assigned
> priority 2
> 14/10/28 18:25:44 INFO state.AppState: Role slider-appmaster flexed from 0
> to 1
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE flexed from 0 to
> 2
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER flexed from 0
> to 1
> 14/10/28 18:25:44 INFO state.RoleHistory: loaded history from
> hdfs://rzhang-HP-ZBook-15:10433/user/rzhang/.slider/
> cluster/slider_test/history/rolehistory-0000014957f14d86.json
> 14/10/28 18:25:44 INFO appmaster.SliderAppMaster: service instances
> already running: []
> 14/10/28 18:25:44 INFO curator.RegistryBinderService: registering
> ServiceInstance{name='org-apache-slider', id='slider_test',
> address='172.17.0.3', port=47481, sslPort=null, 
> payload=ServiceInstanceData{id='slider_test',
> serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
> serviceType=DYNAMIC, uriSpec=org.apache.curator.x.
> discovery.UriSpec@54515c2}
> 14/10/28 18:25:45 INFO curator.RegistryBinderService: registration
> completed ServiceInstance{name='org-apache-slider', id='slider_test',
> address='172.17.0.3', port=47481, sslPort=null, 
> payload=ServiceInstanceData{id='slider_test',
> serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
> serviceType=DYNAMIC, uriSpec=org.apache.curator.x.
> discovery.UriSpec@54515c2}
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Chaos monkey disabled
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Adding Chaos Monkey
> scheduled every 0 seconds (0 hours)
> 14/10/28 18:25:45 INFO workflow.WorkflowCompositeService: Child service
> completed Service SliderAMProviderService in state SliderAMProviderService:
> STOPPED; current service null; queued service count=0
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Process has exited with
> exit code 0 mapped to 0 -ignoring
> 14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_SLAVE',
> key=2, desired=2, actual=0, requested=0, releasing=0, failed=0, started=0,
> startFailed=0, completed=0, failureMessage=''}
> 14/10/28 18:25:45 INFO state.AppState: VERTICA_SLAVE: Asking for 2 more
> nodes(s) for a total of 2
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 2 nodes to consider for
> VERTICA_SLAVE
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica2.rzhang.com
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[2]
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
> VERTICA_SLAVE
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica0.rzhang.com
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[2]
> 14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_MASTER',
> key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, started=0,
> startFailed=0, completed=0, failureMessage=''}
> 14/10/28 18:25:45 INFO state.AppState: VERTICA_MASTER: Asking for 1 more
> nodes(s) for a total of 1
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
> VERTICA_MASTER
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica1
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[1]
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica2.rzhang.com to
> /default-rack
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica0.rzhang.com to
> /default-rack
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica1 to
> /default-rack
> 14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
> vertica0.rzhang.com:54106
> 14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
> vertica2.rzhang.com:41175
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: onContainersAllocated(2)
> 14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
> container container_1414519516219_0002_01_000002, on
> vertica0.rzhang.com:54106,
> 14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
> container container_1414519516219_0002_01_000003, on
> vertica2.rzhang.com:41175,
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Diagnostics:
> RoleStatus{name='slider-appmaster', key=0, desired=1, actual=0,
> requested=0, releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
> RoleStatus{name='VERTICA_SLAVE', key=2, desired=2, actual=2, requested=0,
> releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
> RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0,
> requested=1, releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
>
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
> for Agent
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
> for Agent
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
> $PWD
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
> $LOG_DIRS
> 14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
> ./infra/agent/slider-agent/
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
> $PWD
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
> $LOG_DIRS
> 14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
> ./infra/agent/slider-agent/
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Using
> ./infra/agent/slider-agent/agent/main.py for agent.
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Using
> ./infra/agent/slider-agent/agent/main.py for agent.
> 14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
> with command: python ./infra/agent/slider-agent/agent/main.py --label
> container_1414519516219_0002_01_000002___VERTICA_SLAVE --zk-quorum
> rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
> ;
> 14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
> with command: python ./infra/agent/slider-agent/agent/main.py --label
> container_1414519516219_0002_01_000003___VERTICA_SLAVE --zk-quorum
> rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
> ;
> 14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> START_CONTAINER for Container container_1414519516219_0002_01_000002
> 14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> START_CONTAINER for Container container_1414519516219_0002_01_000003
> 14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy : vertica0.rzhang.com:54106
> 14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy : vertica2.rzhang.com:41175
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
> container_1414519516219_0002_01_000002
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
> container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
> role VERTICA_SLAVE onto container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
> container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> QUERY_CONTAINER for Container container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
> role VERTICA_SLAVE onto container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
> container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> QUERY_CONTAINER for Container container_1414519516219_0002_01_000003
>
> Thanks,
> Rui
>
>
>
>
> On 10/28/2014 01:47 PM, Sumit Mohanty wrote:
>
>> There is a bug fix that went in a few days back -
>> https://issues.apache.org/jira/browse/SLIDER-439 - that specifically
>> fixed this issue.
>>
>> thanks
>> -Sumit
>>
>> On Tue, Oct 28, 2014 at 10:36 AM, Rui Zhang <[email protected]> wrote:
>>
>>  Hi,
>>>
>>> When I killed a node manager manually and restarted the application, it
>>> seems that an instance that previously ran on that node manager was not
>>> able to restart. Why is this? I think YARN should allocate a container on
>>> a different machine for this instance, right?
>>>
>>> Thanks,
>>> Rui
>>>
>>> --
>>> Rui Zhang
>>> Software engineer Intern
>>> Vertica, an HP Company
>>> [email protected]
>>>
>>>
>>>
> --
> Rui Zhang
> Software engineer Intern
> Vertica, an HP Company
> [email protected]
>
>
