Do your nodes have enough resources for all of the requested components to start?
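One rough way to check this from the AM log is to compare `desired` against `actual` in the `RoleStatus` lines. The snippet below is only a sketch under assumptions: `unsatisfied_roles` and `ROLE_STATUS` are hypothetical helper names (not part of Slider), the field order is taken from the log quoted below, and the AM's own `slider-appmaster` role is skipped because it reports `actual=0` in that log.

```python
import re

# Fields may be separated by a mail-client line wrap ("\n> ") or a bare " > ".
SEP = r",[\s>]*"

# Field order assumed from the quoted log:
#   RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0, requested=1, ...
ROLE_STATUS = re.compile(
    r"RoleStatus\{name='(?P<name>[^']+)'" + SEP
    + r"key=\d+" + SEP
    + r"desired=(?P<desired>\d+)" + SEP
    + r"actual=(?P<actual>\d+)" + SEP
    + r"requested=(?P<requested>\d+)"
)


def unsatisfied_roles(log_text, ignore=("slider-appmaster",)):
    """Return {role: (desired, actual, requested)} for roles whose most
    recent RoleStatus entry shows fewer actual than desired instances."""
    latest = {}
    for m in ROLE_STATUS.finditer(log_text):
        # Later entries overwrite earlier ones, so the last status wins.
        latest[m["name"]] = (
            int(m["desired"]),
            int(m["actual"]),
            int(m["requested"]),
        )
    return {
        name: status
        for name, status in latest.items()
        if name not in ignore and status[1] < status[0]
    }
```

A role that shows up here with `requested` greater than zero has a container request that YARN has not yet satisfied, which is what you would see if no node can currently fit the ask.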
On Tue, Oct 28, 2014 at 11:40 AM, Rui Zhang <[email protected]> wrote:
> Made the fix but still cannot make it work.
> Actually, the steps to reproduce in SLIDER-439 are different from mine.
> What I do is first use the "freeze" command and then kill one node manager,
> wait long enough for the node manager to leave the YARN cluster, and then
> use the "thaw" command to restart.
> However, the instance that was running on the killed node is not able to
> restart.
>
> Here is part of the log.
>
> 14/10/28 18:25:42 INFO mortbay.log: Logging to org.slf4j.impl.
> Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
> 14/10/28 18:25:42 INFO zookeeper.ClientCnxn: Session establishment
> complete on server vertica1/172.17.42.1:16433, sessionid =
> 0x14957f07d6f011f, negotiated timeout = 40000
> 14/10/28 18:25:42 INFO state.ConnectionStateManager: State change:
> CONNECTED
> 14/10/28 18:25:42 INFO mortbay.log: jetty-6.1.26
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.PackagesResourceConfig
> init
> INFO: Scanning for root resource and provider classes in the packages:
> org.apache.slider.server.appmaster.web.rest.agent
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
> logClasses
> INFO: Root resource classes found:
> class org.apache.slider.server.appmaster.web.rest.agent.AgentWebServices
> Oct 28, 2014 6:25:42 PM com.sun.jersey.api.core.ScanningResourceConfig
> init
> INFO: No provider classes found.
> Oct 28, 2014 6:25:42 PM
> com.sun.jersey.server.impl.application.WebApplicationImpl
> _initiate
> INFO: Initiating Jersey application, version 'Jersey: 1.9 09/02/2011 11:17
> AM'
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0.0:46561
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0.0:36451
> 14/10/28 18:25:43 INFO http.HttpRequestLog: Http request log for
> http.requests.slideram is not defined
> 14/10/28 18:25:43 INFO http.HttpServer2: Added global filter 'safety'
> (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
> 14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
> (class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
> context slideram
> 14/10/28 18:25:43 INFO http.HttpServer2: Added filter AM_PROXY_FILTER
> (class=org.apache.slider.server.appmaster.web.SliderAmIpFilter) to
> context static
> 14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /slideram/*
> 14/10/28 18:25:43 INFO http.HttpServer2: adding path spec: /ws/*
> 14/10/28 18:25:43 INFO http.HttpServer2: Jetty bound to port 47481
> 14/10/28 18:25:43 INFO mortbay.log: jetty-6.1.26
> 14/10/28 18:25:43 INFO mortbay.log: Extract jar:file:/home/rzhang/Slider_
> Vertica/Linux64/Test/verticadb1000/HDP2_1/hadoop/local/usercache/rzhang/
> appcache/application_1414519516219_0002/filecache/18/slider.jar!/webapps/slideram
> to /tmp/Jetty_0_0_0_0_47481_slideram____.7p4s4g/webapp
> 14/10/28 18:25:43 INFO mortbay.log: Started [email protected].
> 0:47481
> 14/10/28 18:25:43 INFO webapp.WebApps: Web app /slideram started at 47481
> 14/10/28 18:25:43 INFO webapp.WebApps: Registered webapp guice modules
> 14/10/28 18:25:43 INFO appmaster.SliderAppMaster: Connecting to RM at
> 46522,address tracking URL=http://vertica2.rzhang.com:47481
> 14/10/28 18:25:43 INFO agent.AgentUtils: Reading metainfo at
> hdfs://rzhang-HP-ZBook-15:10433/slider/slider_test.zip
> 14/10/28 18:25:44 INFO tools.SliderUtils: Reading metainfo.xml of size 3193
> 14/10/28 18:25:44 INFO agent.HeartbeatMonitor: Starting heartbeat monitor
> with interval 60000
> 14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_MASTER
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER assigned
> priority 1
> 14/10/28 18:25:44 INFO state.AppState: Adding new role VERTICA_SLAVE
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE assigned
> priority 2
> 14/10/28 18:25:44 INFO state.AppState: Role slider-appmaster flexed from 0
> to 1
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_SLAVE flexed from 0 to
> 2
> 14/10/28 18:25:44 INFO state.AppState: Role VERTICA_MASTER flexed from 0
> to 1
> 14/10/28 18:25:44 INFO state.RoleHistory: loaded history from
> hdfs://rzhang-HP-ZBook-15:10433/user/rzhang/.slider/
> cluster/slider_test/history/rolehistory-0000014957f14d86.json
> 14/10/28 18:25:44 INFO appmaster.SliderAppMaster: service instances
> already running: []
> 14/10/28 18:25:44 INFO curator.RegistryBinderService: registering
> ServiceInstance{name='org-apache-slider', id='slider_test',
> address='172.17.0.3', port=47481, sslPort=null,
> payload=ServiceInstanceData{id='slider_test',
> serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
> serviceType=DYNAMIC, uriSpec=org.apache.curator.x.
> discovery.UriSpec@54515c2}
> 14/10/28 18:25:45 INFO curator.RegistryBinderService: registration
> completed ServiceInstance{name='org-apache-slider', id='slider_test',
> address='172.17.0.3', port=47481, sslPort=null,
> payload=ServiceInstanceData{id='slider_test',
> serviceType='org-apache-slider'}, registrationTimeUTC=1414520744939,
> serviceType=DYNAMIC, uriSpec=org.apache.curator.x.
> discovery.UriSpec@54515c2}
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Chaos monkey disabled
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Adding Chaos Monkey
> scheduled every 0 seconds (0 hours)
> 14/10/28 18:25:45 INFO workflow.WorkflowCompositeService: Child service
> completed Service SliderAMProviderService in state SliderAMProviderService:
> STOPPED; current service null; queued service count=0
> 14/10/28 18:25:45 INFO appmaster.SliderAppMaster: Process has exited with
> exit code 0 mapped to 0 -ignoring
> 14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_SLAVE',
> key=2, desired=2, actual=0, requested=0, releasing=0, failed=0, started=0,
> startFailed=0, completed=0, failureMessage=''}
> 14/10/28 18:25:45 INFO state.AppState: VERTICA_SLAVE: Asking for 2 more
> nodes(s) for a total of 2
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 2 nodes to consider for
> VERTICA_SLAVE
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica2.rzhang.com
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[2]
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
> VERTICA_SLAVE
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica0.rzhang.com
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[2]
> 14/10/28 18:25:45 INFO state.AppState: RoleStatus{name='VERTICA_MASTER',
> key=1, desired=1, actual=0, requested=0, releasing=0, failed=0, started=0,
> startFailed=0, completed=0, failureMessage=''}
> 14/10/28 18:25:45 INFO state.AppState: VERTICA_MASTER: Asking for 1 more
> nodes(s) for a total of 1
> 14/10/28 18:25:45 INFO state.RoleHistory: There're 1 nodes to consider for
> VERTICA_MASTER
> 14/10/28 18:25:45 INFO state.OutstandingRequest: Submitting request for
> container on vertica1
> 14/10/28 18:25:45 INFO state.AppState: Container ask is
> Capability[<memory:1024, vCores:1>]Priority[1]
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica2.rzhang.com to
> /default-rack
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica0.rzhang.com to
> /default-rack
> 14/10/28 18:25:45 INFO util.RackResolver: Resolved vertica1 to
> /default-rack
> 14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
> vertica0.rzhang.com:54106
> 14/10/28 18:25:46 INFO impl.AMRMClientImpl: Received new token for :
> vertica2.rzhang.com:41175
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: onContainersAllocated(2)
> 14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
> container container_1414519516219_0002_01_000002, on
> vertica0.rzhang.com:54106,
> 14/10/28 18:25:46 INFO state.AppState: Assigning role VERTICA_SLAVE to
> container container_1414519516219_0002_01_000003, on
> vertica2.rzhang.com:41175,
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Diagnostics:
> RoleStatus{name='slider-appmaster', key=0, desired=1, actual=0,
> requested=0, releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
> RoleStatus{name='VERTICA_SLAVE', key=2, desired=2, actual=2, requested=0,
> releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
> RoleStatus{name='VERTICA_MASTER', key=1, desired=1, actual=0,
> requested=1, releasing=0, failed=0, started=0, startFailed=0, completed=0,
> failureMessage=''}
>
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
> for Agent
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Build launch context
> for Agent
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
> $PWD
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
> $LOG_DIRS
> 14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
> ./infra/agent/slider-agent/
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_WORK_ROOT set to
> $PWD
> 14/10/28 18:25:46 INFO agent.AgentProviderService: AGENT_LOG_ROOT set to
> $LOG_DIRS
> 14/10/28 18:25:46 INFO agent.AgentProviderService: PYTHONPATH set to
> ./infra/agent/slider-agent/
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Using
> ./infra/agent/slider-agent/agent/main.py for agent.
> 14/10/28 18:25:46 INFO agent.AgentProviderService: Using
> ./infra/agent/slider-agent/agent/main.py for agent.
> 14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
> with command: python ./infra/agent/slider-agent/agent/main.py --label
> container_1414519516219_0002_01_000002___VERTICA_SLAVE --zk-quorum
> rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
> ;
> 14/10/28 18:25:46 INFO appmaster.RoleLaunchService: Starting container
> with command: python ./infra/agent/slider-agent/agent/main.py --label
> container_1414519516219_0002_01_000003___VERTICA_SLAVE --zk-quorum
> rzhang-HP-ZBook-15:16433 --zk-reg-path /registry/org-apache-slider/slider_test
> ;
> 14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> START_CONTAINER for Container container_1414519516219_0002_01_000002
> 14/10/28 18:25:46 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> START_CONTAINER for Container container_1414519516219_0002_01_000003
> 14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy : vertica0.rzhang.com:54106
> 14/10/28 18:25:46 INFO impl.ContainerManagementProtocolProxy: Opening
> proxy : vertica2.rzhang.com:41175
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
> container_1414519516219_0002_01_000002
> 14/10/28 18:25:46 INFO appmaster.SliderAppMaster: Started Container
> container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
> role VERTICA_SLAVE onto container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
> container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> QUERY_CONTAINER for Container container_1414519516219_0002_01_000002
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Deployed instance of
> role VERTICA_SLAVE onto container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO appmaster.SliderAppMaster: Registering component
> container_1414519516219_0002_01_000003
> 14/10/28 18:25:47 INFO impl.NMClientAsyncImpl: Processing Event EventType:
> QUERY_CONTAINER for Container container_1414519516219_0002_01_000003
>
> Thanks,
> Rui
>
> On 10/28/2014 01:47 PM, Sumit Mohanty wrote:
>
>> There is a bug fix that went in a few days back -
>> https://issues.apache.org/jira/browse/SLIDER-439 - that specifically
>> fixed this issue.
>>
>> thanks
>> -Sumit
>>
>> On Tue, Oct 28, 2014 at 10:36 AM, Rui Zhang <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> When I killed a node manager manually and restarted the application, it
>>> seems that an instance that previously ran on that node manager is not
>>> able to restart. Why is this? I think YARN should allocate a container
>>> on a different machine for this instance, right?
>>>
>>> Thanks,
>>> Rui
>>>
>>> --
>>> Rui Zhang
>>> Software engineer Intern
>>> Vertica, an HP Company
>>> [email protected]
>>>
> --
> Rui Zhang
> Software engineer Intern
> Vertica, an HP Company
> [email protected]
