Hi guys, any update on this?
Best

On Wed, Apr 20, 2016 at 3:00 AM, Niranda Perera <niranda.per...@gmail.com> wrote:

> Hi Reynold,
>
> I have created a JIRA for this [1]. I have also created a PR for the same
> issue [2].
>
> I would be very grateful if you could look into this, because it is a
> blocker in our Spark deployment, which uses a number of custom Spark
> extensions.
>
> thanks
> best
>
> [1] https://issues.apache.org/jira/browse/SPARK-14736
> [2] https://github.com/apache/spark/pull/12506
>
> On Mon, Apr 18, 2016 at 9:02 AM, Reynold Xin <r...@databricks.com> wrote:
>
>> I haven't looked closely at this, but I think your proposal makes sense.
>>
>> On Sun, Apr 17, 2016 at 6:40 PM, Niranda Perera <niranda.per...@gmail.com> wrote:
>>
>>> Hi guys,
>>>
>>> Any update on this?
>>>
>>> Best
>>>
>>> On Tue, Apr 12, 2016 at 12:46 PM, Niranda Perera <niranda.per...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I have encountered a small issue in the standalone recovery mode.
>>>>
>>>> Let's say there was an application A running in the cluster. Due to
>>>> some issue, the entire cluster, together with application A, goes down.
>>>>
>>>> Later on, the cluster comes back online and the master goes into the
>>>> 'recovering' mode, because it sees from the Persistence Engine that some
>>>> apps, workers and drivers were already in the cluster. During the
>>>> recovery process the application comes back online, but now it has a
>>>> different ID, let's say B.
>>>>
>>>> But then, as per the master's application registration logic, this
>>>> application B will NOT be added to 'waitingApps', and the master logs
>>>> the message "Attempted to re-register application at same address" [1].
>>>>
>>>>   private def registerApplication(app: ApplicationInfo): Unit = {
>>>>     val appAddress = app.driver.address
>>>>     if (addressToApp.contains(appAddress)) {
>>>>       logInfo("Attempted to re-register application at same address: " + appAddress)
>>>>       return
>>>>     }
>>>>     // ... (rest of the method omitted)
>>>>
>>>> The problem here is that the master is trying to recover application A,
>>>> which is not there anymore. Therefore, after the recovery process, app A
>>>> will be dropped. However, app A's successor, app B, was also omitted
>>>> from the 'waitingApps' list because it had the same address as app A
>>>> had previously.
>>>>
>>>> This creates a deadlock in the cluster: neither app A nor app B is
>>>> available in the cluster.
>>>>
>>>> When the master is in the RECOVERING mode, shouldn't it add all the
>>>> registering apps to a list first, and then, after the recovery is
>>>> completed (once the unsuccessful recoveries are removed), deploy the
>>>> apps which are new?
>>>>
>>>> I think this would resolve the deadlock.
>>>>
>>>> Look forward to hearing from you.
>>>>
>>>> best
>>>>
>>>> [1] https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/master/Master.scala#L834
>>>>
>>>> --
>>>> Niranda
>>>> @n1r44 <https://twitter.com/N1R44>
>>>> +94-71-554-8430
>>>> https://pythagoreanscript.wordpress.com/

--
Niranda
@n1r44 <https://twitter.com/N1R44>
+94-71-554-8430
https://pythagoreanscript.wordpress.com/
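
To make the proposal in the original message concrete, here is a minimal sketch of the "defer registrations while recovering" idea. It is not Spark's Master code and not the patch in the PR linked above. MasterSketch, AppInfo, deferredApps and respondedIds are hypothetical names used only for illustration, and the model assumes that recovery simply drops every stored app whose driver did not respond.

```scala
import scala.collection.mutable

// Hypothetical, simplified stand-ins for the master's state. None of these
// names are Spark's real classes; they only model the scenario in the thread.
final case class AppInfo(id: String, driverAddress: String)

class MasterSketch {
  private var recovering = false
  private val addressToApp = mutable.HashMap.empty[String, AppInfo]
  // Registrations that arrived while the master was still recovering.
  private val deferredApps = mutable.ArrayBuffer.empty[AppInfo]
  val waitingApps = mutable.ArrayBuffer.empty[AppInfo]

  /** Re-adds the apps read back from the persistence engine. */
  def beginRecovery(storedApps: Seq[AppInfo]): Unit = {
    recovering = true
    storedApps.foreach { app =>
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  def registerApplication(app: AppInfo): Unit = {
    if (recovering) {
      // Assumption: postpone the address check until stale apps are gone.
      deferredApps += app
    } else if (addressToApp.contains(app.driverAddress)) {
      println(s"Attempted to re-register application at same address: ${app.driverAddress}")
    } else {
      addressToApp(app.driverAddress) = app
      waitingApps += app
    }
  }

  /** `respondedIds`: ids of stored apps whose drivers answered during recovery. */
  def completeRecovery(respondedIds: Set[String]): Unit = {
    // Drop stored apps that never responded (app A in the scenario).
    val stale = addressToApp.values.filterNot(a => respondedIds.contains(a.id)).toList
    stale.foreach { a =>
      addressToApp.remove(a.driverAddress)
      waitingApps -= a
    }
    recovering = false
    // Replay deferred registrations against the cleaned-up address map.
    val pending = deferredApps.toList
    deferredApps.clear()
    pending.foreach(registerApplication)
  }
}
```

In this model, if app A is restored from persistence, app B registers from the same address during recovery, and A never responds, then B ends up in waitingApps once completeRecovery runs. If A does respond, B is still rejected by the same-address check, which matches the existing behaviour outside of recovery.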