Yogi, kill is not an orderly shutdown; who will clean the state?
On Tue, Feb 2, 2016 at 8:38 AM, Yogi Devendra <[email protected]> wrote:

> I would prefer to have an additional argument during application launch
> on dtcli.
>
> Say, --preserve-kill-state true.
>
> Basically, the platform should be able to do the clean-up activity if
> the application is invoked with a certain flag.
>
> Test apps can set this flag to clear the data on kill. Production apps
> can set this flag to keep the data on kill.
>
> Shutdown should always preserve the state. But for kill / forced
> shutdown, the user might prefer to clear the state.
>
> ~ Yogi
>
> On 2 February 2016 at 21:53, Amol Kekre <[email protected]> wrote:
>
>> Can we include a script in our github (util?) that simply deletes
>> these files upon the application being killed, given an app-id? The
>> admin will need to run this script. Auto-deleting would be bad, as a
>> lot of users, including those in production today, need to restart
>> using those files. The knowledge/desire to restart post failure is
>> outside the app, and hence technically the script should be
>> explicitly user invoked.
>>
>> Thks,
>> Amol
>>
>> On Tue, Feb 2, 2016 at 6:12 AM, Pramod Immaneni <[email protected]> wrote:
>>
>>> Hi Venkat,
>>>
>>> There are typically a small number of outstanding checkpoint files
>>> per operator; as newer checkpoints are created, old ones are
>>> automatically deleted by the application when it determines that
>>> state is no longer needed. When an application stops or is killed,
>>> the last checkpoints remain. There is also a benefit to that, since
>>> a new application can be restarted to continue from those
>>> checkpoints instead of starting all the way from the beginning,
>>> which is useful in some cases. But if you always start your
>>> application from scratch, yes, you can delete the checkpoints of
>>> older applications that are no longer running.
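[Editor's note: the explicitly user-invoked cleanup script Amol suggests above might look roughly like the sketch below. The checkpoint path layout (`/user/<user>/datatorrent/apps/<app-id>/checkpoints`) is an assumption about the default application directory; verify it against your cluster's configuration before using this for real.]

```shell
#!/usr/bin/env bash
set -eu

# Sketch of an admin cleanup script: delete the leftover checkpoint
# directory of a killed application, given its YARN application id.
# HDFS is overridable so the script can be dry-run; word splitting on
# it is intentional.
HDFS=${HDFS:-hdfs}

cleanup_app() {
  app_id="$1"
  # Refuse anything that does not look like a YARN application id.
  case "$app_id" in
    application_*) ;;
    *) echo "error: '$app_id' does not look like a YARN app id" >&2; return 1 ;;
  esac
  # Assumed default application directory layout; adjust as needed.
  app_dir="/user/${USER:-$(whoami)}/datatorrent/apps/${app_id}"
  # -skipTrash: checkpoints are transient state, no point keeping them in trash.
  $HDFS dfs -rm -r -skipTrash "${app_dir}/checkpoints"
}

# Dry-run demo: prints the hdfs command instead of executing it.
HDFS="echo hdfs" cleanup_app application_1454000000000_0001
```

Run for real as `./cleanup_app.sh application_<cluster-ts>_<seq>` on a node with an HDFS client; the guard on the app-id format keeps a typo from deleting an unrelated directory.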
>>> Thanks
>>>
>>> On Mon, Feb 1, 2016 at 10:19 PM, Kottapalli, Venkatesh <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> Now that this has been discussed, will the checkpointed data be
>>>> purged when we kill the application forcefully? In our current
>>>> usage, we forcefully kill the app after it processes a certain
>>>> batch of data. I see these small files are created under the
>>>> (user/datatorrent) directory and not removed.
>>>>
>>>> Another scenario: when some of the containers keep failing, we have
>>>> observed this state where the data is continuously checkpointed
>>>> into small files. When we kill the app, the data will still be
>>>> there.
>>>>
>>>> We have received concerns saying this is impacting namenode
>>>> performance, since these small files are stored in HDFS. So we
>>>> manually remove this checkpointed data at regular intervals.
>>>>
>>>> -Venkatesh
>>>>
>>>> -----Original Message-----
>>>> From: Amol Kekre [mailto:[email protected]]
>>>> Sent: Monday, February 01, 2016 7:49 AM
>>>> To: [email protected]; [email protected]
>>>> Subject: Re: Possibility of saving checkpoints on other distributed filesystems
>>>>
>>>> Aniruddha,
>>>> We have not heard this request from users yet. It may be because
>>>> our checkpointing has a purge, i.e. the small files are not left
>>>> over. The small file problem has been there in Hadoop and relates
>>>> to storing small files in Hadoop for a long time (more likely
>>>> forever).
>>>>
>>>> Thks,
>>>> Amol
>>>>
>>>> On Mon, Feb 1, 2016 at 6:05 AM, Aniruddha Thombare <[email protected]> wrote:
>>>>
>>>>> Hi Community,
>>>>>
>>>>> Or let me say BigFoots, do you think this feature should be available?
>>>>> The reason to bring this up was discussed at the start of this thread as:
>>>>>
>>>>>> This is with the intention to recover the applications faster and
>>>>>> do away with HDFS's small files problem as described here:
>>>>>>
>>>>>> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>>>>>> http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
>>>>>> http://inquidia.com/news-and-info/working-small-files-hadoop-part-1
>>>>>>
>>>>>> If we could save checkpoints in some other distributed file
>>>>>> system (or even a HA NAS box) geared for small files, we could
>>>>>> achieve:
>>>>>>
>>>>>> - Better performance of NN & HDFS for production usage (read:
>>>>>>   production data I/O & not temp files)
>>>>>> - Faster application recovery in case of planned shutdown /
>>>>>>   unplanned restarts
>>>>>
>>>>> If you feel the need for this feature, please cast your opinions
>>>>> and ideas so that it can be converted into a jira.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Aniruddha
>>>>>
>>>>> On Thu, Jan 21, 2016 at 11:19 PM, Gaurav Gupta <[email protected]> wrote:
>>>>>
>>>>>> Aniruddha,
>>>>>>
>>>>>> Currently we don't have any support for that.
>>>>>>
>>>>>> Thanks
>>>>>> -Gaurav
>>>>>>
>>>>>> On Thu, Jan 21, 2016 at 12:24 AM, Tushar Gosavi <[email protected]> wrote:
>>>>>>
>>>>>>> The default FSStorageAgent can be used, as it can work with the
>>>>>>> local filesystem, but as far as I know there is no support for
>>>>>>> specifying the directory through the xml file; by default it
>>>>>>> uses the application directory on HDFS.
>>>>>>> Not sure if we could specify the storage agent with its
>>>>>>> properties through the configuration at the dag level.
>>>>>>>
>>>>>>> - Tushar.
>>>>>>>
>>>>>>> On Thu, Jan 21, 2016 at 12:14 PM, Aniruddha Thombare <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Do we have any storage agent which I can use readily,
>>>>>>>> configurable through dt-site.xml?
>>>>>>>>
>>>>>>>> I am looking for something which would save checkpoints in a
>>>>>>>> mounted file system [eg. HA-NAS], which is basically just
>>>>>>>> another directory for Apex.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Aniruddha
>>>>>>>>
>>>>>>>> On Wed, Jan 20, 2016 at 8:33 PM, Sandesh Hegde <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It is already supported; refer to the following jira for more
>>>>>>>>> information:
>>>>>>>>>
>>>>>>>>> https://issues.apache.org/jira/browse/APEXCORE-283
>>>>>>>>>
>>>>>>>>> On Tue, Jan 19, 2016 at 10:43 PM, Aniruddha Thombare <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is it possible to save checkpoints in any other highly
>>>>>>>>>> available distributed file system (which may be mounted
>>>>>>>>>> directories across the cluster) other than HDFS?
>>>>>>>>>> If yes, is it configurable?
>>>>>>>>>>
>>>>>>>>>> AFAIK, there is no configurable option available to achieve
>>>>>>>>>> that. If that's the case, can we have that feature?
>>>>>>>>>> This is with the intention to recover the applications faster
>>>>>>>>>> and do away with HDFS's small files problem as described here:
>>>>>>>>>>
>>>>>>>>>> http://blog.cloudera.com/blog/2009/02/the-small-files-problem/
>>>>>>>>>> http://snowplowanalytics.com/blog/2013/05/30/dealing-with-hadoops-small-files-problem/
>>>>>>>>>> http://inquidia.com/news-and-info/working-small-files-hadoop-part-1
>>>>>>>>>>
>>>>>>>>>> If we could save checkpoints in some other distributed file
>>>>>>>>>> system (or even a HA NAS box) geared for small files, we
>>>>>>>>>> could achieve:
>>>>>>>>>>
>>>>>>>>>> - Better performance of NN & HDFS for production usage (read:
>>>>>>>>>>   production data I/O & not temp files)
>>>>>>>>>> - Faster application recovery in case of planned shutdown /
>>>>>>>>>>   unplanned restarts
>>>>>>>>>>
>>>>>>>>>> Please send your comments, suggestions or ideas.
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Aniruddha
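[Editor's note: for readers landing on this thread later, Sandesh's APEXCORE-283 pointer is the operative answer to Tushar's and Aniruddha's question about configuring the checkpoint location. As a hedged sketch only, under the assumption that the STORAGE_AGENT attribute and FSStorageAgent constructor behave as in the Apex API of that era, and with a hypothetical NAS mount path, a DAG-level setup might look like:]

```java
// Sketch: point operator checkpointing at a NAS-mounted path instead of
// the default application directory on HDFS. The path is hypothetical;
// consult APEXCORE-283 and the docs for your Apex version.
import com.datatorrent.api.Context.OperatorContext;
import com.datatorrent.api.DAG;
import com.datatorrent.api.StreamingApplication;
import com.datatorrent.common.util.FSStorageAgent;
import org.apache.hadoop.conf.Configuration;

public class NasCheckpointApp implements StreamingApplication {
  @Override
  public void populateDAG(DAG dag, Configuration conf) {
    // file:// scheme so the agent writes to the locally mounted
    // (NAS-backed) path on every node rather than to HDFS.
    dag.setAttribute(OperatorContext.STORAGE_AGENT,
        new FSStorageAgent("file:///mnt/ha-nas/apex-checkpoints", conf));
    // ... add operators and streams as usual ...
  }
}
```

The path must be mounted identically on all cluster nodes, or containers rescheduled elsewhere will not find their checkpoints; that is the "HA NAS" assumption in Aniruddha's proposal.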
