> I think not being able to isolate all kinds of memory does not mean we
> should give up the isolation on all kinds of memory. And I believe
> "managed memory is isolated and others are not" is much easier for the
> users to understand compared to "part of the managed memory is isolated
> and others are not".

It looks like users cannot infer that managed memory has the isolation
property: neither the documentation [1] nor the FLIP mentions it. I guess
this is because isolation is not its most important property (see below).

An explicit option would be self-documenting and would let users know
which memory is shared and which isn't.
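For illustration, a minimal flink-conf.yaml sketch using the option names
discussed in this thread (the shared-fraction key is the one proposed here,
not an existing option, and its exact name/default may still change):

    taskmanager.memory.managed.size: 1gb
    # proposed option, 0 by default; > 0 explicitly sets aside a part of
    # managed memory for cross-slot sharing
    taskmanager.memory.managed.shared-fraction: 0.25

A reader of such a config can see at a glance how much memory is shared
across slots and how much stays isolated.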

From my perspective, the most important property is the shared budget,
which avoids:
1) OOM errors when there are too many consumers (i.e. tasks)
2) manual calculation of memory for each type of consumer

Both of these properties are desirable for shared and non-shared memory
alike, as is not wasting memory when there is no consumer, as you described.
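To make the shared-budget point concrete, a rough back-of-envelope example
(the numbers are illustrative; 0.4 is the current default of
taskmanager.memory.managed.fraction):

    total Flink memory : 4gb
    managed memory     : 4gb * 0.4 = 1.6gb  (one budget, nothing sized per consumer)
    per slot (4 slots) : 1.6gb / 4 = 400mb  (RocksDB / Python split it by consumer weights)

Each consumer adapts to whatever share it gets, so neither over-commit OOMs
nor manual per-consumer sizing come up.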

OTOH, using *unmanaged* shared memory for RocksDB implies (rough sketch
below):
- there's one more memory type, in addition to the 8 existing types [2]
- its size needs to be calculated manually
- Flink code needs to take this into account and "adjust" the weights
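As a sketch of what that would ask from the user, taking the 750Mb from the
example further down in the thread (the first key is purely hypothetical,
only to illustrate the manual sizing; the second is an existing option whose
default is 128mb, bumped as suggested in the quoted reply so the shared
memory is accounted for somewhere):

    # hypothetical option of the unmanaged state backend variant, sized by hand
    state.backend.rocksdb.shared-memory.size: 750mb
    # existing option: 128mb default + the 750mb shared part
    taskmanager.memory.framework.off-heap.size: 878mb

i.e. the user ends up doing the calculation that managed memory would
otherwise do for them.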

Having said that, I'd be fine with either approach as both seem to have
pros and cons.

What do you think?

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model

Regards,
Roman


On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com> wrote:

> Concerning isolation, I think ideally we want everything to be isolated
> between jobs running in the same cluster (i.e., slots in the same TM).
> Unfortunately, this is impractical.
> - Heap / Off-heap memory are directly allocated / deallocated through JVM /
> OS. Flink does not have a good way to cap their usages per slot.
> - Network memory does not have the good property of managed memory, that a
> job can adapt to any given amount of managed memory (with a very small min
> limitation). We are trying to improve network memory towards that
> direction, and once achieved it can be isolated as well.
> I think not being able to isolate all kinds of memory does not mean we
> should give up the isolation on all kinds of memory. And I believe "managed
> memory is isolated and others are not" is much easier for the users to
> understand compared to "part of the managed memory is isolated and others
> are not".
>
> By waste, I meant reserving a certain amount of memory that is only used by
> certain use cases that do not always exist. This is exactly what we want to
> avoid with managed memory in FLIP-49 [1]. We used to have managed memory
> only used for batch operators, and a containerized-cut-off memory
> (something similar to framework.off-heap) for rocksdb state backend. The
> problem was that, if the user does not change the configuration when
> switching between streaming / batch jobs, there would always be some memory
> (managed or cut-off) wasted. Similarly, introducing a shared managed memory
> zone means reserving one more dedicated part of memory that can get wasted
> in many cases. This is probably a necessary price for this new feature, but
> let's not break the concept / properties of managed memory for it.
>
> In your proposal, the fraction for the share managed memory is by default
> 0. That means to enable the rocksdb memory sharing, users need to manually
> increase the fraction anyway. Thus, having the memory sharing rocksdb use
> managed memory or off-heap memory does not make a significant difference
> for the new feature users. I'd think of this as "extra operational overhead
> for users of a certain new feature" vs. "significant learning cost and
> potential behavior change for pretty much all users". I'd be fine with
> having some shortcuts to simplify the configuration on the user side for
> this new feature, but not to invade the managed memory.
>
> Best,
>
> Xintong
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>
> On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
> > Thanks for your reply Xingong Song,
> >
> > Could you please elaborate on the following:
> >
> > > The proposed changes break several good properties that we designed for
> > > managed memory.
> > > 1. It's isolated across slots
> > Just to clarify, any way to manage the memory efficiently while capping
> its
> > usage
> > will break the isolation. It's just a matter of whether it's managed
> memory
> > or not.
> > Do you see any reasons why unmanaged memory can be shared, and managed
> > memory can not?
> >
> > > 2. It should never be wasted (unless there's nothing in the job that
> > needs
> > > managed memory)
> > If I understand correctly, the managed memory can already be wasted
> because
> > it is divided evenly between slots, regardless of the existence of its
> > consumers in a particular slot.
> > And in general, even if every slot has RocksDB / python, it's not
> > guaranteed equal consumption.
> > So this property would rather be fixed in the current proposal.
> >
> > > In addition, it further complicates configuration / computation logics
> of
> > > managed memory.
> > I think having multiple options overriding each other increases the
> > complexity for the user. As for the computation, I think it's desirable
> to
> > let Flink do it rather than users.
> >
> > Both approaches need some help from TM for:
> > - storing the shared resources (static field in a class might be too
> > dangerous because if the backend is loaded by the user-class-loader then
> > memory will leak silently).
> > - reading the configuration
> >
> > Regards,
> > Roman
> >
> >
> > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > I like the idea of sharing RocksDB memory across slots. However, I'm
> > quite
> > > concerned by the current proposed approach.
> > >
> > > The proposed changes break several good properties that we designed for
> > > managed memory.
> > > 1. It's isolated across slots
> > > 2. It should never be wasted (unless there's nothing in the job that
> > needs
> > > managed memory)
> > > In addition, it further complicates configuration / computation logics
> of
> > > managed memory.
> > >
> > > As an alternative, I'd suggest introducing a variant of
> > > RocksDBStateBackend, that shares memory across slots and does not use
> > > managed memory. This basically means the shared memory is not
> considered
> > as
> > > part of managed memory. For users of this new feature, they would need
> to
> > > configure how much memory the variant state backend should use, and
> > > probably also a larger framework-off-heap / jvm-overhead memory. The
> > latter
> > > might require a bit extra user effort compared to the current approach,
> > but
> > > would avoid significant complexity in the managed memory configuration
> > and
> > > calculation logics which affects broader users.
> > >
> > >
> > > Best,
> > >
> > > Xintong
> > >
> > >
> > >
> > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> > > wrote:
> > >
> > > > Hi John, Yun,
> > > >
> > > > Thank you for your feedback
> > > >
> > > > @John
> > > >
> > > > > It seems like operators would either choose isolation for the
> > cluster’s
> > > > jobs
> > > > > or they would want to share the memory between jobs.
> > > > > I’m not sure I see the motivation to reserve only part of the
> memory
> > > for
> > > > sharing
> > > > > and allowing jobs to choose whether they will share or be isolated.
> > > >
> > > > I see two related questions here:
> > > >
> > > > 1) Whether to allow mixed workloads within the same cluster.
> > > > I agree that most likely all the jobs will have the same "sharing"
> > > > requirement.
> > > > So we can drop "state.backend.memory.share-scope" from the proposal.
> > > >
> > > > 2) Whether to allow different memory consumers to use shared or
> > exclusive
> > > > memory.
> > > > Currently, only RocksDB is proposed to use shared memory. For python,
> > > it's
> > > > non-trivial because it is job-specific.
> > > > So we have to partition managed memory into shared/exclusive and
> > > therefore
> > > > can NOT replace "taskmanager.memory.managed.shared-fraction" with
> some
> > > > boolean flag.
> > > >
> > > > I think your question was about (1), just wanted to clarify why the
> > > > shared-fraction is needed.
> > > >
> > > > @Yun
> > > >
> > > > > I am just curious whether this could really bring benefits to our
> > users
> > > > with such complex configuration logic.
> > > > I agree, and configuration complexity seems a common concern.
> > > > I hope that removing "state.backend.memory.share-scope" (as proposed
> > > above)
> > > > reduces the complexity.
> > > > Please share any ideas of how to simplify it further.
> > > >
> > > > > Could you share some real experimental results?
> > > > I did an experiment to verify that the approach is feasible,
> > > > i.e. multilple jobs can share the same memory/block cache.
> > > > But I guess that's not what you mean here? Do you have any
> experiments
> > in
> > > > mind?
> > > >
> > > > > BTW, as talked before, I am not sure whether different lifecycles
> of
> > > > RocksDB state-backends
> > > > > would affect the memory usage of block cache & write buffer manager
> > in
> > > > RocksDB.
> > > > > Currently, all instances would start and destroy nearly
> > simultaneously,
> > > > > this would change after we introduce this feature with jobs running
> > at
> > > > different scheduler times.
> > > > IIUC, the concern is that closing a RocksDB instance might close the
> > > > BlockCache.
> > > > I checked that manually and it seems to work as expected.
> > > > And I think that would contradict the sharing concept, as described
> in
> > > the
> > > > documentation [1].
> > > >
> > > > [1]
> > > > https://github.com/facebook/rocksdb/wiki/Block-Cache
> > > >
> > > > Regards,
> > > > Roman
> > > >
> > > >
> > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com>
> wrote:
> > > >
> > > > > Hi Roman,
> > > > > Thanks for the proposal, this allows State Backend to make better
> use
> > > of
> > > > > memory.
> > > > >
> > > > > After reading the ticket, I'm curious about some points:
> > > > >
> > > > > 1. Is shared-memory only for the state backend? If both
> > > > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > > > "state.backend.rocksdb.memory.managed: false" are set at the same
> > time,
> > > > > will the shared-memory be wasted?
> > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of
> unmanaged
> > > > memory
> > > > > and will compete with each other" in the example, how is the memory
> > > size
> > > > of
> > > > > unmanaged part calculated?
> > > > > 3. For fine-grained-resource-management, the control
> > > > > of cpuCores, taskHeapMemory can still work, right?  And I am a
> little
> > > > > worried that too many memory-about configuration options are
> > > complicated
> > > > > for users to understand.
> > > > >
> > > > > Regards,
> > > > > Yanfei
> > > > >
> > > > > Roman Khachatryan <ro...@apache.org> 于2022年11月8日周二 23:22写道:
> > > > >
> > > > > > Hi everyone,
> > > > > >
> > > > > > I'd like to discuss sharing RocksDB memory across slots as
> proposed
> > > in
> > > > > > FLINK-29928 [1].
> > > > > >
> > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > - share these objects among RocksDB instances of the same slot
> > > > > > - bound the total memory usage by all RocksDB instances of a TM
> > > > > >
> > > > > > However, the memory is divided between the slots equally (unless
> > > using
> > > > > > fine-grained resource control). This is sub-optimal if some slots
> > > > contain
> > > > > > more memory intensive tasks than the others.
> > > > > > Using fine-grained resource control is also often not an option
> > > because
> > > > > the
> > > > > > workload might not be known in advance.
> > > > > >
> > > > > > The proposal is to widen the scope of sharing memory to TM, so
> that
> > > it
> > > > > can
> > > > > > be shared across all RocksDB instances of that TM. That would
> > reduce
> > > > the
> > > > > > overall memory consumption in exchange for resource isolation.
> > > > > >
> > > > > > Please see FLINK-29928 [1] for more details.
> > > > > >
> > > > > > Looking forward to feedback on that proposal.
> > > > > >
> > > > > > [1]
> > > > > > https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > [2]
> > > > > > https://issues.apache.org/jira/browse/FLINK-7289
> > > > > >
> > > > > > Regards,
> > > > > > Roman
> > > > > >
> > > > >
> > > >
> > >
> >
>
