Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-08-23 Thread Chun-Hung Hsiao
I'm a bit concerned about the recovery logic and backward compatibility:
The changes we're making shouldn't affect existing users,
and we should try hard to avoid any future backward compatibility problem.

Say if there is already some custom framework using task ID 'Hello%3AWorld',
then if we blindly decode the task path during recovery, we will get the
wrong ID 'Hello:World'.
On the other hand, if we don't decode the task path during recovery,
then later on during checkpointing for the same task,
we shouldn't blindly encode the task ID, because it might create a
different path,
and we'll need to introduce some migration code to avoid such duplication.

Fortunately, we do checkpoint the executor IDs and task IDs in the info
files under the meta dir.
So I'm considering the following design to minimize the backward
compatibility issue we might have:
During recovery, we don't decode the recovered task path,
but get the executor/task ID from the info file instead of relying on
parsing the executor/task path.
When checkpointing, we only encode executor/task IDs if they contain
reserved characters.
The set of reserved characters could be defined as a platform-dependent
variable,
similar to what we have done for `PATH_SEPARATOR`.

The above design would look a bit more complicated then just blindly
applying percent encoding
to when constructing checkpoint paths, but it doesn't require extra
checkpoint migration logic,
and would keep the exact same behavior we have now for "normal"
executor/task IDs.

What did you guys think? Please feel free to raise any concern :)
And we don't need to implement the whole thing for now.
For example, we could start with just dealing with colons,
and extend the implementation later on,
as long as the partial solution we're going to have right now doesn't
create future tech debts!

Best,
Chun-Hung

On Thu, Aug 23, 2018 at 1:42 PM Greg Mann  wrote:

> Thanks Andy! Responses inlined below.
>
>
>
>> No: As the only character we've run into a problem with is `:`
>> (MESOS-9109), it might not be worth it to generalize this to solve a bunch
>> of problems that we haven't encountered.
>>
>>
> It's true that I'm not aware of other scenarios where
> filesystem-disallowed characters in task/executor IDs have caused issues
> for users, and this issue has existed for a long time. However, when
> feasible I would like to fix issues that we're aware of before they cause
> problems for users, rather than after. I would suggest that since we have
> one compelling case that we need to address now, it's worth formulating an
> approach for the general case, so that we can be sure any current work
> doesn't get in our way later on.
>
>
>> I'm somewhat comfortable doing so only for Windows, as we don't really
>> need to worry about the recovery scenario; but very uncomfortable about
>> doing so for Linux etc., for precisely that reason.
>>
>> So expanding this is definitely up for debate; but we must fix the bug
>> with `:`.
>>
>>
> Indeed, addressing the general case may prove to be much more complex - I
> can certainly identify with this situation, where a fix for a smaller issue
> turns into a big project :)
> It may turn out to be possible to implement a scoped-down solution for the
> colon case now, and extend it later on. I think it would be good if we
> could at least get an idea of how we want to handle the general case now,
> so that any short-term solutions can be a constructive step toward the
> long-term.
>
> Cheers,
> G
>


Re: Follow up to discussion regarding use : in paths on Windows (MESOS-9109)

2018-08-23 Thread Greg Mann
Thanks Andy! Responses inlined below.



> No: As the only character we've run into a problem with is `:`
> (MESOS-9109), it might not be worth it to generalize this to solve a bunch
> of problems that we haven't encountered.
>
>
It's true that I'm not aware of other scenarios where filesystem-disallowed
characters in task/executor IDs have caused issues for users, and this
issue has existed for a long time. However, when feasible I would like to
fix issues that we're aware of before they cause problems for users, rather
than after. I would suggest that since we have one compelling case that we
need to address now, it's worth formulating an approach for the general
case, so that we can be sure any current work doesn't get in our way later
on.


> I'm somewhat comfortable doing so only for Windows, as we don't really
> need to worry about the recovery scenario; but very uncomfortable about
> doing so for Linux etc., for precisely that reason.
>
> So expanding this is definitely up for debate; but we must fix the bug
> with `:`.
>
>
Indeed, addressing the general case may prove to be much more complex - I
can certainly identify with this situation, where a fix for a smaller issue
turns into a big project :)
It may turn out to be possible to implement a scoped-down solution for the
colon case now, and extend it later on. I think it would be good if we
could at least get an idea of how we want to handle the general case now,
so that any short-term solutions can be a constructive step toward the
long-term.

Cheers,
G


[Containerization] Mesos CNI support

2018-08-23 Thread Jie Yu
Hi,

If you are not using Mesos CNI integration, please ignore this email.

Recently, we discovered a few bugs related to our CNI integration when
there are lots of containers (200+) on a single agent box. Please see the
following tickets for details.

MESOS-9125 : Port mapper
CNI plugin might fail with "Resource temporarily unavailable
MESOS-9127 : Port mapper
CNI plugin might deadlock iptables on the agent
MESOS-9142 : CNI detach
might fail due to missing network config file

We fixed all of them, and backported the fixes to the currently maintained
patch releases. Please refer to the corresponding tickets for the fix
versions.

Also, if you're using any CNI plugin that depends on the host local IPAM
,
for example, the bridge plugin
, please make sure the host
local IPAM's `dataDir`

points to a tmpfs which gets cleaned up when reboot happens. Otherwise, the
IP address will be leaked, and you will hit IP allocation failure if your
box has constant reboots.

Please reach out if you have any question!

- Jie


[Containerization WG] Today's WG Sync

2018-08-23 Thread Gilbert Song
Hi folks,

I apologize that I did not have a chance to confirm the agenda with
proposers yesterday. Feel free to add items that you would like to discuss
to the agenda:
https://docs.google.com/document/d/1z55a7tLZFoRWVuUxz1FZwgxkHeugtc2nHR89skFXSpU/edit#heading=h.smrrkri8x6b9

The WG meeting will start at 9 am PST. See you there!

Cheers,
Gilbert