Bernd Mathiske created MESOS-1667:
-------------------------------------
Summary: Extract from URI while downloading into work dir
Key: MESOS-1667
URL: https://issues.apache.org/jira/browse/MESOS-1667
Project: Mesos
Issue Type: Improvement
Components: slave
Affects Versions: 0.20.0
Environment: Every
Reporter: Bernd Mathiske
When the fetcher downloads an extractable archive, e.g. a tar file, it
currently downloads it completely and only then starts extracting from it. But
only the end result is needed for execution. Thus the space used for the
downloaded copy of the archive is wasted. This can become critical in case of
large archives.
The general idea to solve this issue is to perform the extraction while
downloading, without storing intermediate results on disk. Possibly, this can
be achieved by arranging process pipes or by using some extraction library code
to stream the data through.
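To illustrate the library-based variant, here is a minimal sketch that uses
Python's standard tarfile module as a stand-in for whatever extraction code the
fetcher would actually use. The names extract_stream, OneWayReader and
make_sample_archive are hypothetical and only serve the illustration. The point
is that the archive is consumed strictly front to back, so it never has to
exist as a file in the work dir.

```python
import io
import tarfile


def extract_stream(fileobj, dest):
    """Extract a tar stream into dest without saving the archive itself.

    fileobj only needs a read(size) method, e.g. an open HTTP response.
    """
    # Mode "r|*" reads the archive sequentially (no seeking) and detects
    # the compression from the first block of data.
    with tarfile.open(fileobj=fileobj, mode="r|*") as archive:
        # NOTE: a real fetcher would have to validate member paths first;
        # this sketch trusts the archive.
        archive.extractall(path=dest)


class OneWayReader:
    """Hypothetical stand-in for a network stream: read() only, no seek()."""

    def __init__(self, data):
        self._buf = io.BytesIO(data)

    def read(self, size=-1):
        return self._buf.read(size)


def make_sample_archive():
    """Build a small gzip-compressed tar archive in memory for the demo."""
    payload = b"echo hi\n"
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as archive:
        info = tarfile.TarInfo(name="bin/run.sh")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```

In a real download, the object returned by urllib.request.urlopen(url) could
be passed as fileobj in place of OneWayReader. The process-pipe variant would
do the same thing with a downloader process writing into the standard input of
an extractor process.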
However, with in-stream extraction the archive itself is never stored, so it
may have to be downloaded again every time it is needed. By contrast, given an
existing (https://reviews.apache.org/r/21316/) but not yet committed patch for
MESOS-336, the fetcher cache could keep the archive and just repeat the
extraction, without downloading more than once. Thus choosing in-stream
extraction might result in an overall performance loss. We should therefore
give users extra options in CommandInfo.URI to choose how to handle this.
In some cases, it could be possible to reuse the extracted assets directly,
also forgoing the repeat extraction. This could be handled with symlinks. Then
extraction can happen during downloading and neither repeat downloading nor
repeat extraction occurs. The user has to be conscious of the safety issue,
though, that any post-extraction modifications to the downloaded assets are
visible to subsequent tasks. So, an explicit flag in CommandInfo.URI is called
for here, as well.
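A rough sketch of the symlink approach, again in Python and under assumed
names: cache_entry_for and link_assets are hypothetical, as is the layout of
one cache directory per URI keyed by a hash of the URI string. Nothing here
reflects how the fetcher cache in MESOS-336 is actually organized.

```python
import hashlib
import os
from pathlib import Path


def cache_entry_for(cache_root, uri):
    """Map a URI to its (assumed) directory of extracted assets."""
    digest = hashlib.sha256(uri.encode("utf-8")).hexdigest()[:16]
    return Path(cache_root) / digest


def link_assets(cache_root, uri, sandbox, name):
    """Expose already extracted assets in a task's work dir via a symlink.

    Nothing is downloaded or extracted here; the link points at the
    shared cache entry, so all tasks see the very same files.
    """
    entry = cache_entry_for(cache_root, uri)
    link = Path(sandbox) / name
    os.symlink(entry, link, target_is_directory=True)
    return link
```

Because every work dir links to the same cache entry, a file that one task
rewrites through its link is changed for all later tasks as well, which is the
safety issue described above and the reason for making this behavior opt-in.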
Ideally, this issue would be solved as a follow-up of MESOS-336, because some
of the described benefits depend on it.