On 06/12/2018 02:27 PM, Lukáš Doktor wrote:
> Hello guys,
> 
> As Cleber pointed out on the release meeting, we are struggling with our 
> asset fetcher. It was designed with one goal, to cache arbitrary files by 
> name and it fails when different asset use the same name as it simply assumes 
> they are the same thing.
> 
> Let's have a look at this example:
> 
>     fetch("https://www.example.org/foo.zip";)
>     fetch("https://www.example.org/bar.zip";)
>     fetch("http://www.example.org/foo.zip";)
> 
> Currently the third fetch simply uses the "foo.zip" downloaded from "https" 
> even though it could be a completely different file (or downloaded from 
> completely different url). This is good and bad. It's good when you're 
> downloading **any** "ltp.tar.bz2", or **any** "netperf.zip", but if you are 
> downloading "vmlinuz" which is always called "vmlinuz" but comes from a 
> different subdirectory, it might lead to big problems.
> 

Right, all your observations so far are correct.

> From this I can see two mods of assets, anonymous and specific. Instead of 
> trying to detect this based on combinations of hashes and methods, I'd 
> suggest being explicit and either add it as extra argument, or even create 
> new class `AnonymousAsset` and `SpecificAsset`, where `AnonymousAsset` would 
> be the current implementation and we still need to decide on `SpecificAsset` 
> implementation. Let's discuss some approaches and use following assets in 
> examples:

Given that the `fetch_asset()` is the one interface test writers are
encouraged to use, having different classes is really an implementation
detail at this point.  For this discussion, I do see value in naming the
different modes though.

> 
> 
> Current implementation
> ----------------------
> 
> Current implementation is Anonymous and the last one simply returns the 
> "foo.zip" fetched in first fetch.
> 
> Result:
> 
>     foo.zip
>     bar.zip    # This one is fetched from "https"
> 
> 
> + simplest
> - leads to clashes
> 
> 

No need to say that this is where we're moving away from.

> Hashed url dir
> --------------
> 
> I can see multiple options. Cleber proposed in 
> https://github.com/avocado-framework/avocado/pull/2652 to create in such case 
> dir based "hash(url)" and store all assets of given url there. It seems to be 
> fairly simple to develop and maintain, but the cache might become hard to 
> upkeep and there is non-zero possibility of clashes (but nearly limiting to 
> zero).
> 

Definitely my goals were to keep it simple, and to fix the one issue at
hand.  My motivations are based on the idea that Avocado should be able
to help test writers with common tasks in tests, but it's hardly going
to contain in itself full-blown and complete solutions to generic
problems, downloading and caching files.

Now, I'd be nice to describe the possible clashes (besides, of course
possible hash collisions).  I can think of one:

 * Anonymous with url "http://foo/bar.zip";
 * "Specific" with url "http://baz/4b3d163d2565966cf554ccb6e81f31f8695ceb54";

The clash exists because sha1("http://foo/bar.zip";) is
4b3d163d2565966cf554ccb6e81f31f8695ceb54.

> Another problem would be concurrent access as we might start downloading file 
> with the same name as url dir and all kind of different clashes and we'll 
> only find our all the issues when people start extensively using this.

I'm not sure I follow.  Right now the name of the *downloaded* files are
temporary names.  Then, a lock for the destination name is acquired,
verification is done, and if successful, copied into the destination name.

> 
> Result:
> 
>     2e3d2775159c4fbffa318aad8f9d33947a584a43/foo.zip    # Fetched from https
>     2e3d2775159c4fbffa318aad8f9d33947a584a43/bar.zip
>     6386f4b6490baddddf8540a3dbe65d0f301d0e50/foo.zip    # Fetched from http
> 
> + simple to develop
> + simple to maintain

Agreed.

> - possible clashes

Agreed, ideally there should be none.

> - hard to browse manually

True.  As a note, I'd trade the ease to browse the cache for a simpler
overall implementation, because the "browsing" could then be done with
simple scripts reusing the same implementation (like a contrib script
importing from "avocado.utils.asset").

> - API changes might lead to unusable files (users would have to remove files 
> manually)

This too, but I really think this is minor.  We may want to include the
cache stability along the same lines of the API stability (on LTS
releases) but I don't think we made promises about it yet.

> 
> 
> sqlite
> ------
> 
> Another approach would be to create sqlite database in every cache-dir. For 
> anonymous assets nothing would change, but for specific assets we'd create a 
> new tmpdir per given asset and store the mapping in the database.
> 
> Result:
> 
>     .avocado_cache.sqlite
>     foo-1X3s/foo.zip
>     bar-3s2a/bar.zip
>     foo-t3d2/foo.zip
> 
> where ".avocado_cache.sqlite" would contain:
> 
>     https://www.example.org/foo.zip  foo-1X3s/foo.zip
>     https://www.example.org/bar.zip  bar-3s2a/bar.zip
>     http://www.example.org/foo.zip   foo-t3d2/foo.zip
> 
> Obviously by creating a db we could improve many things. First example would 
> be to store expiration date and based on last access to db we could run 
> cache-dir upkeep, removing outdated assets.
> 
> Another improvement would be to store the downloaded asset hash and 
> re-download&update hash when the file was modified even when user didn't 
> provided hash.
> 
> + easy to browse manually
> + should be simple to expand the features (upkeep, security, ...)
> + should simplify locks as we can atomically move the downloaded file&update 
> db. Even crashes should lead to predictable behavior
> - slightly more complicated to develop
> - "db" file would have to be protected
> 
> 

I'm a bit skeptical about the "improvements" here, not because they are
not valid features, but because we don't need them at this point.  I
don't think future possibilities should shape too much of the immediate
architecture decisions.

> Other solutions
> ---------------
> 
> There are many other solutions like using `$basename-$url_hash` as the name 
> or using `astring.string_to_safe_path` instead of url_hash and so on. We are 
> open to suggestions.
> 
> 

I think the one big problem we have identified is the clash between
"anonymous" and "specific" asset.  My proposal to fix that would be to
define the following structure:

 base_writable_cache_dir
   ├── by_name    <- "specific" verified files
   ├── by_origin  <- "anonymous" verified files
   └── download   <- temporary base dir for downloads

> Questions
> =========
> 
> There are basically two questions:
> 
> 1. Do we want to explicitly set the mode (anonymous/specific), in which way 
> and how to call them

As an API designed, I don't think that difference is worth being
apparent to users.  As a user, I'd expect to either pass the name of the
file, or one specific URL, with or without a verification hash.

> 2. Which implementation we want to use (are there existing solutions we can 
> simply use?)
> 

I've did some research, and I don't ready to use code we could
incorporate.  At this point, I'd choose the simplest solution we can
write, that solves the issue at hand.

-- 
Cleber Rosa
[ Sr Software Engineer - Virtualization Team - Red Hat ]
[ Avocado Test Framework - avocado-framework.github.io ]
[  7ABB 96EB 8B46 B94D 5E0F  E9BB 657E 8D33 A5F2 09F3  ]

Reply via email to