Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:29 PM, Michał Górny wrote:
> On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> > On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> > > > Hello,
> > > > 
> > > > I have some runnable pseudocode outlining a faster tree verification
> > > > algorithm.
> > > > Before I create patches I'd like to see if there is any guidance on
> > > > making the
> > > > changes as unobtrusive as possible. If the radical change in algorithm
> > > > is
> > > > acceptable I can work on adding the changes.
> > > > 
> > > > Instead of composing any kind of structured data out of the portage
> > > > tree my
> > > > algorithm just lists all files and then optionally batches them out to
> > > > threads.
> > > > There is a noticeable speedup by eliding the tree traversal operations
> > > > which
> > > > can be seen when running the algorithm with a single thread and
> > > > comparing it to
> > > > the current algorithm in gemato (which should still be discussed
> > > > here?).
> > > 
> > > Without reading the code: does your algorithm correctly detect extraneous 
> > > files?
> > > 
> > 
> > Yes and no.
> > 
> > I am not sure why this is necessary. If the file does not appear in a 
> > manifest it is
> > ignored. It makes the most sense to me to put the burden of not including
> > untracked files on the publisher. If the user puts an untracked file into 
> > the tree it
> > will be ignored to no consequence; the authored files don't refer to it, 
> > after all.
> 
> This is necessary because a malicious third party can MITM you an rsync
> tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
> things on your system.  If you don't reject files not in Manifest, you
> open a huge security hole.
> 

Ok, I will refer to https://www.gentoo.org/glep/glep-0074.html and implement the
checks in detail, but will still need to spend some time looking for the best 
place
to insert the code.

I think it best to address this from two fronts. On one hand rejecting extra 
files
seems to have immediate benefit but the larger issue is portage exposing
untracked potentially malicious files to the user.

Has anything like a verity loopback filesystem been explored? It might reduce
duplication of work.

> > But it would be easy enough to build a second list of all files and compare 
> > it to
> > the list of files built from the manifests. If there are extras an error 
> > can be
> > generated. This is actually the first test I did on my manifest parsing 
> > code. I tried
> > to see if my tracked files roughly matched the total files in tree. That 
> > can be
> > repurposed for this check.
> > 
> > > > Some simple tests like counting all objects traversed and verified
> > > > returns the
> > > > same(ish). Once it is put into portage it could be tested in detail.
> > > > 
> > > > There is also my partial attempt at removing the brittle interface to
> > > > GnuPG
> > > > (it's not as if the current code is badly designed, just that parsing
> > > > the
> > > > output of GnuPG directly is likely not the best idea).
> > > 
> > > The 'brittle interface' is well-defined machine-readable output.
> > > 
> > 
> > Ok. I was aware there was a machine interface, but the classes that 
> > manipulate
> > a temporary GPG home seemed like not the best solution. I guess that is all
> > due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> > directory structure.
> 
> A temporary home directory guarantees that user configuration does not
> affect the verification result.
> 

Yes, I know why it is there. The temporary construction of the directory is what
stood out to me as messy but I guess there is no way around it.

> > 
> > > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > > because
> > > > I took inspiration from the existing gemato classes.
> > > 
> > > The code makes a lot of brittle assumptions about the structure. The 
> > > GLEP was specifically designed to avoid that and let us adjust the 
> > > structure in the future to meet our needs.
> > > 
> > 
> > These same assumptions are built into the code that operates on the
> > tree structure. If the GLEP were changed the existing code would also
> > potentially need changing. This code just uses the structure in a different
> > way.
> > 
> 
> The code that predates the GLEP, yes.  It will eventually be changed to
> be more flexible, especially when we can assume that we start removing
> backwards compatibility.
> 

My comment was more on the general nature of the code having to track
what is intended. Even if parsing code (involving the filesystem?) was
generated from a specification now the specification is "code" and still may
not reflect what is intended.

Taking liberties with the programmatic representation of the manifests
seems appropriate. I also don't see anything in the GLEP that dictates an
in-memory representation (and such a thing would 

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
On Tue, 2020-06-30 at 12:50 -0500, Sid Spry wrote:
> On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> > Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> > > Hello,
> > > 
> > > I have some runnable pseudocode outlining a faster tree verification
> > > algorithm.
> > > Before I create patches I'd like to see if there is any guidance on
> > > making the
> > > changes as unobtrusive as possible. If the radical change in algorithm
> > > is
> > > acceptable I can work on adding the changes.
> > > 
> > > Instead of composing any kind of structured data out of the portage
> > > tree my
> > > algorithm just lists all files and then optionally batches them out to
> > > threads.
> > > There is a noticeable speedup by eliding the tree traversal operations
> > > which
> > > can be seen when running the algorithm with a single thread and
> > > comparing it to
> > > the current algorithm in gemato (which should still be discussed
> > > here?).
> > 
> > Without reading the code: does your algorithm correctly detect extraneous 
> > files?
> > 
> 
> Yes and no.
> 
> I am not sure why this is necessary. If the file does not appear in a 
> manifest it is
> ignored. It makes the most sense to me to put the burden of not including
> untracked files on the publisher. If the user puts an untracked file into the 
> tree it
> will be ignored to no consequence; the authored files don't refer to it, 
> after all.

This is necessary because a malicious third party can MITM you an rsync
tree with extraneous files (say, -r1 baselayout ebuild) that do horrible
things on your system.  If you don't reject files not in Manifest, you
open a huge security hole.

> But it would be easy enough to build a second list of all files and compare 
> it to
> the list of files built from the manifests. If there are extras an error can 
> be
> generated. This is actually the first test I did on my manifest parsing code. 
> I tried
> to see if my tracked files roughly matched the total files in tree. That can 
> be
> repurposed for this check.
> 
> > > Some simple tests like counting all objects traversed and verified
> > > returns the
> > > same(ish). Once it is put into portage it could be tested in detail.
> > > 
> > > There is also my partial attempt at removing the brittle interface to
> > > GnuPG
> > > (it's not as if the current code is badly designed, just that parsing
> > > the
> > > output of GnuPG directly is likely not the best idea).
> > 
> > The 'brittle interface' is well-defined machine-readable output.
> > 
> 
> Ok. I was aware there was a machine interface, but the classes that manipulate
> a temporary GPG home seemed like not the best solution. I guess that is all
> due to GPG assuming everything is in ~/.gnupg and keeping its state as a
> directory structure.

A temporary home directory guarantees that user configuration does not
affect the verification result.

> 
> > > Needs gemato, dnspython, and requests. Slightly better than random code
> > > because
> > > I took inspiration from the existing gemato classes.
> > 
> > The code makes a lot of brittle assumptions about the structure. The 
> > GLEP was specifically designed to avoid that and let us adjust the 
> > structure in the future to meet our needs.
> > 
> 
> These same assumptions are built into the code that operates on the
> tree structure. If the GLEP were changed the existing code would also
> potentially need changing. This code just uses the structure in a different
> way.
> 

The code that predates the GLEP, yes.  It will eventually be changed to
be more flexible, especially when we can assume that we start removing
backwards compatibility.

-- 
Best regards,
Michał Górny



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-portage-dev] Re: Speeding up Tree Verification

2020-06-30 Thread Zac Medico
On 6/30/20 10:29 AM, Sid Spry wrote:
> On Mon, Jun 29, 2020, at 9:34 PM, Zac Medico wrote:
>> On 6/29/20 7:15 PM, Sid Spry wrote:
>>> On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
 Hello,

 I have some runnable pseudocode outlining a faster tree verification 
 algorithm.
>>>
>>> Ah, right. It's worth noting that even faster than this algorithm is simply 
>>> verifying
>>> a .tar.xz. Is that totally off the table? I realize it doesn't fit every 
>>> usecase, but it
>>> seems to be faster in both sync and verification time.
>>
>> We've already got support for that with sync-type = webrsync. However, I
>> imagine sync-type = git is even better. All of the types are covered here:
>>
>> https://wiki.gentoo.org/wiki/Portage_Security
> 
> I'm being warned right now that webrsync-gpg is being deprecated; I've been 
> using
> it. It is, amazingly, faster than a typical rsync and may be faster than a 
> git pull though.

Yeah webrsync-gpg is deprecated but the replacement is sync-type =
webrsync and verification is enabled by default for that sync-type.
-- 
Thanks,
Zac



signature.asc
Description: OpenPGP digital signature


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 1:20 AM, Fabian Groffen wrote:
> Hi,
> 
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> > 
> > I have some runnable pseudocode outlining a faster tree verification 
> > algorithm.
> > Before I create patches I'd like to see if there is any guidance on making 
> > the
> > changes as unobtrusive as possible. If the radical change in algorithm is
> > acceptable I can work on adding the changes.
> > 
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to 
> > threads.
> > There is a noticeable speedup by eliding the tree traversal operations which
> > can be seen when running the algorithm with a single thread and comparing 
> > it to
> > the current algorithm in gemato (which should still be discussed here?).
> 
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
> 

I think this is an argument for cgroups limits support on the portage process or
account as opposed to an argument against picking a better algorithm. That is
something I have been working towards, but I am only one man.

> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
> 

I ran all tests multiple times to make them warm off of a Samsung SSD, but
nothing very precise yet.

% gemato verify --openpgp-key signkey.asc /var/db/repos/gentoo
[...]
INFO:root:Verifying /var/db/repos/gentoo...
INFO:root:/var/db/repos/gentoo verified in 16.45 seconds

sometimes going higher, closer to 18s, vs.

% ./veriftree.py
4.763171965983929

So roughly an order of magnitude speedup without batching to threads.

> You could compare against qmanifest, which uses OpenMP-based
> paralllelism while verifying the tree.  On SSDs this does help.
> 

I lost my notes -- how do I specify to either gemato or qmanifest the GnuPG
directory? My code is partially structured as it is because I had problems doing
this. I rediscovered -K/--openpgp-key in gemato but am unsure for qmanifest.



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Tue, Jun 30, 2020, at 2:28 AM, Michał Górny wrote:
> Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
> >Hello,
> >
> >I have some runnable pseudocode outlining a faster tree verification
> >algorithm.
> >Before I create patches I'd like to see if there is any guidance on
> >making the
> >changes as unobtrusive as possible. If the radical change in algorithm
> >is
> >acceptable I can work on adding the changes.
> >
> >Instead of composing any kind of structured data out of the portage
> >tree my
> >algorithm just lists all files and then optionally batches them out to
> >threads.
> >There is a noticeable speedup by eliding the tree traversal operations
> >which
> >can be seen when running the algorithm with a single thread and
> >comparing it to
> >the current algorithm in gemato (which should still be discussed
> >here?).
> 
> Without reading the code: does your algorithm correctly detect extraneous 
> files?
> 

Yes and no.

I am not sure why this is necessary. If the file does not appear in a manifest 
it is
ignored. It makes the most sense to me to put the burden of not including
untracked files on the publisher. If the user puts an untracked file into the 
tree it
will be ignored to no consequence; the authored files don't refer to it, after 
all.

But it would be easy enough to build a second list of all files and compare it 
to
the list of files built from the manifests. If there are extras an error can be
generated. This is actually the first test I did on my manifest parsing code. I 
tried
to see if my tracked files roughly matched the total files in tree. That can be
repurposed for this check.

> >Some simple tests like counting all objects traversed and verified
> >returns the
> >same(ish). Once it is put into portage it could be tested in detail.
> >
> >There is also my partial attempt at removing the brittle interface to
> >GnuPG
> >(it's not as if the current code is badly designed, just that parsing
> >the
> >output of GnuPG directly is likely not the best idea).
> 
> The 'brittle interface' is well-defined machine-readable output.
>

Ok. I was aware there was a machine interface, but the classes that manipulate
a temporary GPG home seemed like not the best solution. I guess that is all
due to GPG assuming everything is in ~/.gnupg and keeping its state as a
directory structure.

> >
> >Needs gemato, dnspython, and requests. Slightly better than random code
> >because
> >I took inspiration from the existing gemato classes.
> 
> The code makes a lot of brittle assumptions about the structure. The 
> GLEP was specifically designed to avoid that and let us adjust the 
> structure in the future to meet our needs.
> 

These same assumptions are built into the code that operates on the
tree structure. If the GLEP were changed the existing code would also
potentially need changing. This code just uses the structure in a different
way.

I will admit my partial understanding of the entire GLEP. I made some
simplifications just to get something demonstrable done. However, please
consider removing or putting some of the checks elsewhere. I don't have
full suggestions right now, but there is the possibility of saving an
appreciable amount of time.



Re: [gentoo-portage-dev] Re: Speeding up Tree Verification

2020-06-30 Thread Sid Spry
On Mon, Jun 29, 2020, at 9:34 PM, Zac Medico wrote:
> On 6/29/20 7:15 PM, Sid Spry wrote:
> > On Mon, Jun 29, 2020, at 9:13 PM, Sid Spry wrote:
> >> Hello,
> >>
> >> I have some runnable pseudocode outlining a faster tree verification 
> >> algorithm.
> > 
> > Ah, right. It's worth noting that even faster than this algorithm is simply 
> > verifying
> > a .tar.xz. Is that totally off the table? I realize it doesn't fit every 
> > usecase, but it
> > seems to be faster in both sync and verification time.
> 
> We've already got support for that with sync-type = webrsync. However, I
> imagine sync-type = git is even better. All of the types are covered here:
> 
> https://wiki.gentoo.org/wiki/Portage_Security

I'm being warned right now that webrsync-gpg is being deprecated; I've been 
using
it. It is, amazingly, faster than a typical rsync and may be faster than a git 
pull though.

The issue with git is there are some analyses that indicate you shouldn't rely 
on git
for integrity, so you are back to verifying the tree on-disk, which is slower 
than
verifying the .tar.xz.

(To clarify: Even with signed commits the commit hashes could be attacked and 
this
is considered somewhat feasible.)



Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Pacho Ramos
El mar, 30-06-2020 a las 08:20 +0200, Fabian Groffen escribió:
> Hi,
> 
> On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> > Hello,
> > 
> > I have some runnable pseudocode outlining a faster tree verification
> > algorithm.
> > Before I create patches I'd like to see if there is any guidance on making
> > the
> > changes as unobtrusive as possible. If the radical change in algorithm is
> > acceptable I can work on adding the changes.
> > 
> > Instead of composing any kind of structured data out of the portage tree my
> > algorithm just lists all files and then optionally batches them out to
> > threads.
> > There is a noticeable speedup by eliding the tree traversal operations which
> > can be seen when running the algorithm with a single thread and comparing it
> > to
> > the current algorithm in gemato (which should still be discussed here?).
> 
> I remember something that gemato used to use multiple threads, but
> because it totally saturated disk-IO, it was brought back to a single
> thread.  People were complaining about unusable systems.
> 
> In any case, can you share your performance results?  What speedup did
> you see, on warm and hot FS caches?  Which type of disk do you use?
> 
> You could compare against qmanifest, which uses OpenMP-based
> paralllelism while verifying the tree.  On SSDs this does help.
> 
> Thanks,
> Fabian

I am talking only based on my own experience but, at least on my case, there are
noticeable differences between I/O schedulers. That is something that maybe
could affect that tests... with latest kernels we only have noop, deadline
(default) and bfq... in my case bfq works far better for keeping the system
responsible (on both, rotational and SSDs). But you can test different
combinations to see the one that fits better for you.



signature.asc
Description: This is a digitally signed message part


Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
Dnia June 30, 2020 6:20:37 AM UTC, Fabian Groffen  
napisał(a):
>Hi,
>
>On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
>> Hello,
>> 
>> I have some runnable pseudocode outlining a faster tree verification
>algorithm.
>> Before I create patches I'd like to see if there is any guidance on
>making the
>> changes as unobtrusive as possible. If the radical change in
>algorithm is
>> acceptable I can work on adding the changes.
>> 
>> Instead of composing any kind of structured data out of the portage
>tree my
>> algorithm just lists all files and then optionally batches them out
>to threads.
>> There is a noticeable speedup by eliding the tree traversal
>operations which
>> can be seen when running the algorithm with a single thread and
>comparing it to
>> the current algorithm in gemato (which should still be discussed
>here?).
>
>I remember something that gemato used to use multiple threads, but
>because it totally saturated disk-IO, it was brought back to a single
>thread.  People were complaining about unusable systems.

No, it gave significant speedup even on spinning HDDs. However, it hang on some 
people due to some bug that I couldn't reproduce.

>
>In any case, can you share your performance results?  What speedup did
>you see, on warm and hot FS caches?  Which type of disk do you use?
>
>You could compare against qmanifest, which uses OpenMP-based
>paralllelism while verifying the tree.  On SSDs this does help.
>
>Thanks,
>Fabian
>
>> 
>> Some simple tests like counting all objects traversed and verified
>returns the
>> same(ish). Once it is put into portage it could be tested in detail.
>> 
>> There is also my partial attempt at removing the brittle interface to
>GnuPG
>> (it's not as if the current code is badly designed, just that parsing
>the
>> output of GnuPG directly is likely not the best idea).
>> 
>> Needs gemato, dnspython, and requests. Slightly better than random
>code because
>> I took inspiration from the existing gemato classes.
>> 
>> ```python (veriftree.py)
>> #!/usr/bin/env python3
>> import os, sys, zlib, hashlib, tempfile, shutil, timeit
>> import subprocess
>> from typing import List
>> from pprint import pprint
>> 
>> from gemato.manifest import (
>> ManifestFile,
>> ManifestFileEntry,
>> )
>> from wkd import (
>> check_domain_signature,
>> hash_localpart,
>> build_web_key_uri,
>> stream_to_file
>> )
>> from fetchmedia import (
>> OpenPGPEnvironment,
>> setup_verification_environment
>> )
>> 
>> # 0. Top level directory (repository) contains Manifest, a PGP
>signature of
>> #blake2b and sha512 hashes of Manifest.files.gz.
>> # 1. Manifest.files contains hashes of each category Manifest.gz.
>> # 2. The category Manifest contains hashes of each package Manifest.
>> # 3. The package Manifest contains hashes of each package file.
>> #Must be aware of PMS, e.g. aux tag specifies a file in files/.
>> 
>> # 0. Check signature of repo Manifest.
>> # 1. Merge items in Manifest.files, each category Manifest, and each
>package
>> #Manifest into one big list. The path must be made absolute.
>> # 2. Distribute items to threads.
>> 
>> # To check operation compare directory tree to files appearing in all
>> # ManifestRecords.
>> 
>> class ManifestTree(object):
>> __slots__ = ['_directory', '_manifest_list', '_manifest_records',
>> '_manifest_results']
>> 
>> def __init__(self, directory: str):
>> self._directory = directory
>> # Tuples of (base_path, full_path).
>> self._manifest_list = []
>> self._manifest_records = []
>> self._manifest_results = []
>> 
>> def build_manifest_list(self):
>> for path, dirs, files in os.walk(self._directory):
>> #if 'glsa' in path or 'news' in path:
>> #if 'metadata' in path:
>> #continue # Skip the metadata directory for now.
>> # It contains a repository. Current algo barfs on
>Manifest
>> # containing only sig.
>> 
>> if 'Manifest.files.gz' in files:
>> self._manifest_list += [(path, path +
>'/Manifest.files.gz')]
>> if 'Manifest.gz' in files:
>> self._manifest_list += [(path, path +
>'/Manifest.gz')]
>> 
>> if path == self._directory:
>> continue # Skip the repo manifest. Order matters, fix
>eventually.
>> if 'Manifest' in files:
>> self._manifest_list += [(path, path + '/Manifest')]
>> 
>> def parse_manifests(self):
>> td = tempfile.TemporaryDirectory(dir='./')
>> for manifest in self._manifest_list:
>> def inner():
>> if manifest[1].endswith('.gz'):
>> name = 'Manifest.files' # Need to also handle
>Manifest.gz.
>> path = '{0}/{1}'.format(td.name, name)
>> subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
>> 

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Michał Górny
Dnia June 30, 2020 2:13:43 AM UTC, Sid Spry  napisał(a):
>Hello,
>
>I have some runnable pseudocode outlining a faster tree verification
>algorithm.
>Before I create patches I'd like to see if there is any guidance on
>making the
>changes as unobtrusive as possible. If the radical change in algorithm
>is
>acceptable I can work on adding the changes.
>
>Instead of composing any kind of structured data out of the portage
>tree my
>algorithm just lists all files and then optionally batches them out to
>threads.
>There is a noticeable speedup by eliding the tree traversal operations
>which
>can be seen when running the algorithm with a single thread and
>comparing it to
>the current algorithm in gemato (which should still be discussed
>here?).

Without reading the code: does your algorithm correctly detect extraneous files?

>
>Some simple tests like counting all objects traversed and verified
>returns the
>same(ish). Once it is put into portage it could be tested in detail.
>
>There is also my partial attempt at removing the brittle interface to
>GnuPG
>(it's not as if the current code is badly designed, just that parsing
>the
>output of GnuPG directly is likely not the best idea).

The 'brittle interface' is well-defined machine-readable output.

>
>Needs gemato, dnspython, and requests. Slightly better than random code
>because
>I took inspiration from the existing gemato classes.

The code makes a lot of brittle assumptions about the structure. The GLEP was 
specifically designed to avoid that and let us adjust the structure in the 
future to meet our needs.

>
>```python (veriftree.py)
>#!/usr/bin/env python3
>import os, sys, zlib, hashlib, tempfile, shutil, timeit
>import subprocess
>from typing import List
>from pprint import pprint
>
>from gemato.manifest import (
>ManifestFile,
>ManifestFileEntry,
>)
>from wkd import (
>check_domain_signature,
>hash_localpart,
>build_web_key_uri,
>stream_to_file
>)
>from fetchmedia import (
>OpenPGPEnvironment,
>setup_verification_environment
>)
>
># 0. Top level directory (repository) contains Manifest, a PGP
>signature of
>#blake2b and sha512 hashes of Manifest.files.gz.
># 1. Manifest.files contains hashes of each category Manifest.gz.
># 2. The category Manifest contains hashes of each package Manifest.
># 3. The package Manifest contains hashes of each package file.
>#Must be aware of PMS, e.g. aux tag specifies a file in files/.
>
># 0. Check signature of repo Manifest.
># 1. Merge items in Manifest.files, each category Manifest, and each
>package
>#Manifest into one big list. The path must be made absolute.
># 2. Distribute items to threads.
>
># To check operation compare directory tree to files appearing in all
># ManifestRecords.
>
>class ManifestTree(object):
>__slots__ = ['_directory', '_manifest_list', '_manifest_records',
>'_manifest_results']
>
>def __init__(self, directory: str):
>self._directory = directory
># Tuples of (base_path, full_path).
>self._manifest_list = []
>self._manifest_records = []
>self._manifest_results = []
>
>def build_manifest_list(self):
>for path, dirs, files in os.walk(self._directory):
>#if 'glsa' in path or 'news' in path:
>#if 'metadata' in path:
>#continue # Skip the metadata directory for now.
> # It contains a repository. Current algo barfs on Manifest
># containing only sig.
>
>if 'Manifest.files.gz' in files:
>   self._manifest_list += [(path, path + '/Manifest.files.gz')]
>if 'Manifest.gz' in files:
>self._manifest_list += [(path, path + '/Manifest.gz')]
>
>if path == self._directory:
>  continue # Skip the repo manifest. Order matters, fix eventually.
>if 'Manifest' in files:
>self._manifest_list += [(path, path + '/Manifest')]
>
>def parse_manifests(self):
>td = tempfile.TemporaryDirectory(dir='./')
>for manifest in self._manifest_list:
>def inner():
>if manifest[1].endswith('.gz'):
> name = 'Manifest.files' # Need to also handle Manifest.gz.
>path = '{0}/{1}'.format(td.name, name)
>subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
>.format(manifest[1], path)])
>for line in open(path):
>mr = ManifestRecord(line)
>mr.make_absolute(manifest[0])
>self._manifest_records += [mr]
>else:
>for line in open(manifest[1]):
>if line.startswith('-'):
>return # Skip the signed manifest.
>mr = ManifestRecord(line)
>mr.make_absolute(manifest[0])
>self._manifest_records += [mr]
>   

Re: [gentoo-portage-dev] Speeding up Tree Verification

2020-06-30 Thread Fabian Groffen
Hi,

On 29-06-2020 21:13:43 -0500, Sid Spry wrote:
> Hello,
> 
> I have some runnable pseudocode outlining a faster tree verification 
> algorithm.
> Before I create patches I'd like to see if there is any guidance on making the
> changes as unobtrusive as possible. If the radical change in algorithm is
> acceptable I can work on adding the changes.
> 
> Instead of composing any kind of structured data out of the portage tree my
> algorithm just lists all files and then optionally batches them out to 
> threads.
> There is a noticeable speedup by eliding the tree traversal operations which
> can be seen when running the algorithm with a single thread and comparing it 
> to
> the current algorithm in gemato (which should still be discussed here?).

I remember something that gemato used to use multiple threads, but
because it totally saturated disk-IO, it was brought back to a single
thread.  People were complaining about unusable systems.

In any case, can you share your performance results?  What speedup did
you see, on warm and hot FS caches?  Which type of disk do you use?

You could compare against qmanifest, which uses OpenMP-based
paralllelism while verifying the tree.  On SSDs this does help.

Thanks,
Fabian

> 
> Some simple tests like counting all objects traversed and verified returns the
> same(ish). Once it is put into portage it could be tested in detail.
> 
> There is also my partial attempt at removing the brittle interface to GnuPG
> (it's not as if the current code is badly designed, just that parsing the
> output of GnuPG directly is likely not the best idea).
> 
> Needs gemato, dnspython, and requests. Slightly better than random code 
> because
> I took inspiration from the existing gemato classes.
> 
> ```python (veriftree.py)
> #!/usr/bin/env python3
> import os, sys, zlib, hashlib, tempfile, shutil, timeit
> import subprocess
> from typing import List
> from pprint import pprint
> 
> from gemato.manifest import (
> ManifestFile,
> ManifestFileEntry,
> )
> from wkd import (
> check_domain_signature,
> hash_localpart,
> build_web_key_uri,
> stream_to_file
> )
> from fetchmedia import (
> OpenPGPEnvironment,
> setup_verification_environment
> )
> 
> # 0. Top level directory (repository) contains Manifest, a PGP signature of
> #blake2b and sha512 hashes of Manifest.files.gz.
> # 1. Manifest.files contains hashes of each category Manifest.gz.
> # 2. The category Manifest contains hashes of each package Manifest.
> # 3. The package Manifest contains hashes of each package file.
> #Must be aware of PMS, e.g. aux tag specifies a file in files/.
> 
> # 0. Check signature of repo Manifest.
> # 1. Merge items in Manifest.files, each category Manifest, and each package
> #Manifest into one big list. The path must be made absolute.
> # 2. Distribute items to threads.
> 
> # To check operation compare directory tree to files appearing in all
> # ManifestRecords.
> 
> class ManifestTree(object):
> __slots__ = ['_directory', '_manifest_list', '_manifest_records',
> '_manifest_results']
> 
> def __init__(self, directory: str):
> self._directory = directory
> # Tuples of (base_path, full_path).
> self._manifest_list = []
> self._manifest_records = []
> self._manifest_results = []
> 
> def build_manifest_list(self):
> for path, dirs, files in os.walk(self._directory):
> #if 'glsa' in path or 'news' in path:
> #if 'metadata' in path:
> #continue # Skip the metadata directory for now.
> # It contains a repository. Current algo barfs on Manifest
> # containing only sig.
> 
> if 'Manifest.files.gz' in files:
> self._manifest_list += [(path, path + '/Manifest.files.gz')]
> if 'Manifest.gz' in files:
> self._manifest_list += [(path, path + '/Manifest.gz')]
> 
> if path == self._directory:
> continue # Skip the repo manifest. Order matters, fix 
> eventually.
> if 'Manifest' in files:
> self._manifest_list += [(path, path + '/Manifest')]
> 
> def parse_manifests(self):
> td = tempfile.TemporaryDirectory(dir='./')
> for manifest in self._manifest_list:
> def inner():
> if manifest[1].endswith('.gz'):
> name = 'Manifest.files' # Need to also handle Manifest.gz.
> path = '{0}/{1}'.format(td.name, name)
> subprocess.run(['sh', '-c', 'gunzip -c {0} > {1}'
> .format(manifest[1], path)])
> for line in open(path):
> mr = ManifestRecord(line)
> mr.make_absolute(manifest[0])
> self._manifest_records += [mr]
> else:
> for line in open(manifest[1]):
>