Re: Software Management call for RFEs

2013-05-27 Thread Zdenek Pavlas
> And there are package diffs, which are ed-style diffs of the
> Packages file I mentioned above.  This approach would work quite well
> for primary.xml because it doesn't contain cross-references between
> packages using non-natural keys.  It doesn't work for the SQLite
> database, either in binary or SQL dump format, because of the reliance
> on artificial primary keys (such as package IDs).

I once tried this.  With about 10k packages in fedora-updates, the delta
over 2-3 days was +491/-479 packages.  Assuming deletions are cheap, the
delta should ideally be about 5% (roughly 500 changed entries out of 10k).
As expected, a binary bsdiff yields a much bigger (~29%) delta.

Very roughly, it's the 5% that really describes new packages, plus an
almost constant 24% overhead to fix up the inevitable changes in surrogate
keys.  Not as bad as I was afraid, but still not worth it (IMO).

So, we need *.xml deltas.  Yum can rebuild xml -> .sqlite locally, but
this needs quite a lot of memory and takes TENS of seconds.  Add the time
needed to patch the quite large uncompressed xml file, and suddenly the
fact that you're downloading just 1/10th of the data hardly pays off
(ignoring very specific use cases, like mobile data, for a moment).

For DNF, it's different.  It has to rebuild xml -> .solv anyway, so this
comes for free.

> However, for many users that follow unstable or testing, package diffs
> are currently slower than downloading the full Packages file because the
> diffs are incremental (i.e., they contain the changes from file version
> N to N+1, and you have to apply all of them to get to the current
> version) and apt-get can easily write 100 MB or more because the
> Packages file is rewritten locally multiple times.

Yes, patch chaining should be avoided.  I'd like to use N -> 1 deltas
that could be applied to many recent snapshots.
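
Purely as an illustration (these file names are hypothetical, not an
existing layout): the server would keep one delta per recent snapshot,
each going straight to the current version, and a client would pick
the one matching its local copy:

  repodata/deltas/<snapshot-N-2>-to-current-primary.xml.diff
  repodata/deltas/<snapshot-N-1>-to-current-primary.xml.diff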

Re: Software Management call for RFEs

2013-05-27 Thread Zdenek Pavlas
> Can you point me to the primary.xml -> SQLite translation in yum?  I've
> got a fairly efficient primary.xml parser.

Just set mddownloadpolicy=xml in yum.conf.  It should work, but
since downloading sqlite.bz2 is much better, very few people use this.
Yum uses a fairly efficient parser, written in C on top of libxml2
(the yum-metadata-parser package).  It's always bundled, because
Yum has to support xml-only repositories anyway.
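
For example, a minimal yum.conf fragment:

  [main]
  mddownloadpolicy=xml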

Oh, there's a typo in yum.conf.5 .. fixed.

> It might be interesting to
> see if it's possible to reduce the latency introduced by the SQLite
> conversion to close to zero.  (Decompression and INSERTs can be
> interleaved with downloading, and maybe the index creation improvements
> in SQLite are sufficient these days.)

We have to checksum the downloaded data before processing, and this
kills pipelining.  Also, when updating primary_db with a bunch of INSERTs
and DELETEs, your database differs from the one on the server:

1) a different *.sqlite checksum
2) a different pkgId -> pkgKey mapping
3) a different order of packages from SELECTs

For speed, Yum joins primary_db and filelists_db via pkgKey, so #2
breaks Yum unless you always download/delta-update both, and that
kills the win in the "we don't need filelists" case.
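
The join in question looks roughly like this (schema simplified and
from memory, so treat it as a sketch rather than the exact yum query):

  $ sqlite3 primary.sqlite
  sqlite> ATTACH 'filelists.sqlite' AS fl;
  sqlite> SELECT p.name, f.dirname, f.filenames
     ...>   FROM packages p JOIN fl.filelist f ON f.pkgKey = p.pkgKey;

If the two databases were delta-updated independently, the surrogate
pkgKey values would no longer line up and the join would quietly
return wrong rows.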

Yum: faster bash completion of available packages

2013-04-05 Thread Zdenek Pavlas
Hi,

People were complaining that yum autocompletion of package names
is very slow (Bug 919852).  Now there's a shortcut that (if enabled)
makes it much faster.  However, the behavior changes slightly:

- disabled but cached repositories are used
- configured package excludes are not honored
- all available arches are suggested
- there's no installed/available package split

It's now in F19 and rawhide, but DISABLED, due to the above.
To enable it, just set the YUM_CACHEDIR env variable to the path
where repository metadata are stored, usually /var/cache/yum.
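
For example (assuming the default cache location):

  $ export YUM_CACHEDIR=/var/cache/yum
  $ yum install <TAB><TAB>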

Feedback is very welcome, so we can decide whether to scrap
this or enable it by default.

Thanks!

Re: Problem with F17 yum requesting old repodata?

2012-11-27 Thread Zdenek Pavlas
> /pub/fedora/linux/updates/17/x86_64/repodata/34881e74623de1754bf0e12f01884bea615fdcee05eded189f22d41bf9d4260b-other.sqlite.bz2
>
> That file no longer exists; it was on my mirror from 21:12 Monday to
> 22:17 Tuesday.
>
> I'm guessing there's either a yum bug or a broken repo push that is:
>
> - causing yum to try to fetch an out-of-date repodata file

That's puzzling.  Repodata are fetched in two steps:
1) repomd.xml and primary.sqlite: when the repo is opened
2) filelists and other: loaded on demand

There's usually just a very small delay between 1) and 2),
and 1) may use metadata at most 6 hours old.

When something requests otherdata that expired 3 days ago,
it means that either:

- the metadata_expire option was changed in yum.conf,
- or it's some long-running application.
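
The 6-hour window mentioned above is yum's default; it can also be set
explicitly (a minimal yum.conf fragment):

  [main]
  metadata_expire=6h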

> - causing it to try over and over again rapidly

AFAIK, Yum (cli) does not need other.sqlite.  Maybe something
runs "repoquery --changelog somepackage" in a retry loop..

Or maybe it's PackageKit, trying to fetch the changelogs
of updated packages without updating the primary metadata first?
That would fit the "long-running app" guess.

Re: Problem with F17 yum requesting old repodata?

2012-11-26 Thread Zdenek Pavlas
> This is what I get for the last few days:
> # yum --skip-broken update

This is a fairly common scenario.  Let me explain..
The repository's cache cookie was older than 6 hours,
so Yum updates the metalink.xml file:

> updates/17/x86_64/metalink  |  16 kB 00:00

The repomd.xml timestamp stored in metalink.xml has changed,
so Yum retrieves the new repomd.xml from some mirror:

> updates | 4.7 kB 00:00

Then, the $hash-primary.sqlite.bz2 referenced by the new repomd.xml
needs to be retrieved.  But the first mirror Yum tried didn't have it:

> http://ftp.informatik.uni-frankfurt.de/fedora/updates/17/x86_64/repodata/00c7410a78aa8dd0f4934ed4935377b99e0339101cee369c1b1691f3025950ac-primary.sqlite.bz2:
> [Errno 14] HTTP Error 404 - Not Found

Yum tries the same relative URL on another mirror.
This time the download was successful:

> Trying other mirror.
> updates/primary_db  | 6.9 MB 00:06

Metadata are pushed to mirrors as independent files.  Probably
the tiny repomd.xml gets there way ahead of primary.sqlite.bz2,
so a race is possible.

But since we handle it (by trying another mirror, or by reverting
to the previous metadata when all mirrors fail), I don't consider
this a bug.

re: request: remove gd.tuwien.ac.at from mirror-lists

2012-10-03 Thread Zdenek Pavlas
On Wed Sep 26 08:02:05, Reindl Harald wrote:

> yes, and that is why statistics of mirrors are meaningless.
> because of this fact, my idea: give us a config option,
> "dear yum, if the selected mirror provides lower than
> 500 KB/s, try another one, because my line can do 12 MB/s"

FWIW, we've increased the low speed limit in urlgrabber from
1 B/s to 1000 B/s.  This should fix the most pathological cases.
When the speed stays below this limit for 30s, the download is aborted
as if it had timed out, and the next mirror is used.  Each timeout also
halves the mirror's estimated speed, so it will very likely be
avoided next time.

https://admin.fedoraproject.org/updates/FEDORA-2012-14928/python-urlgrabber-3.9.1-21.fc18

The timeout value (30s by default) can be adjusted in yum.conf.
The low speed limit is hardcoded, but there's a simple patch to add
it to yum.conf.. it could be merged if necessary.

http://lists.baseurl.org/pipermail/yum-devel/2012-September/009634.html
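
The existing knob, for reference (a minimal yum.conf fragment with the
default value):

  [main]
  timeout=30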

yum pragma: no-cache (was Re: F18 DNF and history)

2012-06-22 Thread Zdenek Pavlas
> From: Nicolas Mailhot nicolas.mail...@laposte.net
> can we get a package downloader that sends the correct
> cache-control http headers to refresh data automatically
> instead of complaining metadata is wrong and aborting
> (for people behind a caching proxy)?

Have you tried changing the http_caching option in yum.conf
from the default 'all' to 'packages'?
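
That is (a minimal yum.conf fragment):

  [main]
  http_caching=packages

With 'packages', only package downloads may be served from an HTTP
cache; if I remember correctly, metadata requests are then sent with
cache-busting headers (hence the subject line).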

Re: [announce] yum: parallel downloading

2012-05-22 Thread Zdenek Pavlas
> I'd be happy if yum/urlgrabber/libcurl finally used http keepalives.

It does, indeed.  The parallel downloader tries to use keepalives, too
(we cache and reuse the last idle process instead of closing it).

> Last time I looked (and it has been a while), it didn't, so you
> always paid the TCP slow-start penalty for each package.

/me just checked with tcpflow that we really do.
Please contact me off-list if you can reproduce it.  Thanks!

Re: [announce] yum: parallel downloading

2012-05-21 Thread Zdenek Pavlas
Hi Glen,

> Why is the default three connections rather than one?  Is a tripling
> of the number of connections to a mirror on a Fedora release day
> desirable?

$ grep maxconnections /var/cache/yum/*/metalink.xml
/var/cache/yum/fedora/metalink.xml:  <resources maxconnections="1">
/var/cache/yum/updates/metalink.xml:  <resources maxconnections="1">

Yum understands this.

> Consider that a large mirror site already sees concurrent connections
> in the multiple 10,000s.

The three-connection limit is used when the above is not available
(e.g. a baseurl setup with just one mirror).  I don't mind lowering
it to just two, as that should work well enough in most cases.

Re: [announce] yum: parallel downloading

2012-05-21 Thread Zdenek Pavlas
> The number of concurrent users is now lower because, well, each of
> them now completes a yum update in one third of the time.

I think Glen's concerns were that the consumed resources
(flow caches, TCP hash entries, sockets) may scale faster
than the aggregate downloading speed.

I am aware of this, and in most cases the downloader in urlgrabber
will make just one concurrent connection to a mirror, because:

1) The Nth concurrent connection to the same host is assumed
   to be N times slower than the 1st one, so we'll very likely
   not select the same mirror again (see the sketch below).

2) maxconnections="1" in every metalink I've seen so far.
   This is a hard limit; we block until a download finishes
   and reuse its connection when the limit is maxed out.
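
A rough sketch of the first heuristic, purely illustrative (this is a
model of the idea, not the actual urlgrabber code):

  # assume a mirror's bandwidth is shared equally by its connections,
  # so opening one more makes each stream proportionally slower
  def effective_speed(estimated_speed, open_connections):
      return estimated_speed / (open_connections + 1)

With this model, a mirror that already has a download running looks
about half as fast as an idle one, so the selection naturally spreads
connections across mirrors.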

The reason for NOT banning >1 connections to the same host altogether
is that (as John Reiser wrote) a 2nd connection does help quite a lot
when downloading many small files and just one mirror is available.
I agree that using strictly 1 connection and HTTP pipelining would
be even better, but we can't do that with libcurl.

--
Zdenek

Re: [announce] yum: parallel downloading

2012-05-17 Thread Zdenek Pavlas
>> Both packages are compatible with older versions.
>
> Can we use them in Fedora 17 too?

Yes, I've used it in F14 for some time.

--
Zdeněk Pavlas

Re: [announce] yum: parallel downloading

2012-05-17 Thread Zdenek Pavlas
> So disable fastestmirror plugin before testing this,
> would be the way to go?

The fastestmirror plugin does some initial mirror sorting.  
We mostly ignore this, so disabling fastestmirror makes sense
but is not strictly necessary.

--
Zdeněk Pavlas

[announce] yum: parallel downloading

2012-05-16 Thread Zdenek Pavlas
Hi,

New yum and urlgrabber packages have just hit Rawhide.  These releases
include some new features, among them parallel downloading of packages and
metadata, and new mirror selection code.  As we plan to include these
features in RHEL 7, I welcome any feedback and bug reports!

python-urlgrabber-3.9.1-12.fc18 supports a new API to urlgrab() files in
parallel, and yum-3.4.3-26.fc18 can use it.  Both packages are compatible
with older versions.
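
From memory, the new API looks roughly like this (a hedged sketch;
check the urlgrabber source for the exact option names and signatures):

  from urlgrabber.grabber import urlgrab, parallel_wait

  # 'downloads' is a list of (url, local filename) pairs;
  # async=(key, limit) caps the parallelism per mirror key
  for url, fn in downloads:
      urlgrab(url, filename=fn, async=('mirror-a', 2))

  # block until everything queued above has finished
  parallel_wait()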

Feature list:

- parallel downloading of packages and metadata

Where possible, multiple files are downloaded in parallel (see below for
the limitations that apply).

- configurable 'max_connections' limit in yum.conf

This is the maximum number of simultaneous connections Yum makes.  Its
purpose is to limit local resources (the number of processes forked).  The
default is urlgrabber's default value of 5.
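
For example (a minimal yum.conf fragment, showing the default):

  [main]
  max_connections=5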

- mirror limits are honored, too.

Making many connections to the same mirror usually does not help much; it
just consumes more resources.  That's why Yum also honors the mirror limits
from metalink.xml.  If no such limit is available, at most 3 simultaneous
connections are made to any single mirror.

- new mirror selection algorithm

The real downloading speed is calculated after each download, and the
mirror's statistics are updated.  These are in turn used when selecting
mirrors for further downloads.  This should be more accurate than measuring
latencies in the fastestmirror plugin, but slow mirrors now have to be
retried from time to time, and the statistics need some time to build up.
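
As a minimal sketch of the idea (illustrative; the real urlgrabber code
and its weights differ):

  # blend each measured speed into the mirror's running estimate
  def update_estimate(estimate, size_bytes, duration_s, weight=0.5):
      measured = size_bytes / duration_s
      return (1 - weight) * estimate + weight * measured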

- ctrl-c handling

This is a long-standing problem in Yum.  Due to various shortcomings in rpm
and curl, it's impossible to react immediately to SIGINT.  But now the
downloader runs in a different process, so we can exit even if curl is still
stuck.  The "skip to next mirror" feature is gone (we don't want to restart
all currently running downloads).

Known limitations:

- metalink.xml and repomd.xml downloads are not parallelized yet.


--
Zdeněk Pavlas