[gentoo-dev] [RFC] euscan: Need to add more upstream info in metadata.xml

2012-08-10 Thread Federico fox Scrinzi
Hi everybody!

euscan is available in Portage as a dev package
(app-portage/euscan-). The tool checks whether a given
package/ebuild has new upstream versions. It uses several
heuristics to scan upstream and collect new versions and their urls.

euscan can either use custom handlers for well-known upstreams (github,
pypi, cpan, sourceforge, google-code, etc.) or scan directories derived
from SRC_URI. If the directory scan fails for some reason, euscan
falls back to brute force (generating possible next version numbers and
trying to fetch those packages).
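
The brute-force fallback described above can be sketched roughly as follows. This is only an illustration, not euscan's actual code: the function names, the `${PV}` placeholder in the url template, and the `steps` limit are all assumptions made for the example.

```python
def next_version_candidates(version, steps=2):
    """Guess plausible successors of a dotted numeric version string.

    Hypothetical helper: bump each component by 1..steps and reset the
    components to its right to zero.
    """
    parts = [int(p) for p in version.split(".")]
    candidates = []
    for i in range(len(parts)):
        for step in range(1, steps + 1):
            bumped = parts[:i] + [parts[i] + step] + [0] * (len(parts) - i - 1)
            candidates.append(".".join(str(p) for p in bumped))
    return candidates


def candidate_urls(url_template, version, steps=2):
    """Build the urls a brute-force scan would then try to fetch.

    url_template uses "${PV}" as an assumed placeholder for the version.
    """
    return [
        url_template.replace("${PV}", candidate)
        for candidate in next_version_candidates(version, steps)
    ]
```

Each candidate url would then be probed (for example with an HTTP HEAD request), which is why this approach is slow and only useful as a last resort.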

The problem that we're facing with euscan is that some upstream
packages use unusual version numbers, or the list of available versions
lives in a location that is completely different from SRC_URI.

Examples:
- MySQL: most MySQL mirrors are not browsable (euscan always falls back
to brute force)
- webalizer uses unusual version numbers upstream
(ftp://ftp.mrunix.net/pub/webalizer/). In this case euscan should be
aware that 2.21-02 is the upstream version number and scan the ftp
directory for webalizer-(\d+).(\d+)-(\d+).tar.gz. The latest
version of webalizer, 2.23.05, is not recognized by euscan and is not
available in Gentoo.
- Authen-SASL-Cyrus uses "-server" in upstream version numbers:
http://www.cpan.org/authors/id/P/PB/PBOETTCH/
- XML-Tidy uses unusual letters in its version numbers


We thought about how to solve this issue and agreed that the best way
to handle each specific case is to add some more
information to metadata.xml.

In Debian, uscan uses information from the debian/watch file inside Debian
packages. Since so much work has already been done there, we thought about
taking this info from watch files and saving it in metadata.xml for euscan
to use.

I wrote a simple script that patches metadata.xml adding an experimental
watch tag with data from debian packages:
https://github.com/volpino/euscan/blob/master/bin/euscan_patch_metadata

Basic watch data contains a base url to scan and a pattern to search
for in it:
Example:
 base: http://icedtea.classpath.org/download/source/
 pattern: icedtea-([\d\.]+).tar.gz
This means: open that url and look for links that match the
pattern.
This is useful, for example, when it is not possible to derive the base url
from SRC_URI (icedtea's SRC_URI is
http://icedtea.classpath.org/hg/release/icedtea7-forest-2.2/hotspot/archive/889dffcf4a54.tar.gz)
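
A minimal sketch of such a page scan, assuming the page at the base url has already been downloaded into a string. `LinkCollector` and `scan_page` are names invented for this example and are not part of euscan; the pattern is the one from the icedtea example, and the sample HTML is made up.

```python
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def scan_page(html, pattern):
    """Return the versions captured by `pattern` among the page's links."""
    collector = LinkCollector()
    collector.feed(html)
    regex = re.compile(pattern)
    versions = []
    for link in collector.links:
        filename = link.rsplit("/", 1)[-1]  # compare the file name only
        match = regex.fullmatch(filename)
        if match:
            versions.append(match.group(1))
    return versions


# Made-up listing, standing in for the page fetched from the base url.
SAMPLE_HTML = """
<a href="icedtea-2.2.tar.gz">icedtea-2.2.tar.gz</a>
<a href="/download/source/icedtea-2.3.1.tar.gz">icedtea-2.3.1.tar.gz</a>
<a href="README">README</a>
"""
versions = scan_page(SAMPLE_HTML, r"icedtea-([\d\.]+).tar.gz")
```

Using `fullmatch` on the file name keeps things like detached signature files (`.tar.gz.sig`) out of the result.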

Advanced usage with directory pattern:
Example:
 base: http://ftp.gwdg.de/pub/misc/mysql/Downloads/MySQL-([\d\.]+)
 pattern: mysql-([\d\.]+).tar.gz
This scans all directories that match the regexp in the base url, looking
for links that match the pattern.
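
The directory step can be sketched like this, under the assumption (made only for this example) that the regexp appears in the last path component of the base url and that the parent directory listing is already available as a list of names. `split_base` and `matching_directories` are hypothetical helpers.

```python
import re


def split_base(base):
    """Split a base url into its fixed parent url and the directory regexp."""
    parent, _, dir_pattern = base.rpartition("/")
    return parent + "/", dir_pattern


def matching_directories(base, listing):
    """Return the urls of the listed directories that match the base regexp."""
    parent, dir_pattern = split_base(base)
    regex = re.compile(dir_pattern)
    urls = []
    for name in listing:
        name = name.rstrip("/")  # listings often show directories as "name/"
        if regex.fullmatch(name):
            urls.append(parent + name + "/")
    return urls


BASE = r"http://ftp.gwdg.de/pub/misc/mysql/Downloads/MySQL-([\d\.]+)"
# Made-up listing of the Downloads/ directory.
directories = matching_directories(BASE, ["MySQL-5.1/", "MySQL-5.5/", "Connector-J/"])
```

Each returned directory would then be scanned for links matching the file pattern, exactly as in the basic case.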

We also need some options for mangling versions and download urls. These
options can contain regexps or names of mangling rules (e.g. cpan
means: apply the mangling rules for CPAN versions).

Version mangling example:
As mentioned above, webalizer uses both dots and hyphens in version
numbers, so an option like this is required: versionmangle="s/-/./"

Download url mangling example:
A page scan on berlios returns a url like this:
http://prdownload.berlios.de/mirageiv/mirage-0.9.tar.gz which has to be
mangled to get a working download url, with an option like
downloadurlmangle="s/prdownload/download/"
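
Both mangling options above are sed-style substitutions. A rough Python sketch of applying such a rule follows; `apply_sed_rule` is a hypothetical helper, and it deliberately supports only the simple `s/PATTERN/REPLACEMENT/` form with an optional `g` flag (no escaped delimiters, no Perl-style `$1` references).

```python
import re


def apply_sed_rule(rule, text):
    """Apply a rule of the form s/PATTERN/REPLACEMENT/[g] to text."""
    if len(rule) < 2 or not rule.startswith("s"):
        raise ValueError("unsupported rule: %r" % rule)
    delimiter = rule[1]
    parts = rule[2:].split(delimiter)
    if len(parts) != 3:
        raise ValueError("unsupported rule: %r" % rule)
    pattern, replacement, flags = parts
    # Like sed, replace only the first match unless the g flag is given.
    count = 0 if "g" in flags else 1
    return re.sub(pattern, replacement, text, count=count)


version = apply_sed_rule("s/-/./", "2.23-05")
url = apply_sed_rule(
    "s/prdownload/download/",
    "http://prdownload.berlios.de/mirageiv/mirage-0.9.tar.gz",
)
```

Here `version` becomes `2.23.05` and `url` becomes `http://download.berlios.de/mirageiv/mirage-0.9.tar.gz`.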

(for more info see uscan manpage)

Another example: dev-perl/Math-BaseCnv and XML-Tidy use unusual upstream
version numbers like 1.8.B59BrZ, which should be mangled to 1.8
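
One possible way to do this kind of mangling is to keep only the leading dotted-numeric part of the version. This is a sketch of that idea, not the rule euscan or uscan actually applies; `strip_non_numeric_suffix` is a name made up for the example.

```python
import re

# Leading run of dot-separated integers, e.g. "1.8" in "1.8.B59BrZ".
NUMERIC_PREFIX = re.compile(r"\d+(?:\.\d+)*")


def strip_non_numeric_suffix(version):
    """Return the leading numeric part of a version, or the input unchanged."""
    match = NUMERIC_PREFIX.match(version)
    return match.group(0) if match else version


mangled = strip_non_numeric_suffix("1.8.B59BrZ")
```

With the example above, `mangled` is `1.8`.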

To summarize, we need:
- a base url and a file pattern to search for new upstream versions when
SRC_URI is not suitable
- some options for mangling the data retrieved from upstream, whether
the scan uses the base url and pattern or the remote-id information

So our problem is: how can we store this data in a very flexible and
efficient way?
Proposed solutions:

1) Add a euscan tag with a custom namespace
Example:
<euscan xmlns="http://euscan.iksaif.net">
 <transformation>
   <regexp><from>a</from><to>b</to></regexp>
   <cpan-mangle/>
   <gentoo-mangle/>
 </transformation>
</euscan>
Which means: apply regex s/a/b/ then apply cpan mangling rules and then
gentoo mangling rules.
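
A sketch of how a consumer could read such a transformation block and apply the steps in order. The XML layout follows the description of proposal 1; the two named rules are placeholder implementations invented purely for illustration (the real CPAN and Gentoo mangling rules are not specified in this mail), and `build_pipeline` / `mangle` are hypothetical names.

```python
import re
import xml.etree.ElementTree as ET

NS = "{http://euscan.iksaif.net}"

# Placeholder rules only: stand-ins so that the pipeline can run.
NAMED_RULES = {
    "cpan-mangle": lambda version: version.lstrip("v"),
    "gentoo-mangle": lambda version: version.replace("-", "."),
}

EXAMPLE = """
<euscan xmlns="http://euscan.iksaif.net">
  <transformation>
    <regexp><from>a</from><to>b</to></regexp>
    <cpan-mangle/>
    <gentoo-mangle/>
  </transformation>
</euscan>
""".strip()


def build_pipeline(xml_text):
    """Turn the children of <transformation> into an ordered list of callables."""
    root = ET.fromstring(xml_text)
    steps = []
    for node in root.find(NS + "transformation"):
        name = node.tag.split("}", 1)[-1]  # drop the namespace prefix
        if name == "regexp":
            pattern = node.findtext(NS + "from")
            replacement = node.findtext(NS + "to")
            # Equivalent of s/from/to/: replace the first match only.
            steps.append(
                lambda v, p=pattern, r=replacement: re.sub(p, r, v, count=1)
            )
        else:
            steps.append(NAMED_RULES[name])
    return steps


def mangle(version, steps):
    """Apply every step in document order."""
    for step in steps:
        version = step(version)
    return version


result = mangle("v1.0a-2", build_pipeline(EXAMPLE))
```

The point of the sketch is the ordering: steps run in the order they appear in the document, so the regexp sees the raw upstream version and the named rules see its output.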

2) Change the remote-id tag quite heavily:
   -  adding versionmangling and downloadmangling options that contain
regexes
   -  adding a new remote-id type called, for example, url; that tag would
contain the base url and the pattern

3) Add a watch tag to upstream with versionmangling and
downloadmangling options. This tag can have a type (and in that case the
data from remote-id is used) or can contain the base url and the file
pattern. (this is what is currently implemented for our tests).


So before going further, we would like some feedback from you on these
approaches.
What do you think about them? Which do you prefer? Do you think there's
a better approach, or that some steps could be done more efficiently?



Other examples:

dev-perl/XML-Tidy: # We have to strip trailing letters in version and
then apply cpan mangling rules
<upstream>
  <remote-id type="cpan">XML-Tidy</remote-id>
  <remote-id 

Re: [gentoo-dev] [RFC] euscan: Need to add more upstream info in metadata.xml

2012-08-10 Thread Gilles Dartiguelongue
Having done some Debian packaging for work, I find Debian's watch files
really helpful. Changing the format to an XML-compatible one does
not seem like hard work, so I'll probably leave that for others to
discuss.

Since you are proposing this, a side question is:
Why should we write SRC_URI in ebuilds if that info is now available in
metadata.xml? (granted that we might still want to keep overriding
this information in ebuilds)

-- 
Gilles Dartiguelongue e...@gentoo.org
Gentoo




Re: [gentoo-dev] [RFC] euscan: Need to add more upstream info in metadata.xml

2012-08-10 Thread Corentin Chary
On Fri, Aug 10, 2012 at 2:03 PM, Gilles Dartiguelongue e...@gentoo.org wrote:
> Having done some Debian packaging for work, I find Debian's watch files
> really helpful. Changing the format to an XML-compatible one does
> not seem like hard work, so I'll probably leave that for others to
> discuss.
>
> Since you are proposing this, a side question is:
> Why should we write SRC_URI in ebuilds if that info is now available in
> metadata.xml? (granted that we might still want to keep overriding
> this information in ebuilds)

It's not (only) SRC_URI: sometimes it's completely different, and sometimes
watch would contain only versionmangle since SRC_URI contains
enough information for euscan... SRC_URI serves a totally different
purpose :).

-- 
Corentin Chary
http://xf.iksaif.net