Source: sphinx
Version: 5.3.0-3
Severity: wishlist

Countless packages contain a stanza in their docs/conf.py like:
    intersphinx_mapping = {
        "python": ("https://docs.python.org/3";, None),
    }

Given that packages in Debian cannot use network access when being
built, and that the dh_make Sphinx boilerplate suggests defining
http_proxy to keep Sphinx from resolving this over the internet, one of
two things happens:

1) The maintainer patches the upstream source through debian/patches to
point these references to the local filesystem (see the sketch after
this list). That's what src:sphinx itself does, through the
intersphinx_local.diff patch. A quick codesearch[1] reveals ~385
packages doing something similar.

2) The maintainer does not patch the source, Sphinx attempts to fetch
the file from the network, fails due to http_proxy, and the generated
docs do not resolve these references. Build-time warnings are emitted
of the form:
   WARNING: py:class reference target not found: pathlib.Path
I don't know of an easy way to grep through the build logs to generate
numbers. Anecdotally, I've seen quite a few packages in that category as
well. (Perhaps one could add a tag to the Buildd Log Scanner[2] to scan
for this?)
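
To illustrate case 1), the debian/patches change often boils down to
something like the following in docs/conf.py -- a rough sketch, with
the python3-doc path being the one assumed in the proof of concept
below; other documentation packages install their HTML elsewhere:

    intersphinx_mapping = {
        # Debian: read the inventory from the locally installed
        # documentation package instead of fetching it over the network.
        "python": ("/usr/share/doc/python3-doc/html", None),
    }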

It'd be great if intersphinx in Debian were patched to map these
references to the local Debian package and also to generate the
necessary dependencies -- perhaps guarded by an environment variable or
command-line option that only dh_sphinx would pass, for example.
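
To make this a bit more concrete, here is a rough sketch of the
remapping step that a patched intersphinx (or dh_sphinx) could perform,
consuming the tab-separated file that the script below generates. All
names are hypothetical -- the mapping file path, the DEB_INTERSPHINX
guard variable and the helper itself are made up for illustration --
and it assumes the (target, inventory) tuple form of
intersphinx_mapping:

    import os
    from pathlib import Path

    # Hypothetical install location for the file generated by the script below.
    MAPPING_FILE = "/usr/share/sphinx/debian_intersphinx_mapping"

    def debianize_intersphinx_mapping(mapping):
        """Rewrite intersphinx targets to locally installed documentation.

        Takes a conf.py-style dict of name -> (target URL, inventory) and
        returns the rewritten dict plus the set of binary packages the
        build now depends on (candidates for ${sphinx:Depends}).
        """
        # Hypothetical guard, e.g. exported only by dh_sphinx.
        if not os.environ.get("DEB_INTERSPHINX"):
            return mapping, set()
        local = {}
        for line in Path(MAPPING_FILE).read_text().splitlines():
            url, pkg, doc_path = line.split("\t")
            local[url.rstrip("/")] = (pkg, doc_path)
        rewritten, depends = {}, set()
        for name, (target, inventory) in mapping.items():
            if target.rstrip("/") in local:
                pkg, doc_path = local[target.rstrip("/")]
                depends.add(pkg)
                rewritten[name] = (doc_path, inventory)
            else:
                # No locally packaged documentation known for this target.
                rewritten[name] = (target, inventory)
        return rewritten, depends

dh_sphinx could then turn the returned package set into the
${sphinx:Depends} substitution variable.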

Beyond patching the Sphinx code itself, there is of course the matter
of generating these mappings, which is surprisingly non-trivial. From
what I can tell the mappings need to be created heuristically, since I
haven't found a way for packages to declare in metadata where their
generated documentation will be published.

I played around with a few ideas, and while I haven't yet settled on
something that doesn't feel dirty, I tried to implement something akin
to what dh-python's pydist/generate_fallback_list.py does: a script in
the source, executed manually and periodically by the maintainer to
regenerate the cache, which creates a mapping that is then committed to
git and shipped in the binary.

So I implemented the attached proof of concept that:
* Scans Contents-all and Contents-amd64 to find objects.inv files
  and maps them back to the binary packages;
* Queries UDD (through one query with joins) to:
  - find the respective source packages for these binary packages
  - find the upstream metadata[3] for these source packages
* Prints a tab-separated intersphinx_mappings file that has:
  <documentation URL>\t<binary package>\t<directory containing objects.inv>
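
For example, the python3-doc mapping (hardcoded in the script's
EXTRA_MAPPINGS) is emitted as the line (fields separated by tabs):

    https://docs.python.org/3/    python3-doc    /usr/share/doc/python3-doc/html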

It takes ~10s to run on my computer right now, which should be fine for
periodic execution by the maintainer.

I'm sure there are many issues with this approach that I haven't thought
through, as well as a number of corner cases, but I wanted to have
something to kickstart this discussion beyond just wishful thinking!

I'd love any feedback! Note that I haven't looked at all at what it
would take to integrate this mapping into the Sphinx source (as well as
${sphinx:Depends}), as I thought it'd be good to validate the approach
before I do so.

Regards,
Faidon

1: https://codesearch.debian.net/search?q=intersphinx_mapping+path%3Adebian%2F&literal=1&perpkg=1
2: https://qa.debian.org/bls/
3: I ran into stale data in that table, which is now tracked as #1032587
#!/usr/bin/python3
#
# Copyright © 2023 Faidon Liambotis <parav...@debian.org>
#
# Roughly based on dh-python/pydist/generate_fallback_list.py which is:
# Copyright © 2010-2015 Piotr Ożarowski <pi...@debian.org>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.

import re
import sys

try:
    from distro_info import DistroInfo  # python3-distro-info package
except ImportError:
    DistroInfo = None
from gzip import decompress
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

import psycopg2
import psycopg2.extras

if "--ubuntu" in sys.argv and DistroInfo:
    SOURCES = [
        "http://archive.ubuntu.com/ubuntu/dists/%s/Contents-amd64.gz"; % DistroInfo("ubuntu").devel(),
    ]
else:
    SOURCES = [
        "http://ftp.debian.org/debian/dists/unstable/main/Contents-all.gz";,
        "http://ftp.debian.org/debian/dists/unstable/main/Contents-amd64.gz";,
    ]

EXTRA_MAPPINGS = (
    ("https://docs.python.org/3/";, "python3-doc", "/usr/share/doc/python3-doc/html"),
    ("https://requests.readthedocs.io/en/latest/";, "python-requests-doc", "/usr/share/doc/python3-doc/html"),
)

IGNORED_PKGS = set()
IGNORED_DOCUMENTATION_URLS = {
    # points to a README.md
    "https://github.com/spl0k/supysonic/blob/master/README.md",
}

objects_inv_match = re.compile(r"/usr/.+/objects\.inv").fullmatch

data = ""
cache_dir = Path("cache")
if not cache_dir.is_dir():
    cache_dir.mkdir()
for source in SOURCES:
    basename = Path(urlparse(source).path).name
    cache_fpath = cache_dir / basename
    if not cache_fpath.exists():
        with urlopen(source) as fp:
            source_data = fp.read()
            cache_fpath.write_bytes(source_data)
    else:
        source_data = cache_fpath.read_bytes()

    try:
        data += str(decompress(source_data), encoding="UTF-8")
    except UnicodeDecodeError:  # Ubuntu
        data += str(decompress(source_data), encoding="ISO-8859-15")

# If running from a debian.org box:
# dbconn = psycopg2.connect("service=udd")
dbconn = psycopg2.connect(
    host="udd-mirror.debian.net",
    dbname="udd",
    user="udd-mirror",
    password="udd-mirror",
)
cursor = dbconn.cursor()
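# Stage the (binary package, documentation directory) pairs in a temporary
# table, so they can be joined against UDD's tables in a single query.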
cursor.execute(
    """
    CREATE TEMPORARY TABLE intersphinx_mapping
      (package text, inventory_path text);
    CREATE INDEX intersphinx_mapping_package_idx
      ON intersphinx_mapping USING btree (package);
    """
)

result = []

# Contents files don't carry the legacy header/comment block these days;
# if the first entry doesn't start with "bin", assume a header is present
# and skip until the "FILE  LOCATION" line.
is_header = not data.startswith("bin")
for line in data.splitlines():
    if is_header:
        if line.startswith("FILE"):
            is_header = False
        continue
    try:
        path, desc = line.rsplit(maxsplit=1)
    except ValueError:
        # NOTE(jamespage) some lines in Ubuntu are not parseable.
        continue
    path = "/" + path.rstrip()
    section, pkg_name = desc.rsplit("/", 1)
    if pkg_name in IGNORED_PKGS:
        continue

    if not objects_inv_match(path):
        continue

    # strip the "objects.inv" filename, keeping the trailing slash
    doc_path = path.removesuffix("objects.inv")

    # While we could do:
    #   cursor.execute("INSERT INTO intersphinx_mapping VALUES (%s, %s)", (pkg_name, path))
    # ...this is slow due to database roundtrips, so we collect it all in memory in result[]
    # and then batch insert it outside the loop

    result.append((pkg_name, doc_path))

print(f"Found {len(result)} objects.inv paths")
psycopg2.extras.execute_values(
    cursor,
    "INSERT INTO intersphinx_mapping VALUES %s",
    result,
)

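# Map each binary package shipping an objects.inv back to its source package,
# and pick up the upstream "Documentation" URL from UDD's upstream_metadata.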
cursor.execute(
    """
    SELECT DISTINCT(um.value), ap.package, im.inventory_path
    FROM public.all_packages AS ap
    JOIN intersphinx_mapping AS im
      ON ap.package = im.package
    JOIN public.upstream_metadata AS um
      ON ap.source = um.source
    WHERE um.key = 'Documentation'
    ORDER BY um.value;
    """
)

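# Emit the tab-separated mapping; manually curated EXTRA_MAPPINGS come first
# and take precedence over any UDD-derived entry for the same URL.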
with Path("intersphinx_mapping").open("w") as fp:
    for documentation_url, pkg, local_doc_path in EXTRA_MAPPINGS:
        fp.write(f"{documentation_url}\t{pkg}\t{local_doc_path}\n")

    for documentation_url, pkg, local_doc_path in cursor.fetchall():
        if documentation_url in [i[0] for i in EXTRA_MAPPINGS]:
            continue
        if documentation_url in IGNORED_DOCUMENTATION_URLS:
            continue
        fp.write(f"{documentation_url}\t{pkg}\t{local_doc_path}\n")

print(f"Wrote {cursor.rowcount} mappings")

cursor.close()
dbconn.close()
