Source: sphinx
Version: 5.3.0-3
Severity: wishlist

Countless packages contain a stanza in their docs/conf.py like:

intersphinx_mapping = {
    "python": ("https://docs.python.org/3", None),
}
Given that packages in Debian cannot use network access when being built, and
the dh_make Sphinx boilerplates suggest defining http_proxy to keep Sphinx
from resolving these references over the internet, one of two things happens:

1) The maintainer patches the upstream source through debian/patches, to
   point these references to the local filesystem. That's actually what
   src:sphinx does as well for itself, through the intersphinx_local.diff
   patch. A quick codesearch[1] reveals ~385 packages doing something
   similar.

2) The maintainer does not patch the source, Sphinx attempts to fetch the
   file from the network, fails due to http_proxy, and the generated docs do
   not resolve these references. Build-time warnings are emitted of the form:

     WARNING: py:class reference target not found: pathlib.Path

   I don't know of an easy way to grep through the build logs to generate
   numbers; anecdotally, I've seen quite a few packages in that category as
   well. (Perhaps one could add a tag to the Buildd Log Scanner[2] to scan
   for this?)

It'd be great if intersphinx in Debian was patched to map these references to
the local Debian package, and also to generate the necessary dependencies --
perhaps guarded by an environment variable or command-line option that only
dh_sphinx would pass, for example.

Beyond patching the Sphinx code itself, there is of course the matter of
generating these mappings, which is surprisingly non-trivial. From what I can
tell, the mappings need to be created heuristically, since I haven't seen a
way for a package to declare centrally, in Sphinx metadata, where its
generated documentation will be published.

I played around with a few ideas, and while I haven't settled on something
that doesn't feel dirty yet, I tried to implement something akin to what
dh-python's pydist/generate_fallback_list.py does: have a script in the
source, executed manually and periodically to regenerate the cache, which
creates a mapping that is then committed to git and shipped in the binary.
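To illustrate option (1) above, a debian/patches override of this kind
typically amounts to pointing the stanza at the locally installed
documentation package (a sketch; the python3-doc path is the usual location
of the locally packaged CPython docs):

```python
# docs/conf.py, as patched via debian/patches: resolve the "python"
# inventory from the locally installed python3-doc package rather
# than over the network from docs.python.org.
intersphinx_mapping = {
    "python": ("/usr/share/doc/python3-doc/html", None),
}
```

Automating exactly this rewrite (plus the corresponding Depends on
python3-doc) is what the proposal below is about.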
So I implemented the attached proof of concept, which:

* Scans Contents-all and Contents-amd64 to find objects.inv files and maps
  them back to their binary packages;
* Queries UDD (through one query with joins) to:
  - find the respective source packages for these binary packages
  - find the upstream metadata[3] for these source packages
* Prints a tab-separated intersphinx_mappings file of the form:
  <documentation URL>\t<binary package>\t<objects.inv file>

It takes ~10s to run on my computer right now, which should be fine for being
executed periodically by the maintainer.

I'm sure there are many issues with this approach that I haven't thought
through, as well as a number of corner cases, but I wanted to have something
to kickstart this discussion beyond just wishful thinking! I'd love any
feedback!

Note that I haven't looked at all at what it would take to integrate this
mapping into the Sphinx source (as well as ${sphinx:Depends}), as I thought
it'd be good to validate the approach before doing so.

Regards,
Faidon

1: https://codesearch.debian.net/search?q=intersphinx_mapping+path%3Adebian%2F&literal=1&perpkg=1
2: https://qa.debian.org/bls/
3: I ran into stale data in that table, which is now tracked as #1032587
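To make the file format concrete, here is a hypothetical consumer-side sketch
of what a patched intersphinx could do with the generated file: rewrite a
package's intersphinx_mapping to local paths and collect the binary packages
that would feed ${sphinx:Depends}. The function name and rewriting logic are
my assumptions for illustration, not part of the attached PoC:

```python
def localize_mapping(intersphinx_mapping, mappings_path):
    """Rewrite an upstream intersphinx_mapping against the generated
    tab-separated file: <documentation URL>\t<binary package>\t<local path>.

    Returns the rewritten mapping plus the sorted list of binary packages
    that would become documentation dependencies."""
    local = {}  # documentation URL (sans trailing slash) -> (package, path)
    with open(mappings_path) as fp:
        for line in fp:
            url, pkg, doc_path = line.rstrip("\n").split("\t")
            local[url.rstrip("/")] = (pkg, doc_path)

    rewritten, depends = {}, set()
    for name, (url, inv) in intersphinx_mapping.items():
        match = local.get(url.rstrip("/"))
        if match:
            pkg, doc_path = match
            rewritten[name] = (doc_path, inv)
            depends.add(pkg)
        else:
            rewritten[name] = (url, inv)  # leave unresolved entries alone
    return rewritten, sorted(depends)
```

An entry that resolves gets its URL swapped for the local path; anything
unmatched is left untouched, so the build degrades to today's behaviour
(a warning) rather than failing.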
#!/usr/bin/python3
#
# Copyright © 2023 Faidon Liambotis <parav...@debian.org>
#
# Roughly based on dh-python/pydist/generate_fallback_list.py which is:
# Copyright © 2010-2015 Piotr Ożarowski <pi...@debian.org>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
# THE SOFTWARE.
import re
import sys

try:
    from distro_info import DistroInfo  # python3-distro-info package
except ImportError:
    DistroInfo = None
from gzip import decompress
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

import psycopg2
import psycopg2.extras

if "--ubuntu" in sys.argv and DistroInfo:
    SOURCES = [
        "http://archive.ubuntu.com/ubuntu/dists/%s/Contents-amd64.gz"
        % DistroInfo("ubuntu").devel(),
    ]
else:
    SOURCES = [
        "http://ftp.debian.org/debian/dists/unstable/main/Contents-all.gz",
        "http://ftp.debian.org/debian/dists/unstable/main/Contents-amd64.gz",
    ]

EXTRA_MAPPINGS = (
    ("https://docs.python.org/3/", "python3-doc",
     "/usr/share/doc/python3-doc/html"),
    ("https://requests.readthedocs.io/en/latest/", "python-requests-doc",
     "/usr/share/doc/python-requests-doc/html"),
)

IGNORED_PKGS = set()

IGNORED_DOCUMENTATION_URLS = {
    # points to a README.md
    "https://github.com/spl0k/supysonic/blob/master/README.md",
}

objects_inv_match = re.compile(r"/usr/.+/objects\.inv").fullmatch

data = ""
cache_dir = Path("cache")
if not cache_dir.is_dir():
    cache_dir.mkdir()
for source in SOURCES:
    basename = Path(urlparse(source).path).name
    cache_fpath = cache_dir / basename
    if not cache_fpath.exists():
        with urlopen(source) as fp:
            source_data = fp.read()
        cache_fpath.write_bytes(source_data)
    else:
        source_data = cache_fpath.read_bytes()
    try:
        data += str(decompress(source_data), encoding="UTF-8")
    except UnicodeDecodeError:
        # Ubuntu
        data += str(decompress(source_data), encoding="ISO-8859-15")

# If running from a debian.org box:
# dbconn = psycopg2.connect("service=udd")
dbconn = psycopg2.connect(
    host="udd-mirror.debian.net",
    dbname="udd",
    user="udd-mirror",
    password="udd-mirror",
)
cursor = dbconn.cursor()
cursor.execute(
    """
    CREATE TEMPORARY TABLE intersphinx_mapping (package text, inventory_path text);
    CREATE INDEX intersphinx_mapping_package_idx
        ON intersphinx_mapping USING btree (package);
    """
)

result = []
# Contents files don't contain a header comment these days
is_header = not data.startswith("bin")
for line in data.splitlines():
    if is_header:
        if line.startswith("FILE"):
            is_header = False
        continue
    try:
        path, desc = line.rsplit(maxsplit=1)
    except ValueError:
        # NOTE(jamespage) some lines in Ubuntu are not parseable.
        continue
    path = "/" + path.rstrip()
    section, pkg_name = desc.rsplit("/", 1)
    if pkg_name in IGNORED_PKGS:
        continue
    if not objects_inv_match(path):
        continue
    # removesuffix(), not rstrip(): rstrip would strip any trailing
    # *characters* from the set "objects.inv", not the literal suffix
    doc_path = path.removesuffix("objects.inv")
    # While we could do:
    #   cursor.execute("INSERT INTO intersphinx_mapping VALUES (%s, %s)",
    #                  (pkg_name, path))
    # ...this is slow due to database roundtrips, so we collect it all in
    # memory in result[] and then batch-insert it outside the loop.
    result.append((pkg_name, doc_path))

print(f"Found {len(result)} objects.inv paths")
psycopg2.extras.execute_values(
    cursor,
    "INSERT INTO intersphinx_mapping VALUES %s",
    result,
)

cursor.execute(
    """
    SELECT DISTINCT(um.value), ap.package, im.inventory_path
      FROM public.all_packages AS ap
      JOIN intersphinx_mapping AS im ON ap.package = im.package
      JOIN public.upstream_metadata AS um ON ap.source = um.source
     WHERE um.key = 'Documentation'
     ORDER BY um.value;
    """
)

with Path("intersphinx_mapping").open("w") as fp:
    for documentation_url, pkg, local_doc_path in EXTRA_MAPPINGS:
        fp.write(f"{documentation_url}\t{pkg}\t{local_doc_path}\n")
    for documentation_url, pkg, local_doc_path in cursor.fetchall():
        if documentation_url in [i[0] for i in EXTRA_MAPPINGS]:
            continue
        if documentation_url in IGNORED_DOCUMENTATION_URLS:
            continue
        fp.write(f"{documentation_url}\t{pkg}\t{local_doc_path}\n")
    print(f"Wrote {cursor.rowcount} mappings")

cursor.close()
dbconn.close()
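Since the generated intersphinx_mapping file would be committed to git and
shipped in a binary package, it may also be worth sanity-checking it when it
is regenerated. A hypothetical checker (the function name and the specific
checks are my assumptions) could verify that every line has exactly three
tab-separated fields, an http(s) documentation URL, and a /usr doc path:

```python
from urllib.parse import urlparse


def check_mapping_file(path):
    """Return a list of (lineno, problem) tuples for malformed lines in a
    generated intersphinx_mapping file."""
    problems = []
    with open(path) as fp:
        for lineno, line in enumerate(fp, start=1):
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 3:
                problems.append((lineno, "expected 3 tab-separated fields"))
                continue
            url, pkg, doc_path = fields
            if urlparse(url).scheme not in ("http", "https"):
                problems.append((lineno, "documentation URL is not http(s)"))
            if not doc_path.startswith("/usr/"):
                problems.append((lineno, "doc path not under /usr"))
    return problems
```

Hooking something like this into the regeneration script would catch stale or
garbled upstream-metadata entries (such as the README.md case already
special-cased above) before they land in git.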