[Pywikipedia-bugs] [Maniphest] [Created] T129021: ReferringPageGenerator only returns 500 pages instead of all

Multichill Sun, 06 Mar 2016 05:03:32 -0800

Multichill created this task.
Herald added subscribers: pywikibot-bugs-list, Aklapper.


TASK DESCRIPTION
  I dusted off an old bot that used to work. It uses the following code to get 
all items that use a certain property:
  
    repo = pywikibot.Site().data_repository()
    ppage = pywikibot.PropertyPage(repo, u'Property:P197')
    
    gen = pagegenerators.NamespaceFilterPageGenerator(
        pagegenerators.ReferringPageGenerator(
            ppage, withTemplateInclusion=False, onlyTemplateInclusion=False),
        namespaces=[0])
  
  This used to return every usage, but it now stops after about 500 items.
The last item I get is Q455649, which happens to be the last item on
https://www.wikidata.org/w/index.php?title=Special:WhatLinksHere/Property:P197&limit=500
so it looks like not all items are retrieved.
pagegenerators.ReferringPageGenerator is supposed to return all usage:
  
    def ReferringPageGenerator(referredPage, followRedirects=False,
                               withTemplateInclusion=True,
                               onlyTemplateInclusion=False,
                               total=None, content=False):
        """Yield all pages referring to a specific page."""
        return referredPage.getReferences(
            follow_redirects=followRedirects,
            withTemplateInclusion=withTemplateInclusion,
            onlyTemplateInclusion=onlyTemplateInclusion,
            total=total, content=content)
  
  The total argument is not set, so total=None should be passed on to Page.getReferences():
  
    def getReferences(self, follow_redirects=True, withTemplateInclusion=True,
                      onlyTemplateInclusion=False, redirectsOnly=False,
                      namespaces=None, total=None, content=False):
        """
        Return an iterator of all pages that refer to or embed the page.
    
        If you need a full list of referring pages, use
        C{pages = list(s.getReferences())}
    
        @param follow_redirects: if True, also iterate pages that link to a
            redirect pointing to the page.
        @param withTemplateInclusion: if True, also iterate pages where self
            is used as a template.
        @param onlyTemplateInclusion: if True, only iterate pages where self
            is used as a template.
        @param redirectsOnly: if True, only iterate redirects to self.
        @param namespaces: only iterate pages in these namespaces
        @param total: iterate no more than this number of pages in total
        @param content: if True, retrieve the content of the current version
            of each referring page (default False)
        """
        # N.B.: this method intentionally overlaps with backlinks() and
        # embeddedin(). Depending on the interface, it may be more efficient
        # to implement those methods in the site interface and then combine
        # the results for this method, or to implement this method and then
        # split up the results for the others.
        return self.site.pagereferences(
            self,
            followRedirects=follow_redirects,
            filterRedirects=redirectsOnly,
            withTemplateInclusion=withTemplateInclusion,
            onlyTemplateInclusion=onlyTemplateInclusion,
            namespaces=namespaces,
            total=total,
            content=content
        )
  
  This hands the work on to Site.pagereferences(), again with total=None and
withTemplateInclusion=False, onlyTemplateInclusion=False:
  
    def pagereferences(self, page, followRedirects=False, filterRedirects=None,
                       withTemplateInclusion=True, onlyTemplateInclusion=False,
                       namespaces=None, total=None, content=False):
        """
        Convenience method combining pagebacklinks and page_embeddedin.

        @param namespaces: If present, only return links from the namespaces
            in this list.
        @type namespaces: iterable of basestring or Namespace key,
            or a single instance of those types.  May be a '|' separated
            list of namespace identifiers.
        @raises KeyError: a namespace identifier was not resolved
        @raises TypeError: a namespace identifier has an inappropriate
            type such as NoneType or bool
        """
        if onlyTemplateInclusion:
            return self.page_embeddedin(page, namespaces=namespaces,
                                        filterRedirects=filterRedirects,
                                        total=total, content=content)
        if not withTemplateInclusion:
            return self.pagebacklinks(page, followRedirects=followRedirects,
                                      filterRedirects=filterRedirects,
                                      namespaces=namespaces,
                                      total=total, content=content)
  
  (I skipped the last part.) Because withTemplateInclusion is False, the call
should take the "if not withTemplateInclusion" branch and reach
pagebacklinks().
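
  To double-check that reading of the dispatch without touching a live wiki,
here is a minimal sketch. StubSite is a hypothetical stand-in, not a Pywikibot
class: it copies only the two quoted checks from pagereferences() and records
which method receives the call and with which keyword arguments. The skipped
tail of the real method is not modelled.

```python
class StubSite:
    """Hypothetical stand-in for the site object; records the call it receives."""

    def __init__(self):
        self.calls = []

    def page_embeddedin(self, page, **kwargs):
        self.calls.append(("page_embeddedin", kwargs))
        return iter(())

    def pagebacklinks(self, page, **kwargs):
        self.calls.append(("pagebacklinks", kwargs))
        return iter(())

    def pagereferences(self, page, followRedirects=False, filterRedirects=None,
                       withTemplateInclusion=True, onlyTemplateInclusion=False,
                       namespaces=None, total=None, content=False):
        # Same two checks as the quoted method.
        if onlyTemplateInclusion:
            return self.page_embeddedin(page, namespaces=namespaces,
                                        filterRedirects=filterRedirects,
                                        total=total, content=content)
        if not withTemplateInclusion:
            return self.pagebacklinks(page, followRedirects=followRedirects,
                                      filterRedirects=filterRedirects,
                                      namespaces=namespaces,
                                      total=total, content=content)
        # The rest of the real method was not quoted in this report.
        raise NotImplementedError("tail of pagereferences() was not quoted")


site = StubSite()
# Same flags as the bot passes through ReferringPageGenerator; total is left unset.
site.pagereferences("Property:P197", withTemplateInclusion=False,
                    onlyTemplateInclusion=False)
name, kwargs = site.calls[0]
print(name, kwargs["total"])  # prints: pagebacklinks None
```

  If this mirrors the real code path, the 500-item cutoff is not caused by a
total limit set in these layers, which points further down the stack.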
  
    def pagebacklinks(self, page, followRedirects=False, filterRedirects=None,
                      namespaces=None, total=None, content=False):
        """Iterate all pages that link to the given page.
    
        @param page: The Page to get links to.
        @param followRedirects: Also return links to redirects pointing to
            the given page.
        @param filterRedirects: If True, only return redirects to the given
            page. If False, only return non-redirect links. If None, return
            both (no filtering).
        @param namespaces: If present, only return links from the namespaces
            in this list.
        @type namespaces: iterable of basestring or Namespace key,
            or a single instance of those types.  May be a '|' separated
            list of namespace identifiers.
        @param total: Maximum number of pages to retrieve in total.
        @param content: if True, load the current content of each iterated page
            (default False)
        @raises KeyError: a namespace identifier was not resolved
        @raises TypeError: a namespace identifier has an inappropriate
            type such as NoneType or bool
        """
        bltitle = page.title(withSection=False).encode(self.encoding())
        blargs = {"gbltitle": bltitle}
        if filterRedirects is not None:
            blargs["gblfilterredir"] = (filterRedirects and "redirects" or
                                        "nonredirects")
        blgen = self._generator(api.PageGenerator, type_arg="backlinks",
                                namespaces=namespaces, total=total,
                                g_content=content, **blargs)
        if followRedirects:
            # links identified by MediaWiki as redirects may not really be,
            # so we have to check each "redirect" page and see if it
            # really redirects to this page
            # see fixed MediaWiki bug T9304
            redirgen = self._generator(api.PageGenerator,
                                       type_arg="backlinks",
                                       gbltitle=bltitle,
                                       gblfilterredir="redirects")
            genlist = {None: blgen}
            for redir in redirgen:
                if redir == page:
                    # if a wiki contains pages whose titles contain
                    # namespace aliases that existed before those aliases
                    # were defined (example: [[WP:Sandbox]] existed as a
                    # redirect to [[Wikipedia:Sandbox]] before the WP: alias
                    # was created) they can be returned as redirects to
                    # themselves; skip these
                    continue
                if redir.getRedirectTarget() == page:
                    genlist[redir.title()] = self.pagebacklinks(
                        redir, followRedirects=True,
                        filterRedirects=filterRedirects,
                        namespaces=namespaces,
                        content=content
                    )
            return itertools.chain(*list(genlist.values()))
        return blgen
  
  This function doesn't seem to contain any loop, so it probably only hits
https://www.wikidata.org/w/api.php?action=help&recursivesubmodules=1#query+backlinks
once. Maybe someone broke this when deprecating "step"?
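
  For comparison, here is a rough sketch of what following the API's
continuation would look like. It assumes the usual MediaWiki behaviour: when
more results exist, the response carries a "continue" object whose values are
merged into the next request. fake_backlinks_request and its 1200 made-up
titles are purely hypothetical stand-ins for api.php, and the integer offset
it uses for blcontinue is not the real token format. This is not a claim about
how Pywikibot's api.PageGenerator handles continuation internally.

```python
def fake_backlinks_request(params):
    """Hypothetical stand-in for one api.php call: 1200 fake titles, 500 per request."""
    titles = ["Q%d" % i for i in range(1, 1201)]
    start = int(params.get("blcontinue", "0"))
    limit = 500
    batch = titles[start:start + limit]
    response = {"query": {"backlinks": [{"title": t} for t in batch]}}
    if start + limit < len(titles):
        # More results remain, so tell the client how to ask for them.
        response["continue"] = {"blcontinue": str(start + limit),
                                "continue": "-||"}
    return response


def iter_backlinks(request, total=None):
    """Yield backlink titles, following 'continue' until it is absent.

    total=None means no cap; otherwise stop after that many titles.
    """
    params = {"action": "query", "list": "backlinks",
              "bltitle": "Property:P197", "bllimit": "max"}
    yielded = 0
    while True:
        data = request(params)
        for entry in data["query"]["backlinks"]:
            if total is not None and yielded >= total:
                return
            yield entry["title"]
            yielded += 1
        if "continue" not in data:
            return  # last batch reached
        # Merge the continuation values into the next request.
        params = dict(params, **data["continue"])


single_request = fake_backlinks_request({})["query"]["backlinks"]
all_titles = list(iter_backlinks(fake_backlinks_request))
print(len(single_request), len(all_titles))  # prints: 500 1200
```

  A single request in this sketch returns exactly one batch of 500, which
matches the symptom above; only the loop that re-sends the "continue" values
reaches the remaining items.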

TASK DETAIL
  https://phabricator.wikimedia.org/T129021

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Multichill
Cc: Aklapper, pywikibot-bugs-list, Multichill



_______________________________________________
pywikibot-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/pywikibot-bugs
