Re: [Pywikipedia-l] The easiest way to fetch recently changed pages

Adam Klimont Tue, 07 Feb 2012 05:16:48 -0800

Thank you for your advice! I wrote a simple method that returns a set
of titles of pages changed since "limit":
http://paste.pocoo.org/show/547240/. It does not return the exact set
(for that I think I would have to check the timestamps in the last
iteration), so in the worst case it returns 100 unwanted titles, but
that is not a problem for my purposes.
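
For readers without access to the paste, here is a minimal sketch of the
idea, not the pasted code itself. It assumes the change records arrive
newest first as dicts with 'title' and 'timestamp' keys; that record
shape, the function name titles_changed_since and the sample data are
all made up for illustration. In a real bot the records would come from
something like the site object's recentchanges() generator mentioned
below. Checking the timestamp of every entry, rather than of a whole
batch, is what avoids the extra unwanted titles.

```python
from datetime import datetime


def titles_changed_since(changes, limit):
    """Return the set of titles edited at or after `limit`.

    `changes` is any iterable of dicts with 'title' and 'timestamp'
    keys, ordered newest first (an assumption of this sketch).
    """
    titles = set()
    for change in changes:
        if change["timestamp"] < limit:
            # Entries are newest first, so everything after this one
            # is older as well and iteration can stop here.
            break
        titles.add(change["title"])
    return titles


# Hypothetical sample data standing in for a recent-changes feed.
sample_changes = [
    {"title": "A", "timestamp": datetime(2012, 2, 7, 5, 0)},
    {"title": "B", "timestamp": datetime(2012, 2, 6, 12, 0)},
    {"title": "A", "timestamp": datetime(2012, 2, 3, 9, 0)},
    {"title": "C", "timestamp": datetime(2012, 1, 20, 8, 0)},
    {"title": "D", "timestamp": datetime(2012, 1, 10, 8, 0)},
]

limit = datetime(2012, 1, 31)
recent_titles = titles_changed_since(sample_changes, limit)
print(sorted(recent_titles))
```

Because the loop stops at the first entry older than "limit", a lazy
generator is only consumed as far as needed, and using a set means a
page edited several times is listed once.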


Cheers
alkamid

On 6 February 2012 02:27, Morten Wang <[email protected]> wrote:
> To me the implementation depends on what alkamid actually wants to do.
> To keep some of SuggestBot's data sources up to date, I use the site
> object's recentchanges() generator to grab data. Although one can only
> get a limited amount at each step, I've never had trouble exhausting
> the generator, and it's easy to check the edit timestamp and stop
> iterating when necessary. I then store the page titles in a set(),
> which can be fed to a PagesFromTitlesGenerator, and I chain that
> generator with a PreloadingGenerator to get the latest revisions.
>
> In my experience only a minority of a Wikipedia edition's articles are
> updated on a weekly basis, so using allpages() results in a lot of
> unnecessary data.
>
>
> Cheers,
> Morten
>
> On 5 February 2012 17:28, Dr. Trigon <[email protected]> wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>>> past week? I thought of using the AllPagesPageGenerator and
>>> executing editTime() on each page, but this method gives me only
>>> zeros if the page was not read before (e.g. I have to call
>>> page.get() first in order for editTime() to work properly). Is
>>> there any edit-time-related piece of information I can get from a
>>> generated list of pages? Or maybe there is another page generator
>>> suitable for me?
>>
>> Everything that uses 'getall' from 'wikipedia.py' (imported as
>> 'pywikibot') gives you the first history entry WITHOUT having to
>> trigger page.get(), e.g. the 'PreloadingGenerator'. Since you can
>> chain generators, you can first set up your generator as 'gen1' and
>> then pass 'gen1' to a 'PreloadingGenerator' (maybe inside a
>> 'ThreadedGenerator'...) to get the first history entry of every
>> page. There is an example of this in 'sum_disc.py' in the
>> DrTrigonBot repo.
>>
>> Greetings
>>
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.12 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>
>> iEYEARECAAYFAk8vEKcACgkQAXWvBxzBrDAMTwCfe7kKUHrtgsE+EguKAuiWoODb
>> zr4An2M5d6G0XZJGMntDLS54DL6XGdug
>> =37Hk
>> -----END PGP SIGNATURE-----
>>
>> _______________________________________________
>> Pywikipedia-l mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/pywikipedia-l
>

