Heya!

I'm interested in making a plan for capturing editors, but there are things
to unpack.

I have notes ready that I'm happy to add to a new page in the wiki, but
looking at the "RecentChanges" I see that there isn't a page relating to
this that has been added recently (as far as I can tell!). Is that correct?

I am very happy to make a new page and put my notes and analysis so far
there.

In terms of traffic and popularity: from my analysis I see traffic is on
the order of 100-150K page loads per day. It's not just "long tail"; the
usage chart is L-shaped (attached below).

The 503s are quite regular for me, though I'm seeing better stability
lately. I think rate-limiting is a good idea. I was never able to achieve
any decent number of downloads, and overall it took me weeks of patiently
waiting for stability to gather what I was able to.

In particular there are 2 cases that behaviourally seem to incur much more
load and are far more likely to bring the site down (503). Technically it
must be the case that a data store of some kind is being queried/scanned,
though I'm not sure what form this takes specifically.

The first is that the wiki does not return conventional 404 pages, but
instead placeholder pages. I think for a wiki this is quite lovely
behaviour, but it does mean there are thousands and thousands of very
real-looking links throughout that are not real pages, and the only way to
determine one way or the other is to request the page, which per above is
much slower than checking real pages. I was caught out by this and it
introduced a layer of effort that was more than I expected, so I ended up
writing some tooling around this.

The other, subtler element that I believe impacts load is that the links in
the navigation on every page point to parameterised URLs such as
`?action=`, which I'm guessing aren't (can't be) statically stored.
Therefore again a query/scan is required, which is slow and inclined to
bring down the site, and there are multiple of these on every single page.
I filtered out the parameterised paths quick-smart because they were
introducing such a slowdown, but I see less attentive spiders do not do
this.

For example, it was brought to my attention by Keith that the Tehran Python
User Group are also interested in this space currently and were doing the
same thing: https://github.com/tehpug/python_org_data . This is a pretty
nice project and pretty impressive, but upon analysis of their work I see
that it stalled about a quarter of the way through in terms of actual data
gathering, having apparently been caught out by both of the above. I
absolutely understand why the project slowed. It took me in the order of
7-8 weeks of persistence, and additionally recursing and recursing to
figure out the 404/placeholder links (which I assume, hypothetically, would
never end).

To be honest, I'd needed to layer so much on in addition to a "normal" data
collection exercise that it was getting out of control by about week 10, so
I took the approach of creating a fresh tabula rasa project and
cherry-picking all the "known-good" pages across. That was before I got
access to the actual export, which I still haven't been able to carve out
the time to reconcile against my work, as I'm choosing to prioritise other
("real") parts of the project.
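
To illustrate the kind of filtering described above, here is a minimal
sketch of dropping parameterised links (such as `?action=`) before
requesting anything. It is only an illustration under stated assumptions,
not the actual tooling: the function names are hypothetical, and the host
is taken from the TitleIndex URL quoted further down this thread.

```python
from urllib.parse import urldefrag, urlsplit

# Assumption: the wiki host, as seen in the TitleIndex URL in this thread.
WIKI_HOST = "wiki.python.org"


def is_plain_page_url(url: str) -> bool:
    """Return True for wiki URLs that carry no query string."""
    parts = urlsplit(url)
    if parts.netloc != WIKI_HOST:
        return False
    # Links such as ?action=... are parameterised, so they are skipped.
    return parts.query == ""


def filter_links(links):
    """Drop fragments, parameterised URLs and duplicates, keeping order."""
    seen = set()
    kept = []
    for link in links:
        url, _fragment = urldefrag(link)  # strip any "#anchor" part
        if is_plain_page_url(url) and url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

A filter like this only avoids the parameterised paths. It does nothing for
the placeholder pages, which still have to be requested to be identified.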

Materially these issues don't make a difference and this is not important
to fix or address, but I'm more just describing the experience of this
class of outsider looking at the wiki.

For me I consider all the above just the side-show. The main event is
actually analysing the pages themselves, and I have a lot more to say about
that and have taken much more care. I believe I've now "touched" every page
(through many hours/day of meticulously applying data wrangling skills),
but this part will merrily take all the time/energy given to it, and I'm
still chipping away as time presents.

IMHO: the 2 most critical parts are:

* How to make editing as easy as possible for editors
* Making the pages as useful as possible

I have a tremendous amount to say and contribute to both these parts, and
ideas for what I think would be good, though my opinions are only 1 set of
opinions. I'm ready right now to have this conversation, just waiting for
the correct forum.

So back to the start: I'm happy to make a wiki page! (Like the complete
nerd I clearly am :) The page title in my notes is "Reporting on PSF Wiki
project analysis". I feel like this is not ideal, which is why I was
waiting for the new page discussed above.

I don't mind the idea of something like "PSF Wiki upgrade" or maybe even
back to the original idea of "PSF Wiki WG"? (naming things is hard)

Kind regards,
---
Elena Williams
Github: elena <http://github.com/elena/>

[image: image.png]

Removing top 100:

[image: image.png]



On Thu, 6 Mar 2025 at 21:28, Marc-Andre Lemburg <m...@egenix.com> wrote:

> We can hash out a plan to do a new drive for editors in the coming weeks.
>
> I'll try to put together a wiki page outlining what we've discussed so far.
>
> I'm also having a call with a lead moin2 developer next week to see how
> realistic migrating to moin2 is at this point.
>
> Since I was seeing a few 503s when using the wiki recently, I asked our
> infra team for help. They will add another vCPU to the VM to help address
> load spikes. Rate limiting should further help against LLM scrapers causing
> too much load. Moin surge protection is already in place (
> https://moinmo.in/HelpOnConfiguration/SurgeProtection).
>
>
> On 05.03.2025 20:55, Elena Williams via pydotorg-www wrote:
>
> These later numbers are more consistent with what I have found and can
> actually be seen.
>
> I'm happy to help clean up having done substantial work already on what
> this could look like and strategies (particularly looking at the discussion
> from the recent docs meeting), though not sure how to action.
>
>
> ---
> Elena Williams
>
>
> On Thu, 6 Mar 2025 at 05:38, Marc-Andre Lemburg <m...@egenix.com> wrote:
>
>> Correction for the numbers: We have 3400+ pages and 47k users.
>>
>> I had looked at a backup which doesn't remove things which were deleted
>> on the main server - because unfortunately, moin's logic for deleting pages
>> is to actually delete them on disk, without any way to get them back.
>>
>> Instead of deleting a page, it's normally better to either add a redirect
>> or to put a notice on the page that the content was cleared. That way, the
>> history remains available. It may actually be a good idea to disable the
>> delete action (if possible, I'd have to check).
>>
>> On 05.03.2025 12:47, Marc-Andre Lemburg wrote:
>>
>> FYI: I've started looking into the moin2 migration...
>>
>>
>> https://github.com/moinwiki/moin/discussions/1717#discussioncomment-12399187
>>
>> In order to get there, we will need to do a test installation to hash out
>> any problems we may run into and evaluate the state of moin2.
>>
>> They just released 2.0.0b2.
>>
>> I'll see whether I can find some time later this week to get something
>> going.
>>
>> *I also checked our current stats:*
>>
>> We have 32k pages in the wiki and 221k users.
>>
>> Those numbers are what we have in the backend. Moin itself lists the
>> number of pages as 3436.
>>
>> Looking at the page names, we'll be able to clean up a lot of spam pages
>> which have accumulated before we added the editor signup requirement. Many
>> of those are empty pages, so we should be able to write a tool to clean
>> those up.
>>
>> It looks like Moin filters out those empty pages itself, since the title
>> index does not list them:
>>
>> https://wiki.python.org/moin/TitleIndex
>>
>> Scanning through those 3.4k page titles, most of those look legitimate.
>> And there's a lot of history in there :-)
>>
>> Similarly, we should be able to go through the user accounts and clear
>> out all accounts which have not done any edits, in order to bring the
>> numbers down.
>>
>> Thanks,
>>
>> --
>> Marc-Andre Lemburg
>> eGenix.com
>>
>> Professional Python Services directly from the Experts (#1, Mar 05 2025)
>> >>> Python Projects, Coaching and Support ...    https://www.egenix.com/
>> >>> Python Product Development ...        https://consulting.egenix.com/
>> ________________________________________________________________________
>>
>> ::: We implement business ideas - efficiently in both time and costs :::
>>
>>    eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
>>     D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
>>            Registered at Amtsgericht Duesseldorf: HRB 46611
>>                https://www.egenix.com/company/contact/
>>                      https://www.malemburg.com/
>>
>>
>> _______________________________________________
>> pydotorg-www mailing list
>> pydotorg-www@python.org
>> https://mail.python.org/mailman/listinfo/pydotorg-www
>>
> --
> Marc-Andre Lemburg
> eGenix.com