On Fri, Nov 1, 2024, at 6:10 PM, Bob Weinand wrote:
> On 1.11.2024 22:41:29, Larry Garfield wrote:
>> In a similar vein to approving the use of software, Roman Pronskiy asked for 
>> my help putting together an RFC on collecting analytics for PHP.net.
>>
>> https://wiki.php.net/rfc/phpnet-analytics
>>
>> Of particular note:
>>
>> * This is self-hosted, first-party only.  No third parties get data, so no 
>> third parties can do evil things with it.
>> * There is no plan to collect any PII.
>> * The goal is to figure out how to most efficiently spend Foundation money 
>> improving php.net, something that is sorely needed.
>>
>> Ideally we'd have this in place by the 8.4 release or shortly thereafter, 
>> though I realize that's a little tight on the timeline.
>
> Hey Larry,
>
> I have a couple concerns and questions:
>
> Is there a way to track analytics with only transient data? As in, data 
> actually stored is always already anonymized enough that it would be 
> unproblematic to share it with everyone?
> Or possibly, is there a retention period for the raw data after which 
> only anonymized data remains?

The plan is to configure Matomo to not collect anything non-anonymous to begin 
with, to the extent possible.  We're absolutely not talking about user-stalking 
like ad companies do, or anything even remotely close to that.

I'm not convinced that publishing raw data, even anonymized, is valuable or 
responsible.  I don't know of any other sites offhand that publish their raw 
analytics, and I don't know what purpose that would serve other than just a 
principled "radical transparency" stance, which I generally don't agree with.

However, the goal is to have an automated aggregate dashboard, similar to 
https://analytics.bookstackapp.com/bookstackapp.com (made by a different tool, 
but same idea), that we could make public.  We don't want to do that until 
it's been running a while and we're sure that nothing personally identifiable 
could leak through that way.

> Do you actually have a plan what to use that data for? The RFC mostly 
> talks about "high traffic". But does that mean anything? I do look at a 
> documentation page, because I need to look something specific up (what 
> was the order of arguments of strpos again?). I may only look shortly at 
> it. Maybe even often. But it has absolutely zero signal on whether the 
> documentation page is good enough. In that case I don't look at the 
> comments either. Comments are something you rarely look at, mostly the 
> first time you want to even use a function.

Right now, the key problem is that there's a lot of "we don't know what we 
don't know."  We want to improve the site and docs, the Foundation wants to 
spend money on doing so, but other than "fill in the empty pages" we have no 
definition of "improve" to work from.  The intent is that better data will give 
us a better sense of what "improve" even means.  

It would also be useful for marketing campaigns, even on-site.  E.g., if we 
spend the time to write a "How to convince your boss to use PHP" page... how 
useful is it?  From logs, all we could get is page count.  That's it.  Or the 
PHP-new-release landing page that we've put up for the last several releases.  
Do people actually get value out of that?  Do they bother to scroll down 
through each section, or do they just look at the first one or two and leave, 
meaning the time we spent on any other items is wasted?  Right now, we have no 
idea if the time spent on those is even useful.

Another example off the top of my head: Right now, the enum documentation is 
spread across a dozen sub-pages.  I don't know exactly why I did that in the 
first place rather than one huge page, other than "huge pages bad."  But are 
they bad?  Would it be better to combine enums back into fewer pages, or to 
split the long visibility page up into smaller ones?  I have no idea.  We need 
data to answer that.

It's also absolutely true that analytics are not the end of data collection.  
User surveys, usability tests, etc. are also highly valuable, and can get you a 
different kind of data.  We should likely do those at some point, but that 
doesn't make automated analytics not useful.


Another concern with just using raw logs is that it would be more work to 
set up, and have more moving parts to break.  Let's be honest, PHP has an 
absolutely terrible track record when it comes to keeping our moving parts 
working, and the Infra Team right now is tiny.  The bus factor there is a 
concern.  Using a client-side tracker is the more-supported and 
fewer-custom-scripts approach, which makes it easier for someone new to pick it 
up when needed.

Logs will also fold everyone behind a NAT together into a single IP, and thus 
a single "user."  IP address is in general a pretty poor way of uniquely 
identifying people, given the number of translation layers on the Internet 
these days.
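
To illustrate the NAT point, here is a minimal Python sketch.  The records, 
IP addresses, paths, and names are made up for illustration, and the "person" 
field is something real logs would never contain; it is there only to show 
what an IP-based count misses.

```python
from collections import Counter

# Hypothetical, already-parsed access-log records: (client_ip, path, person).
# The third field does NOT exist in real server logs; it is included only
# so we can compare the log-based estimate with the true number of visitors.
REQUESTS = [
    ("203.0.113.7", "/manual/en/function.strpos.php", "alice"),
    ("203.0.113.7", "/manual/en/function.strpos.php", "bob"),    # same office NAT
    ("203.0.113.7", "/manual/en/language.enumerations.php", "carol"),
    ("198.51.100.4", "/manual/en/function.strpos.php", "dave"),
]


def summarize(requests):
    """Return per-page hit counts, unique IP count, and true visitor count."""
    hits = Counter(path for _, path, _ in requests)
    unique_ips = len({ip for ip, _, _ in requests})
    unique_people = len({person for _, _, person in requests})
    return hits, unique_ips, unique_people


hits, unique_ips, unique_people = summarize(REQUESTS)
print(hits["/manual/en/function.strpos.php"])  # page count is all logs give us
print(unique_ips, unique_people)               # IPs undercount the people
```

In this toy data set, three people behind one office NAT collapse into a 
single IP, so counting "users" by IP reports half the real number.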

> Overall I feel like the signal we can get from using a JS tracker 
> specifically is comparatively low to the point it's not actually worth it.

Some more things a client-side tracker could do that logs cannot:

* How many people are accessing the site from a desktop vs mobile?
* What speed connection do people have?
* How many people are using the in-browser Wasm code runner that is currently 
being worked on?  cf: https://github.com/php/web-php/pull/1097

Also, for reference, most language sites do have some kind of analytics, 
usually Google:

https://www.python.org – Plausible.io, Google Analytics
https://go.dev/ – Google Analytics
https://www.rust-lang.org/ – N/A
https://nodejs.org/ – Google Analytics
https://www.typescriptlang.org/ – N/A
https://kotlinlang.org/ – Google Analytics
https://www.swift.org/ – Adobe Analytics
https://www.ruby-lang.org/ – Google Analytics

We'd be the only one with a self-hosted option, making ours the most 
privacy-conscious of the bunch.

As far as blocking the analytics goes, Matomo uses a cookieless approach, so 
it's rarely blocked (and would not need a GDPR-compliance banner).  Even if 
someone wanted to block it, meh.  We'd still be getting enough signal to make 
informed decisions.

--Larry Garfield
