>
> Bikeshed indeed -- this seems to be a project that could soak up a lot of
> time. I'm with Aaron -- let's be consistent with the principle of least
> surprise and use an existing identifier. The database seems as good a place
> to start as any.
>

I disagree that this is bikeshedding.  The reason people look back after a
year at a project and go "yuck, wish we named those things differently" is
precisely because this type of effort is incorrectly labeled as
bikeshedding.  We are *not* talking about a bike shed.  We're talking
about a schema that will hopefully serve hundreds or thousands of
researchers and our own growing team (I'm considering both Aaron's revision
schema and the data warehouse schema).


> So, I'm not sure that is necessary for the term "identifier" which I
>> assume that "id" abbreviates.  Regardless it seems clear that these numbers
>> are thought of as primary identifiers of a namespace that can otherwise
>> have many names.  For example, see this snippet from the result of this
>> query:
>> http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=jsonfm
>>
>> "1": {
>>                 "id": 1,
>>                 "case": "first-letter",
>>                 "*": "Discusi\u00f3n",
>>                 "subpages": "",
>>                 "canonical": "Talk"
>>
>> },
>>
>
Fair enough, namespace_id seems like a good name for a property of a page
entity then.
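
To illustrate why the id makes a good key, here is a minimal sketch in
Python that parses the namespace entry quoted above.  The output field names
(namespace_id, local_name, canonical_name) are only my assumptions for the
sake of the example, not an agreed schema.

```python
import json

# Namespace entry copied from the quoted siteinfo response (es.wikipedia.org).
raw = r'''
{"1": {"id": 1, "case": "first-letter", "*": "Discusi\u00f3n",
       "subpages": "", "canonical": "Talk"}}
'''
namespaces = json.loads(raw)


def namespace_names(entry):
    """Pair the stable numeric id with the names that vary per wiki.

    The keys of the returned dict are hypothetical names for illustration.
    """
    return {
        "namespace_id": entry["id"],           # same number on every wiki
        "local_name": entry["*"],              # localized name on this wiki
        "canonical_name": entry["canonical"],  # English canonical name
    }


talk = namespace_names(namespaces["1"])
```

The id is the one value that does not change when the names do, which is
why it works as the property on a page entity.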


> I don't see us getting rid of legacy naming right now.  I don't see how
>> adding a new name helps anyone -- veteran or newbie.
>>
>
I disagree that we have to care at all about legacy names.  I disagree that
the principle of least surprise leads one to prefer database names.  To me,
that's more surprising because database conventions have no place in JSON.
If I were new to this world, database names would surprise me even more.
If I were an existing user, I don't think I would be at all surprised as long as the
names were clear and the schemas well documented.  This page_namespace_id
is a bit of a red herring because we have harder things to tackle like
"restrictions".


> However, if we were to develop a mapping of canonical names and pursue
>> that from here forward, we might be able to move beyond the old names for
>> the most important data sources in a few years.  However, I'm skeptical
>> that we'll ever be able to change any production DB field names.
>>
>
We need not be tied to the production DB names.  The data warehouse effort
is trying to transform a confusing schema riddled with idiosyncrasies into
a clean dimensional model that is easy to understand and easy to work with.
In the process, we are also trying to capture changes to objects over time,
which greatly expands the usefulness of the database.  Good naming
matters and we should take our time.
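
As a rough sketch of what decoupling from the production names could look
like, the Python below renames legacy columns through an explicit mapping.
The left-hand names are columns of the MediaWiki page table.  Of the
right-hand names, only namespace_id comes from this thread; the rest are
placeholders I made up, as is the rename_fields helper.

```python
# Hypothetical mapping from legacy production column names to clearer names.
# Only "namespace_id" was discussed in this thread; the other target names
# are illustrative assumptions, not a proposed schema.
LEGACY_TO_CANONICAL = {
    "page_id": "page_id",
    "page_namespace": "namespace_id",
    "page_title": "page_title",
    "page_is_redirect": "is_redirect",
}


def rename_fields(row, mapping=LEGACY_TO_CANONICAL):
    """Return a copy of row with legacy keys replaced by canonical ones.

    Keys without an entry in the mapping are kept unchanged, so no field
    is silently dropped.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


legacy_row = {"page_id": 42, "page_namespace": 1, "page_title": "Example"}
clean_row = rename_fields(legacy_row)
```

Keeping the mapping in one documented place means existing users can still
look up the old name for any new field.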
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
