Good question.  I don't know if there is a desired outcome of this
conversation.  My purpose in starting this thread was to have a discussion
about the problem we face so that we can start thinking better about it.

I don't think I have a task set up for "specifying a schema for revisions
in hadoop".  The closest bit we have on the R&D board is
https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity --
which is the immediate goal of what I'm working on with Andrew right now.
A more long-term goal would be to solve similar problems more easily in the
future.

-Aaron

On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman <[email protected]>
wrote:

> I'd like to put a placeholder in Phab or Trello for this work, but please
> help me out because I am still new... could someone help summarize the
> context and what we are trying to solve?
>
> Also, would this go into Research, Eng or Refinery backlog?
>
> Thanks!
>
> On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu <[email protected]>
> wrote:
>
>> Bikeshed indeed -- this seems to be a project that could soak up a lot of
>>> time. I'm with Aaron -- let's be consistent with the principle of least
>>> surprise and use an existing identifier. The database seems as good a place
>>> to start as any.
>>>
>>
>> I disagree that this is bikeshedding.  The reason people look back after
>> a year at a project and go "yuck, wish we named those things differently"
>> is precisely because this type of effort is incorrectly labeled as
>> bikeshedding.  We are *not* talking about a bike shed.  We're talking
>> about a schema that will hopefully serve hundreds or thousands of
>> researchers and our own growing team (I'm considering both Aaron's revision
>> schema and the data warehouse schema).
>>
>>
>>> So, I'm not sure that is necessary for the term "identifier" which I
>>>> assume that "id" abbreviates.  Regardless it seems clear that these numbers
>>>> are thought of as primary identifiers of a namespace that can otherwise
>>>> have many names.  For example, see this snippet from the result of this
>>>> query:
>>>> http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=jsonfm
>>>>
>>>> "1": {
>>>>                 "id": 1,
>>>>                 "case": "first-letter",
>>>>                 "*": "Discusi\u00f3n",
>>>>                 "subpages": "",
>>>>                 "canonical": "Talk"
>>>>
>>>> },
>>>>
>>>
>> Fair enough, namespace_id seems like a good name for a property of a page
>> entity then.
>>
>>
>>> I don't see us getting rid of legacy naming right now.  I don't see how
>>>> adding a new name helps anyone -- veteran or newbie.
>>>>
>>>
>> I disagree that we have to care at all about legacy names.  I disagree
>> that the principle of least surprise leads one to prefer database names.
>> To me, that's more surprising because database conventions have no place in
>> json.  If I was new to this world, it also seems more surprising.  If I was
>> an existing user, I don't think I would be at all surprised as long as the
>> names were clear and the schemas well documented.  This page_namespace_id
>> is a bit of a red herring because we have harder things to tackle like
>> "restrictions".
>>
>>
>>> However, if we were to develop a mapping of canonical names and pursue
>>>> that from here forward, we might be able to move beyond the old names for
>>>> the most important data sources in a few years.  However, I'm skeptical
>>>> that we'll ever be able to change any production DB field names.
>>>>
>>>
>> We need not be tied to the production db names.  The data warehouse
>> effort is trying to transform a confusing schema riddled with
>> idiosyncrasies into a clean, easy to understand, and easy to work with,
>> dimensional model.  In the process, we are also trying to capture changes
>> to objects over time so we are greatly expanding the usefulness of the
>> database.  Good naming matters and we should take our time.
>>
>> _______________________________________________
>> Analytics mailing list
>> [email protected]
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>>