Re: [Analytics] The state of field names in MediaWiki data

Grace Gellerman Fri, 12 Dec 2014 09:22:56 -0800

I appreciate Dan's passion and tenacity.  I barely know what y'all are
talking about, but I can tell that I support his commitment to good
naming.  Thanks, Dan!


For everything else, I created tickets to capture what you are working on
now and what should go in the backlog.

Feel free to correct or wordsmith any errors.

For Aaron:

1. added to in progress on R & D Trello board:

https://trello.com/c/Yuki0FBE/574-specifying-a-schema-for-revisions-in-hadoop

2. added inelegantly worded card to new ideas lane of R&D backlog board:

https://trello.com/c/TocTUcD7/206-solve-problems-similiar-to-ones-surfaced-when-developing-a-schema-for-processing-revisions-in-hadoop-discovering-namespace-issue

For Andrew:

3. productionizing xmldump -> avro jobs:

https://phabricator.wikimedia.org/T78404

4. for the experimentation part, I created this and called it out as a spike:

https://phabricator.wikimedia.org/T78405





On Thu, Dec 11, 2014 at 2:48 PM, Andrew Otto <[email protected]> wrote:

> Right now, I am working on experimenting with importing Revision history
> from XML dumps into an easier to use format, Avro.  This new format
> requires a schema definition.  We are considering the pros and cons of
> sticking close to older schemas, or creating new cleaner ones.  For the
> most part these are just discussions around field names, but there are also
> times when flattening fields makes more sense (e.g. redirect_title vs
> redirect.title, since <redirect title="blah"/> is how the field looks in
> XML).  Data structure changes aren’t out of the question.
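
The flattening trade-off described above can be pictured with a small
sketch. It is only an illustration under stated assumptions, not the schema
under discussion: the XML fragment, the helpers nested_record and
flat_record, and FLAT_SCHEMA are hypothetical names made up for this
example, and real dump files carry an XML namespace that the fragment omits.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like a <page> element in an XML dump.
# Assumption: the XML namespace found in real dumps is left out for brevity.
PAGE_XML = '<page><title>Foo</title><redirect title="Bar"/></page>'

# Illustrative Avro-style record schema for the flat form, written as a plain
# dict. The union ["null", "string"] makes the field nullable.
FLAT_SCHEMA = {
    "type": "record",
    "name": "Page",
    "fields": [
        {"name": "title", "type": "string"},
        {"name": "redirect_title", "type": ["null", "string"], "default": None},
    ],
}


def nested_record(page):
    """Mirror the XML: the redirect becomes a sub-record (redirect.title)."""
    redirect = page.find("redirect")
    return {
        "title": page.findtext("title"),
        "redirect": None if redirect is None else {"title": redirect.get("title")},
    }


def flat_record(page):
    """Flatten the attribute into one top-level field (redirect_title)."""
    redirect = page.find("redirect")
    return {
        "title": page.findtext("title"),
        "redirect_title": None if redirect is None else redirect.get("title"),
    }


page = ET.fromstring(PAGE_XML)
nested = nested_record(page)
flat = flat_record(page)
```

The flat form gives consumers a single nullable top-level field to check,
while the nested form stays closer to the shape of the XML.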
>
> There isn’t a card, because on my end this is still experimentation.  I’m
> trying to come up with something that Aaron can use easily, so my stuff has
> to work with his code.  Hence the collaboration.
>
> But!  If we settle on this, then I will create cards for productionizing
> xmldump -> avro jobs.  Those will certainly cover this issue.
>
> Also:  YEAH FOR GOOD NAMING! GO DAN!  Don’t listen to those bikeshedhaters!
>
> -Ao
>
>
> On Dec 11, 2014, at 17:23, Aaron Halfaker <[email protected]> wrote:
>
> Good question.  I don't know if there is a desired outcome of this
> conversation.  My purpose in starting this thread was to have a discussion
> about the problem we face so that we can start thinking better about it.
>
> I don't think I have a task set up for "specifying a schema for revisions
> in hadoop".  The closest bit we have on the R&D board is
> https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity --
> which is the immediate goal of what I'm working on with Andrew right now.
> A more long-term goal would be to solve similar problems more easily in the
> future.
>
> -Aaron
>
> On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman <[email protected]> wrote:
>
>> I'd like to put a placeholder in Phab or Trello for this work, but please
>> help me out because I am still new.  Could someone help summarize the
>> context and what we are trying to solve?
>>
>> Also, would this go into Research, Eng or Refinery backlog?
>>
>> Thanks!
>>
>> On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu <[email protected]>
>> wrote:
>>
>>>> Bikeshed indeed -- this seems to be a project that could soak up a lot
>>>> of time. I'm with Aaron -- let's be consistent with the principle of least
>>>> surprise and use an existing identifier. The database seems as good a place
>>>> to start as any.
>>>>
>>>
>>> I disagree that this is bikeshedding.  The reason people look back after
>>> a year at a project and go "yuck, wish we named those things differently"
>>> is precisely because this type of effort is incorrectly labeled as
>>> bikeshedding.  We are *not* talking about a bike shed.  We're talking
>>> about a schema that will hopefully serve hundreds or thousands of
>>> researchers and our own growing team (I'm considering both Aaron's revision
>>> schema and the data warehouse schema).
>>>
>>>
>>>>> So, I'm not sure that is necessary for the term "identifier" which I
>>>>> assume that "id" abbreviates.  Regardless it seems clear that these
>>>>> numbers are thought of as primary identifiers of a namespace that can
>>>>> otherwise have many names.  For example, see this snippet from the
>>>>> result of this query:
>>>>> http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=jsonfm
>>>>>
>>>>> "1": {
>>>>>     "id": 1,
>>>>>     "case": "first-letter",
>>>>>     "*": "Discusi\u00f3n",
>>>>>     "subpages": "",
>>>>>     "canonical": "Talk"
>>>>> },
>>>>>
>>>>
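
The idea that one numeric id stands behind several names can be sketched as
below. This is a minimal illustration that assumes only the single entry
quoted in the snippet above; the helper name names_to_id is hypothetical,
and treating the "*" key as the local name is an assumption drawn from that
snippet.

```python
# The namespace entry quoted above, keyed by its id as in the snippet.
SITEINFO_NAMESPACES = {
    "1": {
        "id": 1,
        "case": "first-letter",
        "*": "Discusi\u00f3n",
        "subpages": "",
        "canonical": "Talk",
    },
}


def names_to_id(namespaces):
    """Map every non-empty name of a namespace (local or canonical) to its id."""
    lookup = {}
    for entry in namespaces.values():
        for key in ("*", "canonical"):
            name = entry.get(key)
            if name:
                lookup[name] = entry["id"]
    return lookup


lookup = names_to_id(SITEINFO_NAMESPACES)
```

Both the local name and the canonical name resolve to the same id, which is
why the number works as the stable identifier.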
>>> Fair enough, namespace_id seems like a good name for a property of a
>>> page entity then.
>>>
>>>
>>>>> I don't see us getting rid of legacy naming right now.  I don't see how
>>>>> adding a new name helps anyone -- veteran or newbie.
>>>>>
>>>>
>>> I disagree that we have to care at all about legacy names.  I disagree
>>> that the principle of least surprise leads one to prefer database names.
>>> To me, that's more surprising because database conventions have no place in
>>> json.  If I was new to this world, it also seems more surprising.  If I was
>>> an existing user, I don't think I would be at all surprised as long as the
>>> names were clear and the schemas well documented.  This page_namespace_id
>>> is a bit of a red herring because we have harder things to tackle like
>>> "restrictions".
>>>
>>>
>>>>> However, if we were to develop a mapping of canonical names and pursue
>>>>> that from here forward, we might be able to move beyond the old names for
>>>>> the most important data sources in a few years.  However, I'm skeptical
>>>>> that we'll ever be able to change any production DB field names.
>>>>>
>>>>
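
A mapping of canonical names like the one suggested above might look like
the following sketch. The mapping contents and the helper rename_fields are
assumptions for illustration only; namespace_id is the one canonical name
actually proposed in this thread, and no agreed mapping exists here.

```python
# Hypothetical mapping from legacy database field names to canonical names.
# Fields that are not listed keep their legacy name.
CANONICAL_NAMES = {
    "page_namespace": "namespace_id",
}


def rename_fields(row, mapping):
    """Return a copy of row with legacy field names replaced by canonical ones."""
    return {mapping.get(name, name): value for name, value in row.items()}


# Illustrative row; the values are made up.
legacy_row = {"page_id": 42, "page_namespace": 1, "page_title": "Foo"}
canonical_row = rename_fields(legacy_row, CANONICAL_NAMES)
```

Keeping the mapping as data means derived datasets can adopt new names while
the production field names stay unchanged.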
>>> We need not be tied to the production db names.  The data warehouse
>>> effort is trying to transform a confusing schema riddled with
>>> idiosyncrasies into a clean, easy to understand, and easy to work with,
>>> dimensional model.  In the process, we are also trying to capture changes
>>> to objects over time so we are greatly expanding the usefulness of the
>>> database.  Good naming matters and we should take our time.
>>>
>>> _______________________________________________
>>> Analytics mailing list
>>> [email protected]
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>>>

