Good question. I don't know if there is a desired outcome of this conversation. My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.
I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.

-Aaron

On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman <[email protected]> wrote:

> I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new... could someone help summarize the context and what we are trying to solve?
>
> Also, would this go into Research, Eng or Refinery backlog?
>
> Thanks!
>
> On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu <[email protected]> wrote:
>
>>> Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
>>
>> I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).
>>
>>>> So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names. For example, see this snippet from the result of this query:
>>>> http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=jsonfm
>>>>
>>>> "1": {
>>>>     "id": 1,
>>>>     "case": "first-letter",
>>>>     "*": "Discusi\u00f3n",
>>>>     "subpages": "",
>>>>     "canonical": "Talk"
>>>> },
>>
>> Fair enough, namespace_id seems like a good name for a property of a page entity then.
>>
>>>> I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
>>
>> I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in json. If I was new to this world, it also seems more surprising. If I was an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".
>>
>>>> However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.
>>
>> We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
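
To make the "mapping of canonical names" idea from the quoted thread concrete, here is a minimal sketch in Python. The legacy keys are modeled on column names from the MediaWiki page table. Of the canonical names, only namespace_id is proposed in this thread; "title" and the helper functions to_canonical and to_legacy are assumptions made up for illustration, not an agreed schema.

```python
# Hypothetical mapping from legacy database field names to canonical names.
# Only "namespace_id" comes from the thread; "title" is an illustrative guess.
LEGACY_TO_CANONICAL = {
    "page_id": "page_id",
    "page_namespace": "namespace_id",
    "page_title": "title",
}


def to_canonical(row, mapping=LEGACY_TO_CANONICAL):
    """Return a copy of row with legacy keys renamed.

    Keys that are not in the mapping pass through unchanged.
    """
    return {mapping.get(key, key): value for key, value in row.items()}


def to_legacy(row, mapping=LEGACY_TO_CANONICAL):
    """Inverse of to_canonical, for consumers that still expect the old names."""
    reverse = {canonical: legacy for legacy, canonical in mapping.items()}
    return {reverse.get(key, key): value for key, value in row.items()}


legacy_row = {"page_id": 42, "page_namespace": 1, "page_title": "Example"}
canonical_row = to_canonical(legacy_row)
```

Keeping the mapping as plain data means veterans can still look up the old name for any field, while new schemas only ever expose the canonical one.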
_______________________________________________ Analytics mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/analytics
