I appreciate Dan's passion and tenacity. I barely know what y'all are talking about, but I can tell that I support his commitment to good naming. Thanks, Dan!
For everything else, I created tickets to capture what you are working on now and what should go in the backlog. Feel free to correct or wordsmith any errors.

For Aaron:
1. added to in progress on R & D Trello board: https://trello.com/c/Yuki0FBE/574-specifying-a-schema-for-revisions-in-hadoop
2. added inelegantly worded card to new ideas lane of R&D backlog board: https://trello.com/c/TocTUcD7/206-solve-problems-similiar-to-ones-surfaced-when-developing-a-schema-for-processing-revisions-in-hadoop-discovering-namespace-issue

For Andrew:
3. productionizing xmldump -> avro jobs: https://phabricator.wikimedia.org/T78404
4. for experimentation part, I created this and called it out as spike: https://phabricator.wikimedia.org/T78405

On Thu, Dec 11, 2014 at 2:48 PM, Andrew Otto <[email protected]> wrote:

> Right now, I am working on experimenting with importing Revision history from XML dumps into an easier to use format, Avro. This new format requires a schema definition. We are considering the pros and cons of sticking close to older schemas, or creating new cleaner ones. For the most part these are just discussions around field names, but there are also times when flattening fields makes more sense (e.g. redirect_title vs redirect.title, since <redirect title=“blah”/> is how the field looks in XML). Data structure changes aren’t out of the question.
>
> There isn’t a card, because on my end this is still experimentation. I’m trying to come up with something that Aaron can use easily, so my stuff has to work with his code. Hence the collaboration.
>
> But! If we settle on this, then I will create cards for productionizing xmldump -> avro jobs. Those will certainly cover this issue.
>
> Also: YEAH FOR GOOD NAMING! GO DAN! Don’t listen to those bikeshedhaters!
>
> -Ao
>
>
> On Dec 11, 2014, at 17:23, Aaron Halfaker <[email protected]> wrote:
>
> Good question. I don't know if there is a desired outcome of this conversation.
> My purpose in starting this thread was to have a discussion about the problem we face so that we can start thinking better about it.
>
> I don't think I have a task set up for "specifying a schema for revisions in hadoop". The closest bit we have on the R&D board is https://trello.com/c/3Uwlwoxk/548-q2-measuring-quality-productivity -- which is the immediate goal of what I'm working on with Andrew right now. A more long-term goal would be to solve similar problems more easily in the future.
>
> -Aaron
>
> On Thu, Dec 11, 2014 at 2:13 PM, Grace Gellerman <[email protected]> wrote:
>
>> I'd like to put a placeholder in Phab or Trello for this work, but please help me out because I am still new... could someone help summarize the context and what we are trying to solve?
>>
>> Also, would this go into Research, Eng or Refinery backlog?
>>
>> Thanks!
>>
>> On Thu, Dec 11, 2014 at 1:52 PM, Dan Andreescu <[email protected]> wrote:
>>
>>>> Bikeshed indeed -- this seems to be a project that could soak up a lot of time. I'm with Aaron -- let's be consistent with the principle of least surprise and use an existing identifier. The database seems as good a place to start as any.
>>>
>>> I disagree that this is bikeshedding. The reason people look back after a year at a project and go "yuck, wish we named those things differently" is precisely because this type of effort is incorrectly labeled as bikeshedding. We are *not* talking about a bike shed. We're talking about a schema that will hopefully serve hundreds or thousands of researchers and our own growing team (I'm considering both Aaron's revision schema and the data warehouse schema).
>>>
>>>>> So, I'm not sure that is necessary for the term "identifier" which I assume that "id" abbreviates. Regardless it seems clear that these numbers are thought of as primary identifiers of a namespace that can otherwise have many names.
>>>>> For example, see this snippet from the result of this query:
>>>>> http://es.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=namespaces|namespacealiases&format=jsonfm
>>>>>
>>>>> "1": {
>>>>>     "id": 1,
>>>>>     "case": "first-letter",
>>>>>     "*": "Discusi\u00f3n",
>>>>>     "subpages": "",
>>>>>     "canonical": "Talk"
>>>>> },
>>>
>>> Fair enough, namespace_id seems like a good name for a property of a page entity then.
>>>
>>>>> I don't see us getting rid of legacy naming right now. I don't see how adding a new name helps anyone -- veteran or newbie.
>>>
>>> I disagree that we have to care at all about legacy names. I disagree that the principle of least surprise leads one to prefer database names. To me, that's more surprising because database conventions have no place in json. If I was new to this world, it also seems more surprising. If I was an existing user, I don't think I would be at all surprised as long as the names were clear and the schemas well documented. This page_namespace_id is a bit of a red herring because we have harder things to tackle like "restrictions".
>>>
>>>>> However, if we were to develop a mapping of canonical names and pursue that from here forward, we might be able to move beyond the old names for the most important data sources in a few years. However, I'm skeptical that we'll ever be able to change any production DB field names.
>>>
>>> We need not be tied to the production db names. The data warehouse effort is trying to transform a confusing schema riddled with idiosyncrasies into a clean, easy to understand, and easy to work with, dimensional model. In the process, we are also trying to capture changes to objects over time so we are greatly expanding the usefulness of the database. Good naming matters and we should take our time.
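
To make the naming discussion above concrete, here is a minimal Python sketch, not anyone's actual job code. It assumes a hypothetical mapping from the legacy column name page_namespace to namespace_id, and it flattens nested fields with an underscore, as in redirect_title vs redirect.title. The names CANONICAL_NAMES, flatten, and canonicalize are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed, illustrative mapping from legacy names to canonical ones.
# Only the namespace_id target name comes from the thread.
CANONICAL_NAMES = {"page_namespace": "namespace_id"}


def flatten(record, parent="", sep="_"):
    """Flatten nested dicts, e.g. {"redirect": {"title": "x"}} -> {"redirect_title": "x"}."""
    flat = {}
    for key, value in record.items():
        name = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def canonicalize(record):
    """Rename legacy field names via CANONICAL_NAMES; other names pass through."""
    return {CANONICAL_NAMES.get(key, key): value for key, value in record.items()}


# The <redirect title="blah"/> element from the thread, turned into a nested record.
element = ET.fromstring('<redirect title="blah"/>')
nested = {"page_namespace": 1, element.tag: dict(element.attrib)}

row = canonicalize(flatten(nested))
print(row)
```

Keeping the rename table separate from the flattening step means the canonical names can be debated and changed without touching the code that reshapes records.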
_______________________________________________
Analytics mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/analytics
