Are you suggesting we buck any ugliness of the xml field names and choose the most consistent and elegant ones we can think of?! :D :D
> On Dec 10, 2014, at 16:07, Dan Andreescu <[email protected]> wrote:
>
> I think naming things like projects and repositories and folders can be
> tricky. I don't think naming schema fields should be very tricky. Problems
> with names in schemas usually reflect limitations of the technologies
> involved. From your example:
>
> database has page.page_namespace. This is mostly for clarity in SQL
> statements. The name of the table is duplicated in the name of the field so
> you can make sense of fields across joins and complicated subqueries.
>
> javascript has wgNamespaceNumber. Looks like a convention dictated this, but
> luckily it's fairly isolated from research work so we can ignore such things.
>
> XML has <page><ns>. This is the closest to free of idiosyncrasy, but ns
> should be namespace and it probably isn't to conserve space in dumps (which
> can get large)
>
> Finally we're considering page_namespace_id. I disagree and I can make an
> objective argument. We're going to use a json object to represent this data.
> It should therefore be:
>
> { page: { namespace: 0 } }
>
> There is no namespace table, and so the namespace is not an id. It's a
> number that means different things based on configuration in different wikis.
> If we decide to make a namespace entity with (wiki, number, description)
> properties, then it would be ok to have:
>
> { page: { namespace_id: 0 } }
>
>
> As a side note, naming matters for our data warehouse as well. I say we
> don't limit ourselves with tool idiosyncrasies. Instead, let's come up with
> names that make sense. Veteran researchers can rid themselves of the pain of
> old names, but new researchers shouldn't have to deal with legacy naming.
> And hopefully for the veterans out there, the structure of the json document
> is enough to make up for the new approach.
>
> On Wed, Dec 10, 2014 at 1:22 PM, Aaron Halfaker <[email protected]> wrote:
> Hey folks,
>
> I was talking to ottomata today about developing a schema for processing
> revisions in Hadoop. We came across a deep problem with field names that I'd
> like to discuss because I want people to be aware of the problem.
>
> To explain this, I'll use an example. Let's say you want to get the
> namespace of this page:
> https://en.wikipedia.org/wiki/Biology
>
> In javascript, this is represented as the variable wgNamespaceNumber.
>
> In the database, this is represented as page.page_namespace
>
> In the XML database dump, this is represented as the value at <page><ns> or
> <namespaces><namespace.key> depending where you are.
>
> Right now, ottomata and I are considering the more descriptive name
> page_namespace_id since the value of all of these variables/fields is an
> identifier -- not a name. I think that this is a *good* name if we consider
> it in a vacuum, but if we choose it, we'll add yet another name for wiki devs
> & analysts to be aware of.
>
> Given the context of this decision, my instinct is to choose the least
> surprising name. Since I mostly work with the database, that would mean I'd
> choose page_namespace.
>
> This is just one example of such nonsense. The decisions we make in formats
> that we produce now can have immeasurable effects on the sanity of others. I
> hope that the decisions we make today will minimize such pain, but it's hard
> to know for sure.
>
> -Aaron
>
> _______________________________________________
> Analytics mailing list
> [email protected]
> https://lists.wikimedia.org/mailman/listinfo/analytics
