Re: [Analytics] The state of field names in MediaWiki data

Andrew Otto Wed, 10 Dec 2014 13:17:57 -0800

Are you suggesting we buck any ugliness of the xml field names and choose the 
most consistent and elegant ones we can think of?!  :D :D



> On Dec 10, 2014, at 16:07, Dan Andreescu <[email protected]> wrote:
> 
> I think naming things like projects and repositories and folders can be 
> tricky.  I don't think naming schema fields should be very tricky.  Problems 
> with names in schemas usually reflect limitations of the technologies 
> involved.  From your example:
> 
> database has page.page_namespace.  This is mostly for clarity in SQL 
> statements.  The name of the table is duplicated in the name of the field so 
> you can make sense of fields across joins and complicated subqueries.
> 
> JavaScript has wgNamespaceNumber.  Looks like a convention dictated this, but 
> luckily it's fairly isolated from research work so we can ignore such things.
> 
> XML has <page><ns>.  This is the closest to free of idiosyncrasy, but ns 
> should be namespace; it was probably shortened to conserve space in dumps 
> (which can get large).
> 
> Finally, we're considering page_namespace_id.  I disagree, and I can make an 
> objective argument.  We're going to use a JSON object to represent this 
> data.  It should therefore be:
> 
> { page: { namespace: 0 } }
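
To make the nested form concrete, here is a minimal Python sketch that builds
that document from a flat, database-style row.  The function name
page_document and the sample row are made up for illustration; only
page_namespace and the { page: { namespace } } shape come from this thread.

```python
import json


def page_document(db_row):
    """Build the nested document from a flat row keyed by database column names."""
    # The "page_" prefix is dropped because the enclosing "page" object
    # already says which entity the field belongs to.
    return {"page": {"namespace": db_row["page_namespace"]}}


# Hypothetical row, as a database client might return it.
row = {"page_namespace": 0}
doc = page_document(row)
print(json.dumps(doc))
```

The nesting carries the context that the column prefix carries in SQL, so the
field itself can keep the short name.
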
> 
> There is no namespace table, and so the namespace is not an id.  It's a 
> number that means different things based on configuration in different 
> wikis.  If we decide to make a namespace entity with (wiki, number, 
> description) properties, then it would be ok to have:
> 
> { page: { namespace_id: 0 } }
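
A rough Python sketch of what such a namespace entity could look like, so that
namespace_id actually points at something.  Everything here is an assumption
for illustration: the class name, the lookup helper, the wiki names and the
descriptions are invented; only the (wiki, number, description) properties
come from the message above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Namespace:
    """One namespace as configured on one wiki."""
    wiki: str
    number: int
    description: str


# Hypothetical configuration: wiki names and descriptions are invented to
# show the same number meaning different things on different wikis.
NAMESPACES = {
    (ns.wiki, ns.number): ns
    for ns in [
        Namespace("wiki_a", 100, "Portal"),
        Namespace("wiki_b", 100, "Appendix"),
    ]
}


def resolve(wiki, namespace_id):
    """Look up the namespace entity that a (wiki, namespace_id) pair refers to."""
    return NAMESPACES[(wiki, namespace_id)]


page = {"page": {"namespace_id": 100}}
print(resolve("wiki_a", page["page"]["namespace_id"]).description)
print(resolve("wiki_b", page["page"]["namespace_id"]).description)
```

Without the wiki in the key, the number alone is ambiguous, which is the
argument for not calling it an id until such an entity exists.
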
> 
> 
> As a side note, naming matters for our data warehouse as well.  I say we 
> don't limit ourselves with tool idiosyncrasies.  Instead, let's come up with 
> names that make sense.  Veteran researchers can rid themselves of the pain of 
> old names, but new researchers shouldn't have to deal with legacy naming.  
> And hopefully for the veterans out there, the structure of the JSON document 
> is enough to make up for the new approach.
> 
> On Wed, Dec 10, 2014 at 1:22 PM, Aaron Halfaker <[email protected]> wrote:
> Hey folks,
> 
> I was talking to ottomata today about developing a schema for processing 
> revisions in Hadoop.  We came across a deep problem with field names that I'd 
> like to discuss because I want people to be aware of the problem.  
> 
> To explain this, I'll use an example.  Let's say you want to get the 
> namespace of this page:
> https://en.wikipedia.org/wiki/Biology
> 
> In JavaScript, this is represented as the variable wgNamespaceNumber.
> 
> In the database, this is represented as page.page_namespace.
> 
> In the XML database dump, this is represented as the value at <page><ns> or 
> <namespaces><namespace.key>, depending on where you are.
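
For anyone who wants to see two of these names side by side, here is a small
Python sketch that reads the same value from a dump-style fragment and from a
database-style row.  The fragment is a simplified, hypothetical stand-in (real
dumps carry many more elements and an XML namespace declaration), and the row
and function name are likewise made up.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified dump fragment for the Biology page.
DUMP_FRAGMENT = "<page><title>Biology</title><ns>0</ns></page>"

# Hypothetical database row keyed by column name.
DB_ROW = {"page_title": "Biology", "page_namespace": 0}


def namespace_from_dump(xml_text):
    """Return the integer stored at <page><ns> in a dump fragment."""
    page = ET.fromstring(xml_text)
    return int(page.findtext("ns"))


# Same value, two different names depending on the source.
print(namespace_from_dump(DUMP_FRAGMENT), DB_ROW["page_namespace"])
```
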
> 
> Right now, ottomata and I are considering the more descriptive name 
> page_namespace_id since the value of all of these variables/fields is an 
> identifier -- not a name.   I think that this is a *good* name if we consider 
> it in a vacuum, but if we choose it, we'll add yet another name for wiki devs 
> & analysts to be aware of.
> 
> Given the context of this decision, my instinct is to choose the least 
> surprising name.  Since I mostly work with the database, that would mean I'd 
> choose page_namespace.
> 
> This is just one example of such nonsense.  The decisions we make in formats 
> that we produce now can have immeasurable effects on the sanity of others.  I 
> hope that the decisions we make today will minimize such pain, but it's hard 
> to know for sure.  
> 
> -Aaron
> 
> _______________________________________________
> Analytics mailing list
> [email protected] <mailto:[email protected]>
> https://lists.wikimedia.org/mailman/listinfo/analytics 
> <https://lists.wikimedia.org/mailman/listinfo/analytics>

