On 4/8/11 9:10 PM, glenn mcdonald wrote:
> I don't think data quality is an amorphous, aesthetic, hopelessly
> subjective topic. Data "beauty" might be subjective, and the same data
> may have different applicability to different tasks, but there are a
> lot of obvious and straightforward ways of thinking about the quality
> of a dataset independent of the particular preferences of individual
> beholders. Here are just some of them:
Glenn,
I (and others) have no issue with data quality; we just understand
(first hand) that when you have masses of data from disparate sources,
you discuss and iterate your way to subjective sanity via constructive
feedback loops. Summarily conflating source data quality with data
access and presentation oriented tools is simply wrong. We all care
about data quality, but nothing in the world nullifies the fact that
"quality" is subjective. Is Excel rendered useless because a list of
countries with obvious errors was presented in a spreadsheet? To an
audience of spreadsheet developers (programmers making a spreadsheet
product) that's irrelevant; to the accounting or marketing department of
the spreadsheet product's customers (actual users doing their jobs) it's
important, but it has nothing to do with the spreadsheet product itself.
The same analogy applies to any DBMS product. The message I keep trying
to relay to you is that you have to separate the parts, i.e., stop
conflating matters in an unnecessarily disruptive way.
Back to the data quality discussion:
Subjectively low quality data can lead to subjectively higher quality
data. Without data, all you have is an empty space. Using any form of
"all or nothing" proposition in a subjective realm is fatally flawed.
How would you address data quality issues in situations where data
producers, data shape, data consumers, and data presentation tools are
all loosely coupled? Bearing in mind your issues with DBpedia and other
datasets from the LOD cloud, are contributions of quality data from you
out of the question re. a virtuous cycle oriented towards subjectively
improved quality?
I've already made it clear to you that DBpedia contributions are
welcome; they trump griping any day, and you would actually be quite
surprised as to what kind of discourse clarity said contributions would
unveil. Thus, why don't you call my bluff by producing and sharing a
"data quality" linkset for the LOD cloud?
Note: the FAO, SUMO, Yago, UMBEL, and OpenCyc communities have all contributed
data to the LOD cloud that enable application of their context lenses to
linked open data spaces like DBpedia. I spend a lot of time behind the
scenes working with a variety of people on the very subject of data
quality, linkset partitioning via named graphs, and conditional
application of inference contexts via the combination of rules and
reasoners. Unfortunately, you are so bent on obliterating the start of
conversations that you don't even recognize different routes to the same
destination.
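The linkset partitioning just mentioned can be sketched in a few lines of Python. This is only an illustration of the technique, not of any actual Virtuoso machinery; the graph names and triples are hypothetical. The point is that each contributed linkset lives in its own named graph, and a consumer's query only sees the graphs that consumer selects:

```python
# Minimal sketch of linkset partitioning via named graphs. Graph IRIs
# and triples below are hypothetical, for illustration only.

from collections import defaultdict

class QuadStore:
    def __init__(self):
        # graph IRI -> set of (subject, predicate, object) triples
        self.graphs = defaultdict(set)

    def add(self, graph, s, p, o):
        self.graphs[graph].add((s, p, o))

    def triples(self, selected_graphs):
        """Union of triples from the selected named graphs only."""
        for g in selected_graphs:
            yield from self.graphs[g]

store = QuadStore()
# Base data in one graph; a community-contributed linkset in another.
store.add("urn:graph:dbpedia",
          "dbpedia:Michael_Jackson", "rdf:type", "dbo:Person")
store.add("urn:graph:umbel-links",
          "dbpedia:Michael_Jackson", "owl:sameAs", "umbel:MichaelJackson")

# One consumer looks at the base data alone; another opts in to the linkset.
base_only = set(store.triples(["urn:graph:dbpedia"]))
with_lens = set(store.triples(["urn:graph:dbpedia", "urn:graph:umbel-links"]))
```

Because the linkset never pollutes the base graph, opting in or out is just a matter of which graphs a given consumer's query ranges over.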
As for reconciling a common Referent for multiple Identifiers in a
Linked Data space comprised of 21 billion+ triples, let's take a look at
the subject: Michael Jackson.
1.
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
-- basic description of 'Michael Jackson' from DBpedia
2.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson
-- list of source named graphs in the host DBMS
3.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=2
-- list of named graphs with triples that reference this subject
4.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=3
-- explicit owl:sameAs relations across the entire DBMS (clicking on
each Identifier will unveil the description graph for the Referent of
said Identifier)
5.
http://lod.openlinksw.com/fct/rdfdesc/usage.vsp?g=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&tp=4
-- use of an InverseFunctionalProperty based rule to generate a fuzzy
list of Identifiers that potentially share the same Referent (click on
each link as per prior step)
6.
http://lod.openlinksw.com/describe/?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes
-- inference context enhanced description of 'Michael Jackson' (this is
a union expansion of all properties across all Identifiers in an
owl:sameAs relation with DBpedia Entity, hence use of paging re.
handling result set size.)
7.
http://lod.openlinksw.com/describe/?url=http%3A%2F%2Fdbpedia.org%2Fresource%2FMichael_Jackson&sas=yes&p=6&lp=7&op=4&prev=&gp=6
-- Page 5 of 8 re. enhanced description of 'Michael Jackson'.
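The mechanics behind steps 4-6 can be sketched in plain Python, under hypothetical identifiers and triples (none of this is the actual DBMS implementation): explicit owl:sameAs statements partition identifiers into equivalence classes (steps 4 and 6's union expansion), while an InverseFunctionalProperty rule (step 5) generates fuzzy co-reference candidates from shared IFP values such as foaf:homepage.

```python
# Sketch only: owl:sameAs smushing and IFP-based co-reference over
# hypothetical identifiers; not the actual Virtuoso implementation.

from collections import defaultdict

def sameas_classes(pairs):
    """Union-find over owl:sameAs pairs: identifier -> equivalence class."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x
    for a, b in pairs:
        parent[find(a)] = find(b)
    classes = defaultdict(set)
    for x in parent:
        classes[find(x)].add(x)
    return {x: classes[find(x)] for x in parent}

def ifp_candidates(triples, ifp):
    """Step 5: subjects sharing a value of an InverseFunctionalProperty
    potentially share the same referent (a fuzzy, rule-generated list)."""
    by_value = defaultdict(set)
    for s, p, o in triples:
        if p == ifp:
            by_value[o].add(s)
    return [ids for ids in by_value.values() if len(ids) > 1]

def union_description(triples, ident, classes):
    """Step 6: union expansion of properties across all co-referent ids."""
    ids = classes.get(ident, {ident})
    return {(p, o) for s, p, o in triples if s in ids}

# Hypothetical identifiers and triples, for illustration only.
explicit = [("dbpedia:Michael_Jackson", "freebase:michael_jackson"),
            ("freebase:michael_jackson", "nyt:jackson_michael")]
cls = sameas_classes(explicit)

triples = [("dbpedia:Michael_Jackson", "foaf:homepage", "http://example.org/mj"),
           ("freebase:michael_jackson", "foaf:homepage", "http://example.org/mj"),
           ("nyt:jackson_michael", "foaf:name", "Michael Jackson")]
candidates = ifp_candidates(triples, "foaf:homepage")
expanded = union_description(triples, "dbpedia:Michael_Jackson", cls)
```

Note the asymmetry the steps above rely on: explicit owl:sameAs links are asserted facts, whereas the IFP-derived list is only a candidate set, which is why step 5 is described as "fuzzy" and left for a human (or a rule a consumer opts into) to confirm.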
Steps 1-7 can provide many insights about data that aid subjective
quality fixes via simple protocols, such as the consumer notifying the
publisher. In the very worst of cases (agreeing to disagree), the
consumer makes a linkset and passes it on to the producer; the producer
reciprocates by uploading the linkset to a named graph and also
publishes a named rule, such that when the consumer next visits they are
able to apply their subjective "context lenses" to the data via
inference rules. All of this happens without imposing 'world views' on
any other consumers of the data, whose needs may vary, subjectively.
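That opt-in quality of the feedback loop can be shown with a tiny sketch, again under made-up names: the consumer-contributed linkset sits in its own graph alongside a named rule, and the rule only fires for consumers who ask for it, so one consumer's lens never alters another's results.

```python
# Sketch of an opt-in "context lens": the linkset is data, the named
# rule is applied per consumer at query time. All names are illustrative.

BASE = {("ex:Jackson", "ex:genre", "ex:Pop")}                 # producer's data
LINKSET = {("ex:Jackson", "owl:sameAs", "other:MJ")}          # consumer's linkset
EXTERNAL = {("other:MJ", "ex:genre", "ex:Soul")}              # third-party data

def describe(subject, apply_lens=False):
    """Return a subject's (predicate, object) pairs; the lens (a named
    rule, in effect) follows owl:sameAs links from the linkset graph."""
    data = BASE | EXTERNAL
    subjects = {subject}
    if apply_lens:  # the consumer opted in: smush co-referent identifiers
        subjects |= {o for s, p, o in LINKSET
                     if s == subject and p == "owl:sameAs"}
    return {(p, o) for s, p, o in data if s in subjects}

plain = describe("ex:Jackson")                     # consumer without the lens
lensed = describe("ex:Jackson", apply_lens=True)   # consumer with the lens
```

The underlying store is identical in both calls; only the per-consumer choice of rule changes what is seen.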
The process I outline above is something we do regularly re. the
datasets hosted in the public instances we oversee. It's why we actually
have a number of demo rules, etc.
Accepting the complexity of subjectivity when audience diversity is
integral to a system != ignoring or dismissing the value of data
quality. I just also happen to have hands-on experience dealing with
this problem and its inherent subjectivity.
To conclude, your quality factors aren't invalid, the real challenge and
question for you is this: how do you cater for this at InterWeb scale
bearing in mind audience heterogeneity?
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen