Re: Take2: 15 Ways to Think About Data Quality (Just for a Start)

Kingsley Idehen Wed, 20 Apr 2011 13:17:10 -0700

On 4/15/11 9:47 AM, glenn mcdonald wrote:

This reminds me to come back to the point about what I initially
called Directionality, and Dave improved to Modeling Consistency.


Dave is right, I think, that in terms of data quality, it is
consistency that matters, not directionality. That is, as long as we
know that a president was involved in a presidency, it doesn't matter
whether we know that because the president linked to the presidency,
or the presidency linked to the president. In fact, in a relational
database the president and the presidency and the link might even be
in three separate tables. From a data-mathematical perspective, it
doesn't matter. All of these are ways of expressing the same logical
construct. We just want it to be done the same way for all
presidents/presidencies/links.

But although directionality is immaterial for data *quality*, it
matters quite a bit for the usability of the system in which the data
reaches people. We know, for example, that in the real world
presidents have presidencies, and vice versa. But think about what it
takes to find out whether this information is represented in a given
dataset:

- In a classic SQL-style relational database we probably have to just
know the schema, as there's usually no exploratory way to find this
kind of thing out. The RDBMS formalism doesn't usually represent the
relationships between tables. You not only have to know it from
external sources, but you have to restate it in each SQL join-query.
This may be acceptable in a database with only a few tables, where the
field-headings are kept consistent by convention, but it's extremely
problematic when you're trying to combine formerly-separate datasets
into large ones with multiple dimensions and purposes. If the LOD
cloud were in relational tables, it would be awful. Arguably the main
point of the cloud is to get the data out of relational tables (where
most of it probably originates) into a graph where the connections are
actually represented instead of implied.

Sorta. There is more to it re. Linked Data though. For instance, the object IDs resolve to actual object representations via time-tested de-reference (*) and address-of (&) style operator patterns, via HTTP URI based Names and HTTP URI based Data Access Addresses (URLs), respectively.

- But even in RDF, directionality poses a significant discovery
problem.


Yes, assuming a single document with RDF content.

In a minimal graph (let's say "minimal graph" means that each
relationship is asserted in only one direction, so there's no
relationship redundancy), you can't actually explore the data
navigationally. You can't go to a single known point of interest, like
a given president, and explore to find out everything the data holds
and how it connects.

Well, this is an aspect of most of the LOD cloud cache demonstrations I put out. Given a Text Pattern, Entity Label, or URI, place me somewhere so that I can disambiguate my way to what I seek by navigating across isA and other relations that constitute the underlying Linked Data graph.


Thus, in our case it could be:

1. Pattern: "Obama"
2. Pattern: "Obama" in the Entity label
3. Actual known ID (URI) for a given Entity.

You can explore the *outward* relationships from
any given point, but to find out about the *inward* relationships you
have to keep doing new queries over the entire dataset.
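
To make the outward/inward asymmetry concrete, here is a minimal Python sketch. It assumes a toy in-memory triple list indexed only by subject; the `ex:` names and the helper functions are hypothetical and are not taken from any real dataset or store.

```python
from collections import defaultdict

# Hypothetical minimal graph: each relationship is asserted in one direction only.
TRIPLES = [
    ("ex:Obama", "ex:heldPresidency", "ex:Presidency44"),
    ("ex:Obama", "rdf:type", "ex:President"),
    ("ex:Presidency44", "ex:precededBy", "ex:Presidency43"),
    ("ex:Bush", "ex:heldPresidency", "ex:Presidency43"),
]


def index_by_subject(triples):
    """Build the only index a subject-oriented minimal graph gives us."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[s].append((p, o))
    return index


def outward(index, node):
    """Outward relationships: a single lookup from the node itself."""
    return index.get(node, [])


def inward(triples, node):
    """Inward relationships: a scan over every triple in the dataset."""
    return [(s, p) for s, p, o in triples if o == node]


index = index_by_subject(TRIPLES)
print(outward(index, "ex:Presidency44"))
print(inward(TRIPLES, "ex:Presidency44"))
```

Starting from the presidency, the outward lookup never mentions the president; only the full scan in `inward` finds that link.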

Yes, and not only that, you need to allow the user to page through the data using scrollable cursoring techniques, an old DBMS technique for handling voluminous result sets. Thus, you should be able to go to specific pages or a specific position, and then bookmark said position for future reference, etc.
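
A rough illustration of paging with a bookmarkable position, in Python. This is an offset-based stand-in over an in-memory list, not how any particular DBMS implements scrollable cursors; `fetch_page`, its parameters, and the sample rows are hypothetical.

```python
def fetch_page(rows, position=0, page_size=3):
    """Return one page of rows plus a bookmark for the next page.

    The bookmark is just the position of the next unread row,
    or None when the result set is exhausted.
    """
    page = rows[position:position + page_size]
    next_position = position + len(page)
    bookmark = next_position if next_position < len(rows) else None
    return page, bookmark


# Hypothetical result set standing in for a large query result.
results = [f"ex:entity{i}" for i in range(8)]

first_page, bookmark = fetch_page(results)
# The bookmark can be stored and used later to resume at the same position.
second_page, bookmark = fetch_page(results, position=bookmark)
print(first_page)
print(second_page)
```

The consumer only ever holds one page at a time, which is the property that keeps a browser-based client usable over very large result sets.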

The same basic
issue applies to an XML representation of the data as a tree: you can
squirrel your way down, but only in the direction the original modeler
decided was "down". If you need a different direction, you have to
hire a hypersquirrel.

Yes, but XML is a rooted graph. Thus, XML ingested into a graph store results in a relational graph. The important thing is the Entity ID handling post ingestion.

- Of course, most RDF-presenting systems recognize this as a usability
problem, and address it by turning the minimal graph into a redundant
graph for UI purposes.

Not necessarily redundant when persisted and indexed in a relational property graph model DBMS. As per comment above, it ultimately boils down to the semantics expressed in the resulting graph. XML data sources as foundation for Linked Data graphs is something that underlies our sponger middleware and various cartridges. The cartridge effort is where the modeling occurs based on schema study and eventual remapping.

Thus in a data-browser UI you usually see, for
a given node, lists of both outward and inward relationships. This is
better, but if this abstraction is done at the UI layer, you still
lose it once you drop down into the SPARQL realm.

The SPARQL realm should be about producing results for different consumers. If you are constructing a view for a user where graph position placement is one of the UX goals, then surfacing the Linked Data URIs in the result set works fine. Again, it's one of the things I've been demonstrating since our initial ODE browser and iSPARQL QBE, both of which date back to 2007. What's newer is a set of interfaces that handle cursor-based navigation over massive datasets stored in the Virtuoso DBMS. The browser won't explode, in a nutshell.

  This makes the
SPARQL queries harder to write, because you can't write them the way
you logically think about the question, you have to write them the way
the data thinks about the question.


Depends on the writer :-)

It's also why we have a SPARQL link in place to show you what's being generated when you start with text patterns in our faceted navigation UI.

And this skew from real logic to
directional logic can make them *much* harder to understand or
maintain, because the directionality obscures the purpose and reduces
the self-documenting nature of the query.


Yes.


All of this is *much* better, in usability terms, if the data is
redundantly, bi-directionally connected all the way down to the level
of abstraction at which you're working. Now you can explore to figure
out what's there, and you can write your queries in the way that makes
the most human sense. The artificial skew between the logical
structure and the representational structure has been removed. This is
perfectly possible in an RDF-based system, of course, if the software
either generates or infers the missing inverses.
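
As a sketch of what "generating the missing inverses" could look like, the Python below materializes an inverse triple for every predicate that has a declared inverse. The `INVERSE_OF` mapping and all `ex:` names are hypothetical, every object is assumed to be a node rather than a literal, and this is not a description of how any particular system does it.

```python
# Hypothetical mapping from a predicate to its inverse predicate.
INVERSE_OF = {
    "ex:heldPresidency": "ex:heldBy",
    "ex:precededBy": "ex:followedBy",
}


def with_inverses(triples, inverse_of):
    """Return the triples plus one inverse triple per known inverse predicate.

    Predicates without a declared inverse are passed through unchanged.
    """
    result = set(triples)
    for s, p, o in triples:
        inverse = inverse_of.get(p)
        if inverse is not None:
            result.add((o, inverse, s))
    return result


minimal = [
    ("ex:Obama", "ex:heldPresidency", "ex:Presidency44"),
    ("ex:Presidency44", "ex:precededBy", "ex:Presidency43"),
]

for triple in sorted(with_inverses(minimal, INVERSE_OF)):
    print(triple)
```

With the inverses in place, the link between president and presidency can be followed, and queried, from either end, at the cost of storing roughly twice as many triples for those predicates.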


Yes, and that's what we do. And it works at massive scale.

We incur extra
machine overhead to reduce the human cognitive burden. I contend this
should be considered a nearly-mandatory best-practice for linked data,
and that propagating inverses around the LOD cloud ought to be one of
the things that makes the LOD cloud *a thing*, rather than just a
collection of logical silos.

Yes, and that's what we believe too, and have executed on that via the LOD cloud cache we maintain.

On a related note, re. data quality matters in general, some excerpts from a 2009 post about data quality [1]:


“You don’t talk about data quality.”

No, wait—that’s The First Rule of Poor Quality Data.

The First Law of Data Quality:

“Data is either being used or waiting to be used—or wasting storage and support.”

Although understanding your data is essential to using it effectively and improving its quality, as Thomas Redman explains, “it is a waste of effort to improve the quality of data no one ever uses.”

In the context of Linked Data, surmounting the essence of the above has been our focal point from day one. The data has to be out there for quality issues to surface, albeit subjectively.



Link:

1. http://www.dataroundtable.com/?p=1458

--

Regards,

Kingsley Idehen 
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen
