Leigh Dodds wrote:
Hi,
On Sunday, March 28, 2010, Kingsley Idehen <kide...@openlinksw.com> wrote:
All,
A very nice data cleansing tool from David and Co. at Freebase.
Yes, it looks very nice. Am looking forward to working with it.
CSVs are clearly the dominant data format in the structured open data
realm. This tool deals with ETL very well. Of course, for those who
appreciate OWL, a lot of what's demonstrated in this demo is also
achievable via "context rules".
Can you (or others) expand on that?
Much of the power in the demo seemed to me to be in the faceting,
scripting of cleansing, analysis of value spaces, etc.
I'd be interested to know how OWL could be applied here.
Cheers,
L.
Leigh,
OWL comes in post load of the data into the Quad Store (clean or dirty).
Note, this demo is based on Literal values cleansing. When you have data
object identifiers in play you aren't confined to joining data via
Literal Values (key difference between RDBMS realm and RDF and other
Graph Model realms).
1. Co-reference - via owl:sameAs assertions
2. Dirty Data - use of procedure functions and inverse functional
properties
3. Units of Measurement - leveraging locale prowess of HTTP re. ability
to identify locale of user agents combined with TCN QoS algorithms
(which can be part of SPARQL as we've done re. Virtuoso)
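To make points 1 and 2 concrete, here is a minimal sketch in plain Python (no triple store involved) of how co-references might be derived. It assumes that `foaf:mbox` is treated as an inverse functional property and that a simple strip-and-lowercase step stands in for a cleansing function; the `ex:` identifiers, the mailbox values, and the helper name `coreference_groups` are all made up for illustration and are not taken from Virtuoso or from the demo.

```python
# Triples are plain (subject, predicate, object) tuples; every URI is hypothetical.
SAME_AS = "owl:sameAs"
# Assumption: foaf:mbox is declared inverse functional, so two subjects
# sharing the same mailbox are taken to denote the same entity.
INVERSE_FUNCTIONAL = {"foaf:mbox"}

triples = [
    ("ex:a/timbl", "foaf:mbox", "mailto:timbl@example.org"),
    ("ex:b/person42", "foaf:mbox", " MAILTO:TimBL@example.org"),  # dirty literal
    ("ex:b/person42", SAME_AS, "ex:c/TimBL"),
    ("ex:a/kidehen", "foaf:mbox", "mailto:kidehen@example.org"),
]


def coreference_groups(triples):
    """Group identifiers that co-refer, using a small union-find structure."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    seen = {}  # (property, cleaned value) -> first subject seen with it
    for s, p, o in triples:
        find(s)  # register every subject, even if it never merges
        if p == SAME_AS:
            union(s, o)  # explicit co-reference assertion
        elif p in INVERSE_FUNCTIONAL:
            cleaned = o.strip().lower()  # stand-in for a cleansing function
            first = seen.setdefault((p, cleaned), s)
            union(first, s)  # same key value implies same entity

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(g) for g in groups.values())


groups = coreference_groups(triples)
```

With the sample data, the three TimBL identifiers end up in one group even though only one explicit `owl:sameAs` assertion exists; the rest comes from the shared (cleaned) mailbox value.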
You can make rules that incorporate all of the above; you can even do so
with SPARQL (plus function/magic predicates) as the Rules Language for
constrained forward-chaining in more extreme cases.
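As a rough illustration of what rule-driven forward-chaining means, the sketch below applies two hand-written rules to a set of triples until nothing new is inferred. It is plain Python, not SPARQL and not Virtuoso's rule machinery; "constrained" is interpreted here simply as a cap on the number of rounds, which is an assumption, and the rule functions, the `ex:` names, and `max_rounds` are hypothetical.

```python
def rule_equivalent_property(facts):
    """If p1 owl:equivalentProperty p2 and s p1 o, infer s p2 o (both directions)."""
    pairs = {(s, o) for s, p, o in facts if p == "owl:equivalentProperty"}
    pairs |= {(b, a) for a, b in pairs}
    return {(s, p2, o) for s, p, o in facts for p1, p2 in pairs if p == p1}


def rule_same_as(facts):
    """If a owl:sameAs b and a p o, infer b p o (both directions)."""
    pairs = {(s, o) for s, p, o in facts if p == "owl:sameAs"}
    pairs |= {(b, a) for a, b in pairs}
    return {
        (b, p, o)
        for s, p, o in facts
        for a, b in pairs
        if s == a and p != "owl:sameAs"
    }


def forward_chain(facts, rules, max_rounds=10):
    """Apply every rule repeatedly until a fixpoint or the round limit is hit."""
    facts = set(facts)
    for _ in range(max_rounds):
        new = set().union(*(rule(facts) for rule in rules)) - facts
        if not new:
            break  # fixpoint reached: no rule produced anything new
        facts |= new
    return facts


facts = {
    ("ex:surname", "owl:equivalentProperty", "foaf:familyName"),
    ("ex:row1", "ex:surname", "Berners-Lee"),
    ("ex:row1", "owl:sameAs", "ex:timbl"),
}
inferred = forward_chain(facts, [rule_equivalent_property, rule_same_as])
```

After chaining, `ex:timbl` carries a `foaf:familyName` value that was never stated directly; it follows from combining the property mapping with the co-reference.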
I can load a dirty CSV file into Virtuoso, and leverage OWL, SPARQL,
Function/Magic Predicates en route to handling:
1. Semantic Disparity
2. Structural Disparity
3. Entity Co-References.
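For the loading step itself, a minimal sketch of turning a dirty CSV into quads might look like the following. It uses only the Python standard library and does not reflect how Virtuoso actually imports CSV; the graph name, the `ex:row/` and `ex:col/` naming scheme, and the sample rows are assumptions made for illustration. Values are only trimmed, not reconciled, since the point above is that cleansing can happen after the load.

```python
import csv
import io

# Hypothetical dirty input: inconsistent case, stray spaces, mixed units, a gap.
DIRTY_CSV = """name, Height ,born
 Tim Berners-Lee ,1.80 m,1955
tim berners-lee,180cm,
"""


def csv_to_quads(text, graph="ex:graph/people", base="ex:row/"):
    """Emit one (graph, subject, predicate, object) quad per non-empty cell."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip().lower() for h in next(reader)]
    quads = []
    for i, row in enumerate(reader, start=1):
        subject = f"{base}{i}"  # each row gets its own identifier
        for column, value in zip(header, row):
            value = value.strip()
            if value:  # skip empty cells instead of asserting blank literals
                quads.append((graph, subject, f"ex:col/{column}", value))
    return quads


quads = csv_to_quads(DIRTY_CSV)
```

Both rows keep their conflicting literals ("1.80 m" and "180cm"); because each row now has an identifier, later rules can relate the rows without first forcing the literals to match.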
Naturally, someone could, and eventually would, write a data
reconciliation tool that looked like Microsoft Access and basically
delivered on the above, while simply riding Virtuoso engines
(ditto any other Quad Store with similar capabilities). It's all going to
happen quicker than most will expect, especially now that OData is part
of the mix re. granular structured linked data, and the universal nature
of the Entity-Attribute-Value model is getting clearer to broader
audiences by the second :-)
Links:
1. http://bit.ly/csFCqC -- Data Reconciliation using TimBL as subject
(note the co-reference and indirect co-reference tab data, which offers a
teaser).
--
Regards,
Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen