Leigh Dodds wrote:
Hi,

On Sunday, March 28, 2010, Kingsley Idehen <kide...@openlinksw.com> wrote:
All,

A very nice data cleansing tool from David and Co. at Freebase.

Yes, it looks very nice. Am looking forward to working with it.

CSVs are clearly the dominant data format in the structured open data
realm. This tool deals with ETL very well. Of course, for those who
appreciate OWL, a lot of what's demonstrated in this demo is also
achievable via "context rules".

Can you (or others) expand on that?

Much of the power in the demo seemed to me to be in the facetting,
scripting of cleansing, analysis of value spaces, etc.

I'd be interested to know how OWL could be applied here.

Cheers,

L.

Leigh,

OWL comes in post load of the data into the Quad Store (clean or dirty). Note, this demo is based on Literal values cleansing. When you have data object identifiers in play you aren't confined to joining data via Literal Values (key difference between RDBMS realm and RDF and other Graph Model realms).

1. Co-reference - via owl:sameAs assertions
2. Dirty Data - use of procedure functions and inverse functional properties
3. Units of Measurement - leveraging locale prowess of HTTP re. ability to identify locale of user agents combined with TCN QoS algorithms (which can be part of SPARQL as we've done re. Virtuoso)
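
To make the co-reference and inverse functional property points concrete, here is a minimal sketch in Python. It is not how Virtuoso or any other Quad Store implements this; it only illustrates joining records via identifiers rather than Literal Values. The function name `coreference_clusters`, the `ex:` identifiers, the mailbox values, and the choice of `foaf:mbox` as the inverse functional property are all assumptions made for illustration.

```python
from collections import defaultdict


def coreference_clusters(triples, same_as="owl:sameAs", ifps=("foaf:mbox",)):
    """Group identifiers that denote the same entity (hypothetical helper).

    Two identifiers land in one cluster when they are linked by an
    owl:sameAs assertion, or when they share a value for a property
    treated as inverse functional.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            # Deterministic choice: the smaller identifier becomes the root.
            parent[max(ra, rb)] = min(ra, rb)

    first_subject = {}  # (property, value) -> first subject seen with it
    for s, p, o in triples:
        find(s)  # register every subject, even if it never merges
        if p == same_as:
            union(s, o)
        elif p in ifps:
            key = (p, o)
            if key in first_subject:
                union(s, first_subject[key])
            else:
                first_subject[key] = s

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return sorted(sorted(members) for members in clusters.values())


# Illustrative data only: three identifiers for one person, one for another.
example = [
    ("ex:person_a", "owl:sameAs", "ex:person_b"),
    ("ex:person_b", "foaf:mbox", "mailto:p@example.org"),
    ("ex:person_c", "foaf:mbox", "mailto:p@example.org"),
    ("ex:other", "foaf:mbox", "mailto:o@example.org"),
]
print(coreference_clusters(example))
```

Here `ex:person_c` joins the cluster without any explicit owl:sameAs assertion, purely because it shares an inverse functional property value with `ex:person_b`.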

You can make rules that incorporate all of the above. You can even do so with SPARQL (plus function/magic predicates) as the Rules Language for constrained forward-chaining in more extreme cases.
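
As a rough illustration of forward-chaining over such rules, the Python sketch below applies rule functions to a set of triples until no new triples appear. Python stands in for SPARQL here purely for illustration; the rule functions, the `ex:` identifiers, and the data are hypothetical, and nothing below reflects Virtuoso's actual rules machinery.

```python
SAME_AS = "owl:sameAs"


def forward_chain(triples, rules):
    """Apply each rule repeatedly until a fixpoint is reached."""
    facts = set(triples)
    while True:
        new = set()
        for rule in rules:
            # Each rule reads the current facts and proposes derived triples.
            new |= set(rule(facts)) - facts
        if not new:
            return facts
        facts |= new


def same_as_symmetric(facts):
    """If a sameAs b, then b sameAs a."""
    for s, p, o in facts:
        if p == SAME_AS:
            yield (o, p, s)


def same_as_copy(facts):
    """If a sameAs b and a has property p with value v, so does b."""
    links = [(s, o) for s, p, o in facts if p == SAME_AS]
    for a, b in links:
        for s, p, o in facts:
            if s == a and p != SAME_AS:
                yield (b, p, o)


# Illustrative data only.
facts = forward_chain(
    [("ex:a", SAME_AS, "ex:b"), ("ex:a", "ex:name", "Example Name")],
    [same_as_symmetric, same_as_copy],
)
for triple in sorted(facts):
    print(triple)
```

The chaining is constrained in the sense that the rules only ever recombine identifiers and values already present, so the loop is guaranteed to terminate.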

I can load a dirty CSV file into Virtuoso, and leverage OWL, SPARQL, Function/Magic Predicates en route to handling:

1. Semantic Disparity
2. Structural Disparity
3. Entity Co-References.

Naturally, someone could, and eventually would, write a data reconciliation tool that looked like Microsoft Access and basically delivered on the above, while simply riding on Virtuoso engines (ditto any other Quad Store with similar capabilities). It's all going to happen quicker than most will expect, especially now that OData is part of the mix re. granular structured linked data, and the universal nature of the Entity-Attribute-Value model is getting clearer to broader audiences by the second :-)

Links:

1. http://bit.ly/csFCqC -- Data Reconciliation using TimBL as subject (note the co-reference and indirect co-reference tab data, which offers a teaser).

--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com
Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca: kidehen




