Re: [uk-government-data-developers] Nice Data Cleansing Tool Demo

Kingsley Idehen Sun, 28 Mar 2010 13:04:41 -0700

Leigh Dodds wrote:

Hi,


On Sunday, March 28, 2010, Kingsley Idehen <[email protected]> wrote:

All,

A very nice data cleansing tool from David and Co. at Freebase.


Yes, it looks very nice. Am looking forward to working with it.

CSVs are clearly the dominant data format in the structured open data
realm. This tool deals with ETL very well. Of course, for those who
appreciate OWL, a lot of what's demonstrated in this demo is also
achievable via "context rules".


Can you (or others) expand on that?

Much of the power in the demo seemed to me to be in the faceting,
scripting of cleansing, analysis of value spaces, etc.

I'd be interested to know how OWL could be applied here.

Cheers,

L.

Leigh,

OWL comes in post load of the data into the Quad Store (clean or dirty). Note, this demo is based on Literal values cleansing. When you have data object identifiers in play you aren't confined to joining data via Literal Values (key difference between RDBMS realm and RDF and other Graph Model realms).


1. Co-reference - via owl:sameAs assertions

2. Dirty Data - use of procedure functions and inverse functional properties

3. Units of Measurement - leveraging locale prowess of HTTP re. ability to identify locale of user agents combined with TCN QoS algorithms (which can be part of SPARQL as we've done re. Virtuoso)

You can make rules that incorporate all of the above, you can even do so with SPARQL (plus function/magic predicates) as the Rules Language for constrained forward-chaining in more extreme cases.
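To make the co-reference and inverse functional property points concrete, here is a minimal sketch in plain Python (not Virtuoso, and not an OWL reasoner). It assumes a tiny in-memory list of triples; the ex: identifiers, the mailbox values, and the choice of foaf:mbox as the inverse functional property are all made up for illustration. Subjects that share a value for an inverse functional property, or that are linked by owl:sameAs, end up in the same group.

```python
from collections import defaultdict

# Hypothetical triples (subject, predicate, object); all names are illustrative.
TRIPLES = [
    ("ex:a", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", "foaf:mbox", "mailto:x@example.org"),
    ("ex:b", "owl:sameAs", "ex:c"),
    ("ex:d", "foaf:mbox", "mailto:y@example.org"),
]
# Assumption: foaf:mbox is treated as an inverse functional property.
INVERSE_FUNCTIONAL = {"foaf:mbox"}


def find(parent, node):
    """Return the representative of node's group, compressing the path."""
    parent.setdefault(node, node)
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def union(parent, left, right):
    """Merge the groups containing left and right."""
    root_l, root_r = find(parent, left), find(parent, right)
    if root_l != root_r:
        # Keep the lexically smaller identifier as the representative.
        parent[max(root_l, root_r)] = min(root_l, root_r)


def coreference_groups(triples, inverse_functional):
    """Group identifiers that denote the same entity."""
    parent = {}
    first_subject = {}  # (predicate, value) -> first subject seen with it
    for s, p, o in triples:
        find(parent, s)  # register the subject even if it stays alone
        if p == "owl:sameAs":
            union(parent, s, o)
        elif p in inverse_functional:
            key = (p, o)
            if key in first_subject:
                # Same value for an inverse functional property: same entity.
                union(parent, first_subject[key], s)
            else:
                first_subject[key] = s
    groups = defaultdict(set)
    for node in list(parent):
        groups[find(parent, node)].add(node)
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in coreference_groups(TRIPLES, INVERSE_FUNCTIONAL):
        print(group)
```

In a Quad Store this grouping would come from the reasoner or from rules rather than from application code; the sketch only shows why identifiers let you join data without relying on matching Literal Values.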

I can load a dirty CSV file into Virtuoso, and leverage OWL, SPARQL, Function/Magic Predicates en route to handling:


1. Semantic Disparity
2. Structural Disparity
3. Entity Co-References.
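As a rough illustration of the structural and semantic disparity cases, the sketch below turns rows from two differently shaped CSV files into Entity-Attribute-Value style triples that share predicate names. It is plain Python using only the standard csv module, not Virtuoso's loader; the sample data, the column names, the ex: predicates, and the mapping table are all assumptions made up for the example.

```python
import csv
import io

# Hypothetical CSV inputs that use different column names for the same facts.
CSV_A = "id,Name,Town\n1,Alice,Leeds\n"
CSV_B = "ref,full_name,city\n7,Bob, York \n"

# Hypothetical mapping from source column names to shared predicate names.
COLUMN_TO_PREDICATE = {
    "Name": "ex:name",
    "full_name": "ex:name",
    "Town": "ex:locality",
    "city": "ex:locality",
}


def rows_to_triples(text, key_column, base):
    """Convert CSV text into (subject, predicate, value) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(text)):
        # Mint an identifier for the row from its key column.
        subject = f"{base}{row[key_column].strip()}"
        for column, value in row.items():
            predicate = COLUMN_TO_PREDICATE.get(column)
            # Skip unmapped columns and empty cells; trim stray whitespace.
            if predicate and value and value.strip():
                triples.append((subject, predicate, value.strip()))
    return triples


if __name__ == "__main__":
    for triple in rows_to_triples(CSV_A, "id", "ex:a/"):
        print(triple)
    for triple in rows_to_triples(CSV_B, "ref", "ex:b/"):
        print(triple)
```

Once both files are expressed against the same predicates, co-reference between ex:a/ and ex:b/ identifiers becomes a matter of assertions and rules rather than string matching on column values.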

Naturally, someone could, and eventually would, write a data reconciliation tool that looked like Microsoft Access and basically delivered on the above, while simply riding Virtuoso engines (ditto any other Quad Store with similar capabilities). It's all going to happen quicker than most will expect, especially now that OData is part of the mix re. granular structured linked data, and the universal nature of the Entity-Attribute-Value model is getting clearer to broader audiences by the second :-)


Links:

1. http://bit.ly/csFCqC -- Data Reconciliation using TimBL as subject (note the co-reference and indirect co-reference tab data, which offers a teaser).


--

Regards,

Kingsley Idehen
President & CEO
OpenLink Software
Web: http://www.openlinksw.com

Weblog: http://www.openlinksw.com/blog/~kidehen

Twitter/Identi.ca: kidehen
