Re: Where are the Linked Data Driven Smart Agents (Bots) ?

2016-07-08 Thread Paul Houle
I would look at the history of the conventional web client as a parallel to
the "semantic web."

Not long after Netscape, it was clear that a "personal web crawler" was
something a person could use to answer a question. Rather than being
dependent on InfoSeek or AltaVista, you could get a much deeper
understanding than you get from Google, which is so oriented to P@1
(precision at the first result).

There are things like w3mir and httrack, and wizards code up task-focused
web crawlers all the time, but you don't see a lot in the way of tools
that let ordinary computer users, say, crawl the web site of some place
like BlackRock and make a list of all the investment funds they run.

Part of that is that a web crawler is a weapon of mass destruction: a good
client can pitch more than many servers can catch, so the easier products
like this are to use, the more complaints you get.

Web browsers have come a long way in a lot of ways, but the mechanisms for
(1) bookmarks and (2) history are not good enough; even if you look at just
that case, there is plenty of data on the client, you are not GET-spamming
people, etc.

Also: what really is the distinction between "client" and "server"? It is
totally practical to run (say) a Windows application on a Windows tablet,
or run the same application on a $10-an-hour server at AWS over Remote
Desktop Protocol, and then the "client" could be Linux or an Amiga or
something, so long as the network is fast.




On Thu, Jul 7, 2016 at 11:27 PM, Krzysztof Janowicz <janow...@ucsb.edu>
wrote:

> As such, it is hard to publish a paper on this
>> at any of the main venues (ISWC / ESWC / …).
>> This discourages working on such themes.
>>
>> Hence, I see much talent and time going to
>> incremental research, which is easy to evaluate well,
>> but not necessarily as ground-breaking.
>>
>
> Yes! I could not agree more. On the other hand, this is all about finding
> the right balance as we also do not want to have tons of 'ideas' papers
> without any substantial content or proof of concept. I remember that there
> was an ISWC session some years ago that tried to introduce such a 'bold
> ideas' track.
>
> Krzysztof
>
> On 07/06/2016 09:38 AM, Ruben Verborgh wrote:
>
>> Hi,
>>
>> This is a very important question for our community,
>> given that smart agents once were an important theme.
>> Actually, the main difference we could bring with the SemWeb
>> is that our clients could be decentralized
>> and actually run on the client side, in contrast to others.
>>
>> One of the main problems I see is how our community
>> (now particularly thinking about the scientific subgroup)
>> receives submissions of novel work.
>> We have evolved into an extremely quantitative-oriented view,
>> where anything that can be measured with numbers
>> is largely favored over anything that cannot.
>>
>> Given that the smart agents / bots field is quite new,
>> we don't know the right evaluation metrics yet.
>> As such, it is hard to publish a paper on this
>> at any of the main venues (ISWC / ESWC / …).
>> This discourages working on such themes.
>>
>> Hence, I see much talent and time going to
>> incremental research, which is easy to evaluate well,
>> but not necessarily as ground-breaking.
>> More than a decade of SemWeb research
>> has mostly brought us intelligent servers,
>> but not yet the intelligent clients we wanted.
>>
>> So perhaps we should phrase the question more broadly:
>> how can we as a community be more open
>> to novel and disruptive technologies?
>>
>> Best,
>>
>> Ruben
>>
>
>
> --
> Krzysztof Janowicz
>
> Geography Department, University of California, Santa Barbara
> 4830 Ellison Hall, Santa Barbara, CA 93106-4060
>
> Email: j...@geog.ucsb.edu
> Webpage: http://geog.ucsb.edu/~jano/
> Semantic Web Journal: http://www.semantic-web-journal.net
>
>
>


-- 
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254 | paul.houle on Skype | ontolo...@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275


Re: Dealing with distributed nature of Linked Data and SPARQL

2016-06-08 Thread Paul Houle
You've got it!

What matters is what your system believes is owl:sameAs from its own
viewpoint, which could come down to whom you trust to assert owl:sameAs.
If you are worried about "inference crashes", pruning this data is the
place to start.

You might want to apply algorithm X to a graph, but data Y fails to have
property Z, which is necessary for X to succeed. This is a general problem
whenever you are sending a product downstream.

A processing module can massage a dataset so that the output graph Y always
has property Z, or it can fail and cry bloody murder if Z is not satisfied,
etc. It can also emit warning messages that you could use to sweep for bad
spots.
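
A check like that can be a one-liner in SPARQL. This is only a sketch, with
a made-up invariant (the target of every owl:sameAs link must have an
rdf:type); substitute whatever property Z really is:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Sweep for owl:sameAs targets that violate the (hypothetical) invariant.
SELECT ?x ?y
WHERE {
  ?x owl:sameAs ?y .
  FILTER NOT EXISTS { ?y rdf:type ?anyType }
}

Run something like that after each processing stage and you get the "sweep
for bad spots" report for free.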


On Wed, Jun 8, 2016 at 1:50 PM, Rob Davidson <rob.les.david...@gmail.com>
wrote:

> I'm not sure if I'm following exactly, so bear with me...
>
> If we have the same entity served up by two different sources then we
> might expect in an ideal world that there would be an OWL:sameAs or
> SKOS:exactMatch linking the two.
>
> If we have the same entity served by the same provider but via two
> different endpoints then we might expect something a bit like a
> DCAT:distribution link relating the two.
>
> Of course we might not have these specific links but I'm just trying to
> define the likely scenarios/use-cases.
>
> In either case, it's possible that the descriptions would be out of date
> and/or contradictory - this might cause inference crashes or simply be
> confusing if we tried to merge them too closely.
>
> Prioritising description fields based on the distribution method seems a
> little naive in that I might run either endpoint for a while, realise my
> users prefer the alternative and thus change technology in a direction
> unique to my users - not in a predictable fashion.
>
> So the only way I can see around this is to pool the descriptions but have
> them distinguished using the other metadata that indicates they come from
> different endpoints/sources/authors - keeping the descriptions on different
> graphs I suppose.
>
>
>
>
> On 8 June 2016 at 14:52, Paul Houle <ontolo...@gmail.com> wrote:
>
>> The vanilla RDF answer is that the data gathering module ought to pack
>> all of the graphs it got into named graphs that are part of a data set and
>> then pass that towards the consumer.
>>
>> You can union the named graphs for a primitive but effective kind of
>> "merge" or put in some module downstream that composites the graphs in some
>> arbitrary manner,  such as something that converts statements about people
>> to foaf: vocabulary to produce enough graph that would be piped downstream
>> to a foaf: consumer for instance.
>>
>> The named graphs give you sufficient anchor points to fill up another
>> dataset with metadata about what happened in the processing process so you
>> can follow "who is responsible for fact X?" past the initial data
>> transformations.
>>
>> On Wed, Jun 8, 2016 at 8:29 AM, Gray, Alasdair J G <a.j.g.g...@hw.ac.uk>
>> wrote:
>>
>>> Hi
>>>
>>> Option 3 seems sensible, particularly if you keep them in separate
>>> graphs.
>>>
>>> However shouldn’t you consider the provenance of the sources and
>>> prioritise them on how recent they were updated?
>>>
>>> Alasdair
>>>
>>> On 8 Jun 2016, at 13:06, Martynas Jusevičius <marty...@graphity.org>
>>> wrote:
>>>
>>> Hey all,
>>>
>>> we are developing software that consumes data both from Linked Data
>>> and SPARQL endpoints.
>>>
>>> Most of the time, these technologies complement each other. We've come
>>> across an issue though, which occurs in situations where RDF
>>> description of the same resources is available using both of them.
>>>
>>> Let's take a resource http://data.semanticweb.org/person/andy-seaborne
>>> as an example. Its RDF description is available in at least 2
>>> locations:
>>> - on a SPARQL endpoint:
>>>
>>> http://xmllondon.com/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fdata.semanticweb.org%2Fperson%2Fandy-seaborne%3E
>>> - as Linked Data: http://data.semanticweb.org/person/andy-seaborne/rdf
>>>
>>> These descriptions could be identical (I haven't checked), but it is
>>> more likely than not that they're out of sync, complementary, or
>>> possibly even contradicting each other, if reasoning is considered.
>>>
>>> If a software agent has access to both the SPARQL endpoint and Linked
>>> Data resource, what should it consider as the resource description?
>>> There are at least 3 options:
>>>

Re: Dealing with distributed nature of Linked Data and SPARQL

2016-06-08 Thread Paul Houle
The vanilla RDF answer is that the data-gathering module ought to pack all
of the graphs it got into named graphs that are part of a dataset and then
pass that towards the consumer.

You can union the named graphs for a primitive but effective kind of
"merge", or put in some module downstream that composites the graphs in
some arbitrary manner, such as something that converts statements about
people to the foaf: vocabulary to produce a graph that can be piped
downstream to a foaf: consumer, for instance.

The named graphs give you sufficient anchor points to fill up another
dataset with metadata about what happened during processing, so you can
follow "who is responsible for fact X?" back past the initial data
transformations.
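
A rough sketch of what such a dataset looks like on the wire, in TriG, with
made-up graph names, illustrative triples, and an invented bookkeeping
predicate:

@prefix ex:   <http://example.org/> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .

ex:fromSparqlEndpoint {
    <http://data.semanticweb.org/person/andy-seaborne> foaf:name "Andy Seaborne" .
}

ex:fromLinkedData {
    <http://data.semanticweb.org/person/andy-seaborne> foaf:nick "andy" .
}

ex:metadata {
    ex:fromSparqlEndpoint ex:retrievedFrom <http://xmllondon.com/sparql> .
    ex:fromLinkedData     ex:retrievedFrom <http://data.semanticweb.org/person/andy-seaborne/rdf> .
}

"Who said X?" then becomes an ordinary GRAPH query joining the data graphs
against the metadata graph.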

On Wed, Jun 8, 2016 at 8:29 AM, Gray, Alasdair J G <a.j.g.g...@hw.ac.uk>
wrote:

> Hi
>
> Option 3 seems sensible, particularly if you keep them in separate graphs.
>
> However shouldn’t you consider the provenance of the sources and
> prioritise them on how recent they were updated?
>
> Alasdair
>
> On 8 Jun 2016, at 13:06, Martynas Jusevičius <marty...@graphity.org>
> wrote:
>
> Hey all,
>
> we are developing software that consumes data both from Linked Data
> and SPARQL endpoints.
>
> Most of the time, these technologies complement each other. We've come
> across an issue though, which occurs in situations where RDF
> description of the same resources is available using both of them.
>
> Let's take a resource http://data.semanticweb.org/person/andy-seaborne
> as an example. Its RDF description is available in at least 2
> locations:
> - on a SPARQL endpoint:
>
> http://xmllondon.com/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fdata.semanticweb.org%2Fperson%2Fandy-seaborne%3E
> - as Linked Data: http://data.semanticweb.org/person/andy-seaborne/rdf
>
> These descriptions could be identical (I haven't checked), but it is
> more likely than not that they're out of sync, complementary, or
> possibly even contradicting each other, if reasoning is considered.
>
> If a software agent has access to both the SPARQL endpoint and Linked
> Data resource, what should it consider as the resource description?
> There are at least 3 options:
> 1. prioritize SPARQL description over Linked Data
> 2. prioritize Linked Data description over SPARQL
> 3. merge both descriptions
>
> I am leaning towards #3 as the sensible solution. But then I think the
> end-user should be informed which part of the description came from
> which source. This would be problematic if the descriptions are
> triples only, but should be doable with quads. That leads to another
> problem however, that both LD and SPARQL responses are under-specified
> in terms of quads.
>
> What do you think? Maybe this is a well-known issue, in which case
> please enlighten me with some articles :)
>
>
> Martynas
> atomgraph.com
> @atomgraphhq
>
>
> Alasdair J G Gray
> Fellow of the Higher Education Academy
> Assistant Professor in Computer Science,
> School of Mathematical and Computer Sciences
> (Athena SWAN Bronze Award)
> Heriot-Watt University, Edinburgh UK.
>
> Email: a.j.g.g...@hw.ac.uk
> Web: http://www.macs.hw.ac.uk/~ajg33
> ORCID: http://orcid.org/-0002-5711-4872
> Office: Earl Mountbatten Building 1.39
> Twitter: @gray_alasdair
>





Re: Deprecating owl:sameAs

2016-04-01 Thread Paul Houle
It is not about stopping people from building naive applications; it is
about starting to build smart applications.

Trust and provenance will only get you so far. They can easily be the royal
road to becoming very good at seeing the Emperor's clothes. Even
authoritative sources often have singularities or mismatches that break
basic invariants you would expect to hold.

Decades ago my friends and I were reading the CIA World Factbook on a
Friday night, thinking how profound it was that there was a $100 billion
excess of global "exports" over "imports", and that perhaps we'd stumbled
on evidence of extraterrestrial life or a secret civilization hidden
underground.

Eventually we figured that some of the exports wind up on the ocean floor,
wash up on shore, or get stuck for centuries in gyres. There is also a
ratchet effect: pirates, government officials and other thieves are more
likely to remove valuable exports from ships and warehouses than to deposit
them, and so forth.

Now the accountants have gone through 15 years of blood, sweat and tears to
get XBRL financial reports that are logically sound 99% of the time for
U.S. public companies. It is a problem if you are preparing financial
reports for the state, the bank, investors, etc. and these invariants are
not met.

Structurally, all kinds of demographic and similar numbers can be
hypercubed like XBRL, but for a whole bunch of reasons they will defy
reason and never quite "add up" when you compare multiple sources. (I can
point to a census block where 200 people did not get counted because I
didn't count them; the World Bank numbers for Nigeria are implausible for
many reasons, etc.)

As Reagan said, "Trust but verify", and that is the essence of being a
reasonable animal.

Compare your input data with itself, against its requirements, and against
the experience of the system and its users, and you will find your
(system's) truth.





On Fri, Apr 1, 2016 at 9:16 AM, Barry Norton <barrynor...@gmail.com> wrote:

> Or we could stop building naive applications that treat assertion as fact,
> and instead only reason on statements we accept based on trust and
> provenance. Wasn't that the plan?
>
> Regards,
>
> Barry
>
> On Fri, Apr 1, 2016 at 2:01 PM, Sarven Capadisli <i...@csarven.ca> wrote:
>
>> There is overwhelming research [1, 2, 3] and I think it is evident at
>> this point that owl:sameAs is used inarticulately in the LOD cloud.
>>
>> The research that I've done makes me conclude that we need to do a
>> massive sweep of the LOD cloud and adopt owl:sameSameButDifferent.
>>
>> I think the terminology is human-friendly enough that there will be
>> minimal confusion down the line, but for the the pedants among us, we can
>> define it along the lines of:
>>
>>
>> The built-in OWL property owl:sameSameButDifferent links things to
>> things. Such an owl:sameSameButDifferent statement indicates that two URI
>> references actually refer to the same thing but may be different under some
>> circumstances.
>>
>>
>> Thoughts?
>>
>> [1] https://www.w3.org/2009/12/rdf-ws/papers/ws21
>> [2] http://www.bbc.co.uk/ontologies/coreconcepts#terms_sameAs
>> [3] http://schema.org/sameAs
>>
>> -Sarven
>> http://csarven.ca/#i
>>
>>
>




Re: Deprecating owl:sameAs

2016-04-01 Thread Paul Houle
I like the predicate :rewritesTo where

:x :rewritesTo :y .

entails that

:x ?a ?b => :y ?a ?b
?a :x ?b => ?a :y ?b
?a ?b :x => ?a ?b :y

And the "=>" means the second triple REPLACES the original triple.  For
instance this is often done when copying triples from one graph to
another.  This same operation can be done on triples that appear in a
SPARQL query.
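
In SPARQL Update the subject-position rule comes out as something like the
following sketch; ex: is a stand-in namespace, and the predicate and object
positions are handled the same way:

PREFIX ex: <http://example.org/>

DELETE { ?x ?p ?o }
INSERT { ?y ?p ?o }
WHERE {
  ?x ex:rewritesTo ?y .
  ?x ?p ?o .
  FILTER (?p != ex:rewritesTo)   # keep the rewrite rule itself out of the rewrite
}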

The above accomplishes what people are trying to accomplish with
owl:sameAs, with the difference that it 'works' in the sense that it
doesn't cause the number of entailments to explode. If you are one of the
few and the proud who care whether they get useful results from SPARQL
queries, it is a good way to maintain a "UNA bubble" (unique name
assumption) and seems to be a rather complete solution to the problem of
merging concepts.

Now :sameButDifferent is a more complex one, although the likes of Count
Korzybski would maintain that

?a owl:sameAs ?a

is not necessarily true, since ?a can always be split further into
subconcepts ("the number 2 that belongs to the emperor", "the number 2
that, from a long way away, looks like a fly").

Concept splits are a different issue because you need a rule base or some
other procedure to look at a concept plus its surrounding relationships and
decide how to do the split. It is very doable, but it takes more than one
triple to express.




On Fri, Apr 1, 2016 at 9:01 AM, Sarven Capadisli <i...@csarven.ca> wrote:

> There is overwhelming research [1, 2, 3] and I think it is evident at this
> point that owl:sameAs is used inarticulately in the LOD cloud.
>
> The research that I've done makes me conclude that we need to do a massive
> sweep of the LOD cloud and adopt owl:sameSameButDifferent.
>
> I think the terminology is human-friendly enough that there will be
> minimal confusion down the line, but for the the pedants among us, we can
> define it along the lines of:
>
>
> The built-in OWL property owl:sameSameButDifferent links things to things.
> Such an owl:sameSameButDifferent statement indicates that two URI
> references actually refer to the same thing but may be different under some
> circumstances.
>
>
> Thoughts?
>
> [1] https://www.w3.org/2009/12/rdf-ws/papers/ws21
> [2] http://www.bbc.co.uk/ontologies/coreconcepts#terms_sameAs
> [3] http://schema.org/sameAs
>
> -Sarven
> http://csarven.ca/#i
>
>




Re: vocabularies for votes and suggestions

2016-01-24 Thread Paul Houle
Schema.org definitely covers this:

http://schema.org/VoteAction

It seems you could also add a little bit to SIOC to cover this, since SIOC
knows a bit about Posts, Comments and things like that.

What kind of database are you using to run the queries, some kind of
relational or non-relational store?

I would develop a very simple and stupid mapping from the database
structure to RDF; for instance, a relational row could be rendered as

[
   a :IceCreamFlavor ;
   :firstColumnName "Rocky Road" ;
   :secondColumnName 277
] .

and you can use "the good parts" (at least the well-understood parts) of
JSON-LD to do the same for all the popular JSON databases out there, the
ones that are just missing the "-LD".

The key here is a direct physical mapping that preserves, as close to 100%
as possible, all of the data.

At that point you have SPARQL and all that, and it would be no problem
writing some SPARQL to convert that to some vocabulary you can be proud of.
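
For instance, if the same sort of direct mapping were run over a votes
table (class and column predicates invented for illustration), a CONSTRUCT
along these lines would lift it into schema.org terms:

PREFIX schema: <http://schema.org/>
PREFIX :       <http://example.org/votes/>

CONSTRUCT {
  ?row a schema:VoteAction ;
       schema:agent  ?voter ;
       schema:object ?suggestion .
}
WHERE {
  ?row a :VoteRow ;
       :voterColumn      ?voter ;
       :suggestionColumn ?suggestion .
}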

The point is to get it into the RDF universe right away, and then you don't
have to waste a minute extra with all the non-RDF tools that suck!

On Sun, Jan 24, 2016 at 10:58 AM, Alexandre Rademaker <aradema...@gmail.com>
wrote:

> Hi all,
>
> Does anyone know vocabs for describe votes and suggestions? Our Portuguese
> Wordnet in  http://wnpt.brlcloud.com/wn/ is distributed from its
> beginning as RDF. We have recently develop this web interface for people
> make suggestions and vote in suggestions. Now we would like to have the
> suggestions and vote also in the triple store. Links and references are
> welcome.
>
> Best,
>
> 
> Alexandre Rademaker
> http://arademaker.gtihub.com
> http://researcher.ibm.com/person/br-alexrad
>
>
>




Re: What Happened to the Semantic Web?

2015-11-12 Thread Paul Houle
Really I don't see how much better the search results on the right ("Google
CSE") are than the ones on the left. It is a little like this:

http://www.audiocheck.net/blindtests_16vs8bit_NeilYoung.php

Google and Bing are stuck with P@1 at 70% or so because they don't always
know the intent of the question.

Various systems that put documents in a blender, discarding the order of
the words, perform astonishingly well at search, classification and other
tasks. Anything "smarter" than this has to solve the "difficult" problems
that remain, and little steps (like "not good" -> "bad" for sentiment
analysis) help in only a few marginal cases.

The promise of semantics is to give people an experience they never had
before,  not move some score from .81 to .83.

On Thu, Nov 12, 2015 at 9:29 AM, David Wood <da...@3roundstones.com> wrote:

>
> On Nov 12, 2015, at 07:51, Kingsley Idehen <kide...@openlinksw.com> wrote:
>
> On 11/12/15 6:45 AM, Nicolas Chauvat wrote:
>
> Hi,
>
> On Wed, Nov 11, 2015 at 02:27:10PM -0500, Kingsley Idehen wrote:
>
> > > To me, The Semantic Web is like Google, but then run on my machine.
>
> > > To me its just a Web of Data [...]
>
> Ruben says "The Semantic Web" and Kingsley answers "just a Web of Data".
>
> In my tutorial "introduction to the semantic web" last week at
> SemWeb.Pro, I presented the Semantic Web and the Web of Data as the
> same thing.
>
> Then Fabien Gandon from Inria summarized the first session of the MOOC
> "Le Web Sémantique" and distinguished two items in a couple
> (web of data ; semantic web).
>
> It made me think that splitting the thing in two after the fact might
> have benefits:
>
> - Web of Data = what works today = 1st deliverable of the SemWeb Project
>
> - Semantic Web = what will work = prov, trust, inference, smartclient, etc.
>
> It allows us to say that The Semantic Web Project **has*delivered** its
> version 1, nicknamed "Web of Data", and that more versions will follow.
>
> [Hopefully in a couple years the "Web of Data" will have completely
> merged with the One True Web and nobody will care about making a
> distinction any more]
>
> That way of putting things fits well with the iterative/agile/lean
> culture of project management that is now spreading all over.
>
> Do you know of people that have been trying to sell things this way?
>
>
> Hopefully everyone :)
>
>
>
> +1  :)
>
> Regards,
> Dave
> --
> http://about.me/david_wood
>
>
>
>
> --
> Regards,
>
> Kingsley Idehen   
> Founder & CEO
> OpenLink Software
> Company Web: http://www.openlinksw.com
> Personal Weblog 1: http://kidehen.blogspot.com
> Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
> Twitter Profile: https://twitter.com/kidehen
> Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
> LinkedIn Profile: http://www.linkedin.com/in/kidehen
> Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
>
>




Re: [Dbpedia-discussion] DBtax questions

2015-09-18 Thread Paul Houle
> [...] assign as many types as possible,
> provided that they are different from owl#Thing.
> In this way, we can cluster entities with more meaningful types and query
> the knowledge base accordingly.
>
> Of course, you can say that owl#Thing has 100% coverage, but does it make
> sense?
> The claimed 99% stems instead from a *set* of more specific types.
> Then high recall comes with a precision cost.
>
> On 9/17/15 4:04 PM, Magnus Knuth wrote:
> > One structural problem I recognized when seeing the approach [
> http://jens-lehmann.org/files/2015/semantics_dbtax.pdf], is that there is
> in most (non-complex) categories an article having exactly the same name,
> e.g. dbr:President dc:subject dbc:President. And indeed these resources are
> typed accordingly, e.g. http://it.dbpedia.org/resource/Presidente is a
> dbtax:President and http://it.dbpedia.org/resource/Pagoda is dbtax:Pagoda.
> That is obvious for a human, but is it the same for an algorithm? :-)
> >
> > A type coverage of more than 99 percent is very suspicious, because I’d
> expect much more resources in DBpedia not type-able. Why? A lot of articles
> in DBpedia describe very abstract concepts, e.g. Liberty, Nationality,
> Social_inequality (well, you have the class dbtax:Concept, but what is on
> the other hand not a concept?), or they describe classes by their selves,
> e.g. President, Country, Person, Plane (well, you have the class
> dbtax:Classification, but it is not used as such [
> http://it.dbpedia.org/sparql?default-graph-uri==SELECT+*+%7B%3Fres+a+%3Chttp%3A%2F%2Fdbpedia.org%2Fdbtax%2FClassification%3E%7D=text%2Fhtml=on]).
> For some articles it is arguable whether they are instance or class, e.g.
> Volkswagen_Polo, Horse.
> >
> > I see that the classes you extracted are truly valuable for enriching
> the DBpedia ontology, but it obviously needs some tidy up and disambiguate
> efforts.
> I completely agree: I think we should merge DBTax into the DBpedia
> ontology mappings wiki to do so.
> BTW, DBTax overlaps with the DBpedia ontology by more than 20%.
>
> Cheers!
>
>
>
>
>





Re: Recommendation for transformation of RDF/XML to JSON-LD in a web browser?

2015-09-03 Thread Paul Houle
I think you have the best choice of RDF tooling in the Java world. The
transformation you want to do is dead easy with Jena, which supports
RDF/XML and many other formats:

https://jena.apache.org/documentation/io/

There are all kinds of ways to write web services in Java, so this is a
simple task.

JavaScript in the browser has access to the XML parser in the web browser
(written in C), but this is not available in Node.js, so you need to use a
library for that. This tool claims to read and write both formats you are
interested in:

https://github.com/rdf-ext/rdf-ext






On Thu, Sep 3, 2015 at 10:19 AM, Frans Knibbe <frans.kni...@geodan.nl>
wrote:

> Hello,
>
> In a web application that is working with RDF data I would like to have
> all data available as JSON-LD, because I believe it is the easiest RDF
> format to process in a web application. At the moment I am particularly
> looking at processing vocabulary data. I think I can assume that such data
> will at least be available as RDF/XML. So I am looking for a way to
> transform RDF/XML to JSON-LD in a web browser.
>
> What would be the best or easiest way to do this? Attempt the
> transformation in the browser, using jsonld.js
> <https://github.com/digitalbazaar/jsonld.js> plus something else? Or use
> a server side component? And in the case of a server side component, which
> programming environment could be recommended? Python? Node.js? Any general
> or specific advice would be welcome.
>
> Greetings,
> Frans
>
> --
> Frans Knibbe
> Geodan
> President Kennedylaan 1
> 1079 MB Amsterdam (NL)
>
> T +31 (0)20 - 5711 347
> E frans.kni...@geodan.nl
> www.geodan.nl
> disclaimer <http://www.geodan.nl/disclaimer>
>
>




Re: Please publish Turtle or JSON-LD instead of RDF/XML [was Re: Recommendation for transformation of RDF/XML to JSON-LD in a web browser?]

2015-09-03 Thread Paul Houle
Bernadette,

 it is not just perception,  it is reality.

 People find JSON-LD easy to work with,  and often it is a simple
lossless model-driven transformation from an RDF graph to a JSON graph that
people can do what they want with.

 Ultimately RDF is a universal data model and it is the data model that
is important,  NOT the specific implementations.  For instance you can do a
model-driven transformation of data from RDF to JSON-LD and then any JSON
user can access it with few hangups even if they are unaware of JSON-LD.
Add some JSON-LD tooling and you've got JSON++.

We can use relational-logical-graphical methods to process the data, and
we can accept and publish JSON with the greatest of ease.

On Thu, Sep 3, 2015 at 5:18 PM, Bernadette Hyland <bhyl...@3roundstones.com>
wrote:

> +1 David, well said.
>
> Amazing how much the mention of JSON (in the phase JSON-LD) puts people at
> ease vs. RDF .  JSON-LD as a Recommendation has helped lower the
> defenses of many who used to get their hackles up and say ‘RDF is too hard'.
>
> Perception counts for a lot, even for highly technical people including
> Web developers.
>
> Cheers,
>
> Bernadette Hyland
> CEO, 3 Round Stones, Inc.
>
> http://3roundstones.com  || http://about.me/bernadettehyland
>
>
> On Sep 3, 2015, at 1:03 PM, David Booth <da...@dbooth.org> wrote:
>
> Side note: RDF/XML was the first RDF serialization standardized, over 15
> years ago, at a time when XML was all the buzz. Since then other
> serializations have been standardized that are far more human friendly to
> read and write, and easier for programmers to use, such as Turtle and
> JSON-LD.
>
> However, even beyond ease of use, one of the biggest problems with RDF/XML
> that I and others have seen over the years is that it misleads people into
> thinking that RDF is a dialect of XML, and it is not.  I'm sure this
> misconception was reinforced by the unfortunate depiction of XML in the
> foundation of the (now infamous) semantic web layer cake of 2001, which in
> hindsight is just plain wrong:
> http://www.w3.org/2001/09/06-ecdl/slide17-0.html
> (Admittedly JSON-LD may run a similar risk, but I think that risk is
> mitigated now by the fact that RDF is already more established in its own
> right.)
>
> I encourage all RDF publishers to use one of the other standard RDF
> formats such as Turtle or JSON-LD.  All commonly used RDF tools now support
> Turtle, and many or most already support JSON-LD.
>
> RDF/XML is not officially deprecated, but I personally hope that in the
> next round of RDF updates, we will quietly thank RDF/XML for its faithful
> service and mark it as deprecated.
>
> David Booth
>
>
>




Re: Please publish Turtle or JSON-LD instead of RDF/XML [was Re: Recommendation for transformation of RDF/XML to JSON-LD in a web browser?]

2015-09-03 Thread Paul Houle
The discussion in the other thread has shown that RDF/XML conversion to
JSON-LD or Turtle is easy in many popular languages. Often I have to export
stuff as RDF/XML for use by older tools, and I have not hit the corner
cases in the export, although I sure have hit corner cases inside the tools
and with good old N-Triples.

I agree with the theme that pedagogy is important: RDF beginners or
part-timers need to see how simple it really is, and the newer formats help
with that.

There is though a change in psychology that comes with JSON-LD which boils
down to the role of ordered collections in RDF.

Despite all it lacks, JSON is broadly popular, and I think that is because
it is based on the $-scalar, @-list and %-hash model that was in Perl; most
dynamic languages since then have put those types at your fingertips, and
they are in the standard library even in the more awkward languages.

There is a fear of ordered collections in RDF because they bring up a
number of issues with blank nodes, but when it comes down to it you can
express what people are trying to express in JSON in Turtle and it doesn't
look all that different.
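
For example, the JSON array ["vanilla", "chocolate", "strawberry"] comes
out in Turtle roughly like this (the property name is made up):

@prefix ex: <http://example.org/> .

ex:menu ex:flavors ( "vanilla" "chocolate" "strawberry" ) .

Under the hood that is the usual rdf:first/rdf:rest chain of blank nodes,
which is where the fear comes from, but on the page it reads about the same
as the JSON.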

JSON-LD transparently adds what is missing in JSON, such as decimal types,
which are correct for handling money, as well as the ISO 8601 'standard'
for dates and times, which is better than no standard at all.

It is so much fun to work with an RDF graph the same way you would with a
JSON document, then do some rules-based inference, do a SPARQL query, or
add the graph to a big graph with millions of other documents and query
*that* with SPARQL.

The connection with XML is something to be revisited too. Kurt Cagle told
me that he got into RDF because he saw it as the logical continuation of
XML. That's really right, since RDF incorporates the types from XML Schema.

What we need is a way to throw XML into a meat grinder and decompose it
into triples without any configuration at all. I will say I have had my
fill of GRDDL, XSLT and all they stand for, and much prefer to just build
out an XML DOM graph, which won't be quite the graph you want, but with
SPARQL CONSTRUCT queries, production rules, and a few specialized operators
it is easy and much more structurally stable than XSLT.
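
To give the flavor of it, suppose the zero-configuration loader emitted a
DOM-ish graph using a made-up vocabulary of :tag, :child and :text; then
lifting <person><name>Alice</name></person> into foaf: is one CONSTRUCT:

PREFIX :     <http://example.org/dom/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

CONSTRUCT {
  ?personNode foaf:name ?name .
}
WHERE {
  ?personNode :tag "person" ;
              :child ?nameNode .
  ?nameNode   :tag "name" ;
              :text ?name .
}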



On Thu, Sep 3, 2015 at 3:04 PM, Sarven Capadisli <i...@csarven.ca> wrote:

> On 2015-09-03 19:03, David Booth wrote:
>
>> I encourage all RDF publishers to use one of the other standard RDF
>> formats such as Turtle or JSON-LD.  All commonly used RDF tools now
>> support Turtle, and many or most already support JSON-LD.
>>
>
> I have grown to (or to be brutally honest; tried very hard to) remain
> agnostic about the RDF formats. This is simply because given sufficient
> context, it is trivial to point out which format is preferable for both
> publication and expected consumption.
>
> The decision to pick one or more formats over the other can easily boil
> down to understanding how and what will be handling the formats in the
> whole data pipeline.
>
> It is great to see newcomers learn N-Triples/Turtle, because it is as
> human-friendly as it gets (at this time) to read and write statements. That
> experience is also an excellent investment towards SPARQL. Having said
> that, we are not yet at a state to publish semantically meaningful HTML
> documents by authoring Turtle. There is the expectation that some other out
> of band code or application needs to wrap it all up.
>
> By the same token, JSON-LD is excellent for building applications by
> imperative means, however, it is stuck in a world where it is dependent on
> languages to manipulate and make use of the data. To generate it, it
> depends on something else as well.  structure just like RDF/XML here. GOTO 10>.
>
> At the end of the day, however the data is pulled or pushed, it needs to
> end up on some user-interface. That UI is arguably and predominantly an
> HTML document out there. Hence my argument is that, all roads lead to HTML.
>
> As I see it, RDFa gets the most mileage above all other formats for prose
> content, and a fair amount of re-use. It ends up on a webpage that is
> intended for humans, meanwhile remaining machine-friendly. A single code
> base (which is mostly declarative), a single GET, a single URL
> representation to achieve all of that.
>
> I still remain agnostic on this matter, because there is no one size fits
> all. After all, in the command-line, N-Triples still has the last word.
>
> So, as long as one speaks the RDF language, the rest tends to be something
> that the machines should be doing on behalf of humans any way, and that
> ought to remain as the primary focus. That is, speak RDF, keep improving
> the UI for it.
>
> All formats bound to age - with the exception of HTML of course, because
> it still rocks and has yet to fail! ;)
>
> -Sarven
> http://csarven.ca

Re: Please publish Turtle or JSON-LD instead of RDF/XML [was Re: Recommendation for transformation of RDF/XML to JSON-LD in a web browser?]

2015-09-03 Thread Paul Houle
Maybe saying it is "pedagogical" is being charitable.

Working programmers learn from books much less than they should; the
strategy of "search Google, copy some code from Stack Overflow, and mess
around with it until it seems to work" is common, and what we can do about
it is push the new syntax so hard, in every way, that we drown out the old
stuff.

On Thu, Sep 3, 2015 at 5:40 PM, Kevin Ford <k...@3windmills.com> wrote:

> Hi David,
>
> If by 'publishing' you mean 'from a web service for consumption' then I
> feel the suggestion to deprecate RDF/XML is an over correction.  Of course,
> it is not too difficult to move between RDF serializations, but if the
> publishing service provides a variety of serializations, it is likely to
> increase the usefulness of that service to a consumer.
>
> Diminish its role as a pedagogical tool.  That's the issue, no?
>
> Best,
> Kevin
>
>
>
> On 9/3/15 4:11 PM, David Booth wrote:
>
>> Hi John,
>>
>> I can appreciate the value of RDF/XML for certain processing tasks, and
>> I'm okay with keeping RDF/XML alive as a *processing* format.  My
>> suggestion to deprecate RDF/XML was intended to apply to its use as a
>> *publishing* format.
>>
>> Thanks,
>> David Booth
>>
>> On 09/03/2015 03:52 PM, John Walker wrote:
>>
>>> Hi Martynas,
>>>
>>> Indeed abandoning XML based serialisations would be foolish IMHO.
>>>
>>> Both RDF/XML and TriX can be extremely useful in certain circumstances.
>>>
>>> John
>>>
>>> On 3 Sep 2015, at 19:53, Martynas Jusevičius <marty...@graphity.org>
>>> wrote:
>>>
>>> With due respect, I think it would be foolish to burn the bridges to
>>>> XML. The XML standards and infrastructure are very well developed,
>>>> much more so than JSON-LD's. We use XSLT extensively on RDF/XML.
>>>>
>>>> Martynas
>>>> graphityhq.com
>>>>
>>>> On Thu, Sep 3, 2015 at 8:03 PM, David Booth <da...@dbooth.org> wrote:
>>>>
>>>>> Side note: RDF/XML was the first RDF serialization standardized,
>>>>> over 15
>>>>> years ago, at a time when XML was all the buzz. Since then other
>>>>> serializations have been standardized that are far more human
>>>>> friendly to
>>>>> read and write, and easier for programmers to use, such as Turtle and
>>>>> JSON-LD.
>>>>>
>>>>> However, even beyond ease of use, one of the biggest problems with
>>>>> RDF/XML
>>>>> that I and others have seen over the years is that it misleads
>>>>> people into
>>>>> thinking that RDF is a dialect of XML, and it is not.  I'm sure this
>>>>> misconception was reinforced by the unfortunate depiction of XML in the
>>>>> foundation of the (now infamous) semantic web layer cake of 2001,
>>>>> which in
>>>>> hindsight is just plain wrong:
>>>>> http://www.w3.org/2001/09/06-ecdl/slide17-0.html
>>>>> (Admittedly JSON-LD may run a similar risk, but I think that risk is
>>>>> mitigated now by the fact that RDF is already more established in
>>>>> its own
>>>>> right.)
>>>>>
>>>>> I encourage all RDF publishers to use one of the other standard RDF
>>>>> formats
>>>>> such as Turtle or JSON-LD.  All commonly used RDF tools now support
>>>>> Turtle,
>>>>> and many or most already support JSON-LD.
>>>>>
>>>>> RDF/XML is not officially deprecated, but I personally hope that in
>>>>> the next
>>>>> round of RDF updates, we will quietly thank RDF/XML for its faithful
>>>>> service
>>>>> and mark it as deprecated.
>>>>>
>>>>> David Booth
>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>




Should openaddresses publish Linked Data?

2015-07-27 Thread Paul Houle
see discussion here:

https://github.com/openaddresses/openaddresses/issues/1098



Freebase is dead, long live :BaseKB

2015-07-15 Thread Paul Houle
We've just released the ultimate (that is, the last) release of :BaseKB,
based on the final Freebase data dump.

Freebase will soon be discontinuing the API which allows queries through
the proprietary MQL interface; however, we have converted their data dump
into a 1.2-billion-triple product which is standards compliant.

http://basekb.com/gold/

Those interested in experimenting with or evaluating this product should
try the Amazon Marketplace AMI, which will give you the data, together with
a powerful triple store, stacked up on the right hardware for great
performance, in just ten minutes.



Re: [ANN] beta release of 'Linked Data Reactor' for component-based LD application development

2015-06-23 Thread Paul Houle
I like the speed.

There is still the same problem most of these things have, namely that the
ratio of chrome to facts is not so good. The UI as it is would be a good
foundation for a full editing interface, but I think demos need to have
denser-looking landing pages.

On Tue, Jun 23, 2015 at 11:44 AM, Melvin Carvalho melvincarva...@gmail.com
wrote:



 On 23 June 2015 at 16:55, Kingsley Idehen kide...@openlinksw.com wrote:

 On 6/23/15 9:07 AM, Ali Khalili wrote:

 Dear all,
 we are happy to announce the beta release of our LD-R (Linked Data
 Reactor) framework for developing component-based Linked Data applications:

 http://ld-r.org

 The LD-R framework combines several state-of-the-art Web technologies to
 realize the vision of Linked Data components.
 LD-R is centered around Facebook's ReactJS and Flux architecture for
 developing Web components with single directional data flow.
 LD-R offers the first Isomorphic Semantic Web application (i.e. using
 the same code for both client and server side) by dehydrating/rehydrating
 states between the client and server.

 The main features of LD-R are:
 - User Interface as first class citizen.
 - Isomorphic SW application development
 - Reuse of current Web components within SW apps.
 - Sharing components and configs rather than application code.
 - Separation of concerns.
 - Flexible theming for SW apps.

 This is the beta release of LD-R and we are working on enhancing the
 framework. Therefore, your feedback is more than welcome.

 For more information, please check the documentation on http://ld-r.org
 or refer to the Github repository at http://github.com/ali1k/ld-r

 Ali Khalili
 Knowledge Representation and Reasoning (KRR) research group
 The Network Institute
 Computer Science Department
 VU University Amsterdam
 http://krr.cs.vu.nl/


 Ali,

 Great stuff !

 BTW -- What credentials should be used with the demo? If signup is
 required, what's the signup verification turnaround time?


 My user account was activated in about 10 minutes (may have been quicker)

 However the profile document gives a 404

 http://ld-r.org/user/1435073636



 --
 Regards,

 Kingsley Idehen
 Founder  CEO
 OpenLink Software
 Company Web: http://www.openlinksw.com
 Personal Weblog 1: http://kidehen.blogspot.com
 Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
 Twitter Profile: https://twitter.com/kidehen
 Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
 LinkedIn Profile: http://www.linkedin.com/in/kidehen
 Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this








Re: Profiles in Linked Data

2015-05-11 Thread Paul Houle
Why not just POST some kind of RDF document (or JSON-LD) that describes
what it is you want and in what format, or, if you really have to use GET,
stuff it in a GET field and hope it's not too big?
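
The posted document could be as small as this sketch (the req: vocabulary
is invented, just to show the shape of the idea):

@prefix req: <http://example.org/request#> .

[] req:describes  <http://dbpedia.org/resource/Analytics> ;
   req:vocabulary <http://schema.org/> ;
   req:mediaType  "text/turtle" .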

On Mon, May 11, 2015 at 12:07 PM, John Walker john.wal...@semaku.com
wrote:

  Hi Lars,

  On May 11, 2015 at 5:39 PM Svensson, Lars l.svens...@dnb.de wrote:
  
   I note in the JSON-LD spec it is stated A profile does not change the
 semantics
   of the resource representation when processed without profile
 knowledge, so
   that clients both with and without knowledge of a profiled resource
 can safely
   use the same representation, which would no longer hold true if the
 profile
   parameter were used to negotiate which vocabulary/shape is used.
 
  Yes, I noted that text in RFC 6906, too, but assumed that unchanged
 semantics of the resource meant that both representations still describe
 the same thing (which they do in my case). Would a change in description
 vocabulary really mean that I change the semantics of the description?

 If it is exactly the same information in both representations (but using a
 different vocabulary), then you could argue the semantics are not changed.
 However I would expect that one representation would contain more/less
 information that another and that each vocabulary might have different
 inference rules, so indeed then semantics would differ.

 
  If so, I'd be happy to call it not a profile, but a shape instead
 (thus adopting the vocabulary of RDF data shapes).

 I don't mind what term we use, so long as it is clear to all concerned
 what is meant by that term :)

 
  Best,
 
  Lars

 John






Re: Profiles in Linked Data

2015-05-11 Thread Paul Houle
[...] rules to other content types (e.g., those related to images, sound,
and video, etc.).



  Link:
 
 http://mementoarchive.lanl.gov/dbpedia/timegate/http://dbpedia.org/resour
 ce/Analytics; rel=timegate
 Location: http://dbpedia.org/data/Analytics.ttl
 Expires: Tue, 12 May 2015 16:01:06 GMT
 Cache-Control: max-age=604800

 Best,

 Lars



 --
 Regards,

 Kingsley Idehen
 Founder  CEO
 OpenLink Software
 Company Web: http://www.openlinksw.com
 Personal Weblog 1: http://kidehen.blogspot.com
 Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
 Twitter Profile: https://twitter.com/kidehen
 Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
 LinkedIn Profile: http://www.linkedin.com/in/kidehen
 Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this







Re: Profiles in Linked Data

2015-05-11 Thread Paul Houle
To contextualize this, I have a data transformation system that converts
common CSV, XML and JSON data to RDF and needs just a tiny amount of
configuration to work in most cases. It has a configuration file that looks
like this:

@prefix : <http://rdf.ontology2.com/csvConf/> .

[] :sourceFile <file:///data/fatca/FFIListFull.csv> ;
   :destinationFile <file:///data/fatca/FFI.ttl> ;
   :predicateNamespace
       <http://rdf.legalentityidentifier.info/FFIListFull/predicate/> ;
   :rowNamespace <http://rdf.legalentityidentifier.info/FFIListFull/row/> ;
   :keyField 1 .


Note that the only things that need to be configured, other than the input
and the output, are the namespace that predicates go into and the namespace
used for row identifiers, plus the position of the key field. If there is
no primary key, then the system uses blank nodes as row keys. It is obvious
how to extend the vocabulary so you can say something about the data types
of the fields, how to change the default mapping from fields to prefixes,
etc.
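
For the FFI list above the output is one resource per row in the row
namespace, with one predicate per column, roughly like this (column names
and values invented for illustration):

@prefix p:   <http://rdf.legalentityidentifier.info/FFIListFull/predicate/> .
@prefix row: <http://rdf.legalentityidentifier.info/FFIListFull/row/> .

row:FFI00001  p:GIIN       "FFI00001" ;
              p:LegalName  "Example Bank N.A." ;
              p:Country    "US" .

The key field supplies the local name of the row resource; without one, the
rows fall back to blank nodes as described above.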

This is probably not exactly what you want for a public endpoint,  but
the idea that you can configure a data transformation in RDF still
applies.  The main issue I see is that relatively small (or internal)
endpoints may want to allow people to configure arbitrary
transformations,  but a large or public endpoint may offer a limited
menu of transformations which could be precomputed or preapproved.



On Mon, May 11, 2015 at 1:05 PM, Svensson, Lars l.svens...@dnb.de wrote:

 Paul,

  Why not just POST some kind of RDF document (or JSON-LD) that describes
  what is you want and in what format,  or if you really have to use GET,

 Interesting thought. So what I would POST is essentially the profile/shape
 I want the data to conform to. Nonetheless my view of linked data is that
 it's using GET all the time, so I'd probably stick to that and instead
 serve my profile/shape under an http URI and only reference it from the GET
 request instead of POSTing it.

  stuff it
  in a GET field and hope it's not too big?

 What exactly do you mean with a GET field; is that the same thing as an
 http header?

 Best,

 Lars






Re: Vocabulary to describe software installation

2015-05-02 Thread Paul Houle
Could you RDFize the model used by packaging systems such as rpm?

My latest ideological kick has been realizing that converting data from
non-RDF sources to RDF is often trivial or close to trivial. This fact has
been obscured by things like the TWC data.gov project, which make it look a
lot harder than it really is.
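
For instance, the heart of an rpm-style package record maps onto triples
almost field for field. A sketch with a made-up pkg: vocabulary and
illustrative package names:

@prefix pkg: <http://example.org/package#> .

<http://example.org/packages/httpd-2.4.6-40.el7>
    a pkg:Package ;
    pkg:name     "httpd" ;
    pkg:version  "2.4.6" ;
    pkg:release  "40.el7" ;
    pkg:requires <http://example.org/packages/apr-1.4.8-3.el7> ,
                 <http://example.org/packages/apr-util-1.5.2-6.el7> .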

On Fri, May 1, 2015 at 10:22 AM, Jürgen Jakobitsch 
j.jakobit...@semantic-web.at wrote:

 hi,

 i'm investigating possibilities to describe an arbitrary software
 installation process
 in rdf. currently i've found two candidates [1][2], but examples are
 practically non-existent.
 has anyone done this before, are there somewhere real examples?

 any pointer greatly appreciated.

 wkr j

 [1] http://www.w3.org/2005/Incubator/ssn/ssnx/ssn#Module_Deployment
 [2]
 http://wiki.dublincore.org/index.php/User_Guide/Publishing_Metadata#dcterms:instructionalMethod

 | Jürgen Jakobitsch,
 | Software Developer
 | Semantic Web Company GmbH
 | Mariahilfer Straße 70 / Neubaugasse 1, Top 8
 | A - 1070 Wien, Austria
 | Mob +43 676 62 12 710 | Fax +43.1.402 12 35 - 22

 COMPANY INFORMATION
 | web   : http://www.semantic-web.at/
 | foaf  : http://company.semantic-web.at/person/juergen_jakobitsch
 PERSONAL INFORMATION
 | web   : http://www.turnguard.com
 | foaf  : http://www.turnguard.com/turnguard
 | g+: https://plus.google.com/111233759991616358206/posts
 | skype : jakobitsch-punkt
 | xmlns:tg  = http://www.turnguard.com/turnguard#;






Re: Best practices on how to publish SPARQL queries?

2015-04-27 Thread Paul Houle
One thing I have done is make a set of integration tests with JUnit,  so
the queries are embedded and you check that you get the right answers.

On Sun, Apr 26, 2015 at 4:14 PM, Neubert, Joachim j.neub...@zbw.eu wrote:

 Hi Niklas,

 Github (and similar services) offer a great platform to publish
 vocabularies and queries, particularly if they are evolving in sync and are
 backed with a corresponding endpoint. How we managed to complement this
 with a SPARQL-IDE, which allows people to experiment with queries and
 immediately see the results, is described here:

 http://zbw.eu/labs/en/blog/publishing-sparql-queries-live

 The approach is used extensively in the skos-history project (
 https://github.com/jneubert/skos-history).

 Cheers, Joachim

  -Original Message-
  From: Niklas Petersen [mailto:peter...@cs.uni-bonn.de]
  Sent: Sunday, April 26, 2015 1:01 PM
  To: semantic-...@w3.org; public-lod@w3.org
  Subject: Best practices on how to publish SPARQL queries?
 
  Hi all,
 
  I am currently developing a vocabulary which has typical queries
 related
  to it. I am wondering if there exist any best practices to publish them
  together with the vocabulary?
 
  The best practices on publishing Linked Data [1] only focuses on the
  endpoints, but not on the queries.
 
  Has anyone else been in that situation?
 
 
  Best regards,
  Niklas Petersen
 
  [1] http://www.w3.org/TR/ld-bp/#MACHINE
 
  --
  Niklas Petersen,
  Organized Knowledge Group @Fraunhofer IAIS, Enterprise Information
  Systems Group @University of Bonn.






Re: Looking for pedagogically useful data sets

2015-03-12 Thread Paul Houle
One of my favorite parts of this classic book

http://www.amazon.com/gp/product/0201517523/ref=as_li_tl?ie=UTF8camp=1789creative=390957creativeASIN=0201517523linkCode=as2tag=honeymediasys-20linkId=DA6EAFVQC6QUHS5F

is the explanation of why you will hit the wall with human-readable terms
in a large-scale knowledge base, plus the way human-readable terms confuse
people about what computers can actually understand about them.

That said, I got a lot of resistance from people when the first version of
:BaseKB used mid identifiers for everything, even though that gave me 100%
referential integrity and also meant I could use 32-bit ints for
identifiers in my code.

Thus, for large-scale systems you are going to have to deal with
predicates such as V96 and P1158, not to mention human-readable terms,
which are sometimes as bad or worse, and ultimately you need tooling to
help with that.

In terms of the problem in front of me, I have 10 minutes to say something
meaningful to people who (i) don't know anything about RDF and SPARQL and
(ii) aren't necessarily convinced of its value.  This is not so bad as it
seems because the talk is going to be pre-recorded, so it can be scripted,
edited and so forth.

Definitely there is not going to be any special tooling, visualization,
inference, or so forth.  The goal is to show that you can do the same
things you do with a relational database, and maybe *just* a little bit
more.  To pull this off I need to keep as many things as possible
'experience-near' (see http://tbips.blogspot.com/2011/01/experience-near.html).

Certainly FOAF data would be good for this;  it really is a matter of
finding some specific file that is easy to work with.

On Thu, Mar 12, 2015 at 3:54 AM, Sarven Capadisli i...@csarven.ca wrote:

 On 2015-03-12 00:13, Paul Houle wrote:

 Hello all,

I am looking for some RDF data sets to use in a short presentation
 on
 RDF and SPARQL.  I want to do a short demo,  and since RDF and SPARQL will
 be new to this audience,  I was hoping for something where the predicates
 would be easy to understand.

   I was hoping that the LOGD data from RPI/TWC would be suitable,  but
 once I found the old web site (the new one is down) and manually fixed the
 broken download link I found the predicates were like

 http://data-gov.tw.rpi.edu/vocab/p/1525/v96

 and the only documentation I could find for them (maybe I wasn't looking
 in
 the right place) was that this predicate has an rdf:label of V96.)

 Note that an alpha+numeric code is good enough for Wikidata and it is
 certainly concise,  but I don't want :v96 to be the first things that
 these
 people see.

 Something I like about this particular data set is that it is about 1
 million triples which is big enough to be interesting but also small
 enough
 that I can load it in a few seconds,  so that performance issues are not a
 distraction.

 The vocabulary in DBpedia is closer to what I want (and if I write the
 queries most of the distracting things about vocab are a non-issue) but
 then data quality issues are the distraction.

 So what I am looking for is something around 1 m triples in size (in terms
 of order-of-magnitude) and where there are no distractions due to obtuse
 vocabulary or data quality issues.  It would be exceptionally cool if
 there
 were two data sets that fit the bill and I could load them into the triple
 store together to demonstrate mashability

 Any suggestions?



 re: predicates would be easy to understand, whether the label is V96 or
 some molecule, needless to say, it takes some level of familiarity with the
 data.

 Perhaps something that's familiar to most people is Social Web data. I
 suggest looking at whatever is around VCard, FOAF, SIOC for instance. The
 giant portion in the LOD Cloud with the StatusNet nodes (in cyan) use FOAF
 and SIOC. (IIRC, unless GnuSocial is up to something else these days.)


 If statistical LD is of interest, check out whatever is under
 http://270a.info/ (follow the VoIDs to respective dataspaces). You can
 reach close to 10k datasets there, with varying sizes. I think the best bet
 for something small enough is to pick one from the
 http://worldbank.270a.info/ dataspace e.g., GDP, mortality, education..

 Or take an observation from somewhere, e.g:

 http://ecb.270a.info/dataset/EXR/Q/ARS/EUR/SP00/A/2000-Q2

 and follow-your-nose.

 You can also approach from a graph exploration POV, e.g:

 http://en.lodlive.it/?http://worldbank.270a.info/classification/country/CA

 or a visualization, e.g., Sparkline (along the lines of how it was
 suggested by Edward Tufte):

 http://stats.270a.info/sparkline

 (JavaScript inside SVG building itself by poking at the SPARQL endpoint)

 If you want to demonstrate what other type of things you can do with this
 data, consider something like:

 http://stats.270a.info/analysis/worldbank:SP.DYN.
 IMRT.IN/transparency:CPI2011/year:2011

 See also Oh Yeah? and so on..


 Any way... as a starting point

Looking for pedagogically useful data sets

2015-03-11 Thread Paul Houle
Hello all,

  I am looking for some RDF data sets to use in a short presentation on
RDF and SPARQL.  I want to do a short demo,  and since RDF and SPARQL will
be new to this audience,  I was hoping for something where the predicates
would be easy to understand.

 I was hoping that the LOGD data from RPI/TWC would be suitable,  but
once I found the old web site (the new one is down) and manually fixed the
broken download link I found the predicates were like

http://data-gov.tw.rpi.edu/vocab/p/1525/v96

and the only documentation I could find for them (maybe I wasn't looking in
the right place) was that this predicate has an rdfs:label of V96.

Note that an alphanumeric code is good enough for Wikidata and it is
certainly concise, but I don't want :v96 to be the first thing that these
people see.

Something I like about this particular data set is that it is about 1
million triples which is big enough to be interesting but also small enough
that I can load it in a few seconds,  so that performance issues are not a
distraction.

The vocabulary in DBpedia is closer to what I want (and if I write the
queries most of the distracting things about vocab are a non-issue) but
then data quality issues are the distraction.

So what I am looking for is something around 1M triples in size (in terms
of order of magnitude) where there are no distractions due to obtuse
vocabulary or data quality issues.  It would be exceptionally cool if there
were two data sets that fit the bill and I could load them into the triple
store together to demonstrate mashability.

Any suggestions?

-- 
Paul Houle
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Enterprise information system

2015-02-26 Thread Paul Houle

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Microsoft Access for RDF?

2015-02-20 Thread Paul Houle
So some thoughts here.

OWL,  so far as inference is concerned,  is a failure and it is time to
move on.  It is like RDF/XML.

As a way of documenting types and properties it is tolerable.  If I write
down something in production rules I can generally explain to an average
Joe what it means.  If I try to use OWL it is easy for a few things,
 hard for a few things,  then there are a few things Kendall Clark can do,
 and then there is a lot you just can't do.

On paper OWL has good scaling properties but in practice production rules
win because you can infer the things you care about and not have to
generate the large number of trivial or otherwise uninteresting conclusions
you get from OWL.
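
To make "production rules" concrete in an RDF setting, here is the kind of
thing I mean, sketched with Jena's rule engine (the ex: URIs are invented
for the example):

import org.apache.jena.rdf.model.InfModel;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.reasoner.rulesys.GenericRuleReasoner;
import org.apache.jena.reasoner.rulesys.Rule;

import java.util.List;

public class ProductionRuleDemo {
    public static void main(String[] args) {
        String ex = "http://example.org/";
        Model data = ModelFactory.createDefaultModel();
        Property parent = data.createProperty(ex, "parent");
        Resource alice = data.createResource(ex + "Alice");
        Resource bob = data.createResource(ex + "Bob");
        Resource carol = data.createResource(ex + "Carol");
        alice.addProperty(parent, bob);
        bob.addProperty(parent, carol);

        // One forward rule: you get exactly the conclusion you asked for
        // (grandparent) and none of the trivia a DL reasoner would add.
        List<Rule> rules = Rule.parseRules(
            "[grand: (?a <" + ex + "parent> ?b) (?b <" + ex + "parent> ?c) "
          + "-> (?a <" + ex + "grandparent> ?c)]");
        GenericRuleReasoner reasoner = new GenericRuleReasoner(rules);
        reasoner.setMode(GenericRuleReasoner.FORWARD);

        InfModel inf = ModelFactory.createInfModel(reasoner, data);
        inf.write(System.out, "TURTLE");
    }
}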

As a data integration language OWL points in an interesting direction but
it is insufficient in a number of ways.  For instance, it can't convert
data types (canonicalize mailto:j...@example.com and j...@example.com),
deal with trash dates (have you ever seen an enterprise system that didn't
have trash dates?) or convert units.  It also can't reject facts that don't
matter, and as far as both time/space and accuracy go, you do much better
if you can cook things down to the smallest correct database.



The other one is that as Kingsley points out,  the ordered collections do
need some real work to square the circle between the abstract graph
representation and things that are actually practical.

I am building an app right now where I call an API and get back chunks of
JSON which I cache,  and the primary scenario is that I look them up by
primary key and get back something with a 1:1 correspondence to what I
got.  Being able to do other kind of queries and such is sugar on top,  but
being able to reconstruct an original record,  ordered collections and all,
 is an absolute requirement.

So far my infovore framework based on Hadoop has avoided collections,
 containers and all that because these are not used in DBpedia and
Freebase,  at least not in the A-Box.  The simple representation that each
triple is a record does not work so well in this case because if I just
turn blank nodes into UUIDs and spray them across the cluster,  the act of
reconstituting a container would require an unbounded number of passes,
 which is no fun at all with Hadoop.  (At first I thought the # of passes
was the same as the length of the largest collection, but now that I think
about it I think I can do better than that.)  I don't feel so bad about most
recursive structures because I don't think they will get that deep but I
think LISP-Lists are evil at least when it comes to external memory and
modern memory hierarchies.


Re: Microsoft Access for RDF?

2015-02-20 Thread Paul Houle
Pat,

so far as "a corporation is a person" goes, that is what we have foaf:Agent
for.  A corporation can sign contracts and be an endpoint for communication
and payments the same as a person, so to model the world of law, business,
finance and so on, that is a very real thing.

   If you take that idea too literally, however, it conflicts with "a
person is an animal" in terms of physiology, but that too can be modelled.

Cristoph,

   the trouble with OWL is that things that almost work have a way of
displacing things that do work,  particularly in a community that has the
incentive structures that the semweb community has.  We have the problem of
a really bad rep in many circles,  I see people say stuff like this all the
time

http://lemire.me/blog/archives/2014/12/02/when-bad-ideas-will-not-die-from-classical-ai-to-linked-data/

and I have to admit that back in 2004 I was the guy who stood in the back
of the conference room and said "isn't this like the stuff they tried in
the '80s that didn't work?"  A lot of people believe that guff, and if you
combine that with the road rage of people who look for US states in DBpedia
and find that 3 of them got dropped on the floor, it can be very hard to
get taken seriously.

Lemire's unconstructive criticism displaces real criticism,  but that kind
of criticism could be displaced by constructive criticism about the
standards we have.

For instance,  I think RDF Data Shapes is a great idea but I needed it back
in 2007 and it just astonishing to me that it took so long for it to happen.

(Now I must admit I am most curious about why it is that standards for
rules interchange, i.e. the RuleML family, KIF, and a few others, have had
such a hard time getting going, whereas you find things like Drools, Blaze
Advisor, and iLog running many real world systems.)

On Fri, Feb 20, 2015 at 12:45 PM, Pat Hayes pha...@ihmc.us wrote:


 On Feb 20, 2015, at 2:42 AM, Michael Brunnbauer bru...@netestate.de
 wrote:

 
  Hello Paul,
 
  On Thu, Feb 19, 2015 at 09:19:06PM +0100, Michael Brunnbauer wrote:
  Another case is where there really is a total ordering.  For
 instance,  the
  authors of a scientific paper might get excited if you list them in the
  wrong order.  One weird old trick for this is RDF containers,  which
 are
  specified in the XMP dialect of Dublin Core
 
  How do you bring this in line with property rdfs:range datatype,
 especially
  property rdfs:range rdf:langString? I do not see a contradiction but
 this
  makes things quite ugly.
 
  How about all the SPARQL queries that assume a literal as object and
 not a RDF
  container?
 
  Another simpler example would be property rdfs:range foaf:Person.
  http://xmlns.com/foaf/spec/#term_Person says that Something is a
 Person if it
  is a person. How can an RDF container of several persons be a person?

 According the US Supreme Court a corporation is a person, so I would guess
 that a mere container would have no trouble geting past the censors.

 Pat

 
  If one can put a container where a container is not explicitly
 sanctioned by
  the semantics of the property, then I have missed something important.
 
  Regards,
 
  Michael Brunnbauer
 
  --
  ++  Michael Brunnbauer
  ++  netEstate GmbH
  ++  Geisenhausener Straße 11a
  ++  81379 München
  ++  Tel +49 89 32 19 77 80
  ++  Fax +49 89 32 19 77 89
  ++  E-Mail bru...@netestate.de
  ++  http://www.netestate.de/
  ++
  ++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
  ++  USt-IdNr. DE221033342
  ++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
  ++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel

 
 IHMC (850)434 8903 home
 40 South Alcaniz St.(850)202 4416   office
 Pensacola(850)202 4440   fax
 FL 32502  (850)291 0667   mobile (preferred)
 pha...@ihmc.us   http://www.ihmc.us/users/phayes









-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Microsoft Access for RDF?

2015-02-19 Thread Paul Houle
I think you're particularly concerned about the ordering of triples where
?s and ?p are the same?

?s ?p ?o1 , ?o2, ?o3, ?o4 .

right?

Sometimes people don't care,  sometimes they do.

One scenario is we are drawing a page about the creative works of Isaac
Asimov, the Kinks, or Jeff Bridges and want to see titles that everybody
recognizes at the top of the list.  In some of these cases there is a
subjective element in that there is no global total ordering or even a
partial ordering (see Arrow's Theorem), but it's fair to say more people
know the song "Lola" than "Ducks on the Wall".

With RDF* you can put weights on edges, which will support cases like that.

Another approach is to further specify the data model so you get behavior
like hashtables in PHP -- PHP hashtables support random access look up but
if you iterate over them things come out in the order you put them in.

Another case is where there really is a total ordering.  For instance,  the
authors of a scientific paper might get excited if you list them in the
wrong order.  One weird old trick for this is RDF containers,  which are
specified in the XMP dialect of Dublin Core


where you write something like

[ dcterms:creator ("Ralph Alpher" "Hans Bethe" "George Gamow") ] .

In their version you are using text and not some authority record but if
you just want to annotate the average business document it works.
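
If you want the container flavour instead (which is what the XMP
serialization of dc:creator actually uses), any toolkit will write it for
you; a sketch with Jena, with a made-up document URI:

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Seq;
import org.apache.jena.vocabulary.DC_11;

public class OrderedAuthors {
    public static void main(String[] args) {
        Model m = ModelFactory.createDefaultModel();
        Seq authors = m.createSeq();            // rdf:Seq keeps insertion order
        authors.add("Ralph Alpher");
        authors.add("Hans Bethe");
        authors.add("George Gamow");
        m.createResource("http://example.org/alpher-bethe-gamow-1948")
         .addProperty(DC_11.creator, authors);
        m.write(System.out, "RDF/XML-ABBREV");  // roughly the shape XMP expects
    }
}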

So yeah, I agree that ordering is a big issue in RDF;  you are coming at it
as a requirement for the editor, I see it as important if we want to use
RDF technology as a universal solvent for dealing with all the JSON, XML
and equivalently expressive documents and messages which support ordered
collections.  It's definitely a tooling issue but it does involve looking
at hard issues in the underlying data model.



On Thu, Feb 19, 2015 at 10:32 AM, Michael Brunnbauer bru...@netestate.de
wrote:


 Hello Paul,

 let me put this into two simple statements:

 1) There is no canonical ordering of triples

 2) A good triple editor should reflect this by letting the user determine
 the order

 Regards,

 Michael Brunnbauer

 On Thu, Feb 19, 2015 at 03:50:33PM +0100, Michael Brunnbauer wrote:
 
  Hello Paul,
 
  I am not so sure if this is good enough. If you add something to the end
 of a
  list in a UI, you normally expect it to stay there. If you accept that it
  will be put in its proper position later, you may - as user - still have
  trouble figuring out where that position is (even with the heuristics
 you gave).
 
  The problem repeats with the triple object if the properties have been
 ordered.
  As user, you might feel even more compelled to introduce a deviant
 ordering on
  this level.
 
  Regards,
 
  Michael Brunnbauer
 
  On Thu, Feb 19, 2015 at 09:07:37AM -0500, Paul Houle wrote:
   There are quite a few simple heuristics that will give good enough
   results,  consider for instance:
  
   (1) order predicates by alphabetical order (by rdfs:label or by
 localname
   or the whole URL)
   (2) order predicates by some numerical property given by a custom
 predicate
   in the schema
   (3) order predicates by the type of the domain alphabetically, and then
   order by the name of the predicates
   (4) work out the partial ordering of types by inheritance so Person
 winds
   up at the top and Actor shows up below that
  
   Freebase does something like (4) and that is good enough.
  
   On Thu, Feb 19, 2015 at 8:01 AM, Kingsley Idehen 
 kide...@openlinksw.com
   wrote:
  
On 2/19/15 4:52 AM, Michael Brunnbauer wrote:
   
Hello Paul,
   
an interesting aspect of such a system is the ordering of triples -
 even
if you restrict editing to one subject. Either the order is
 predefined
and the
user will have to search for his new triple after doing an insert
 or the
user
determines the position of his new triple.
   
In the latter case, the app developer will want to use something
 like
reification - at least internally. This is the point when the app
developer
and the Semantic Web expert start to disagree ;-)
   
   
Not really, in regards to Semantic Web expert starting to disagree
 per
se. You can order by Predicate or use Reification.
   
When designing our RDF Editor, we took the route of breaking things
 down
as follows:
   
Book (Named Graph Collection e.g. in a Quad Store or service that
understands LDP Containers etc..)  -- (contains) -- Pages (Named
 Graphs)
-- Paragraphs (RDF Sentence/Statement Collections).
   
The Sentence/Statement Collections are the key item, you are honing
 into,
and yes, it boils down to:
   
1. Grouping sentences/statements by predicate per named graph to
 create a
paragraph
2. Grouping sentences by way of reification where each sentence is
identified and described per named graph.
   
Rather that pit one approach against the other, we simply adopted
 both, as
options.
   
Anyway, you raise a very

Re: Microsoft Access for RDF?

2015-02-19 Thread Paul Houle
There are quite a few simple heuristics that will give good enough
results,  consider for instance:

(1) order predicates by alphabetical order (by rdfs:label or by localname
or the whole URL)
(2) order predicates by some numerical property given by a custom predicate
in the schema
(3) order predicates by the type of the domain alphabetically, and then
order by the name of the predicates
(4) work out the partial ordering of types by inheritance so Person winds
up at the top and Actor shows up below that

Freebase does something like (4) and that is good enough.
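
Heuristic (1) is a one-liner in most toolkits; a sketch with Jena (the
model and subject URI are whatever the editor happens to be showing):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;

import java.util.Comparator;
import java.util.List;

public class PredicateOrdering {
    // Return the statements about one subject, sorted by predicate local name.
    public static List<Statement> orderedView(Model model, String subjectUri) {
        List<Statement> stmts = model.listStatements(
                model.createResource(subjectUri), null, (RDFNode) null).toList();
        stmts.sort(Comparator.comparing(s -> s.getPredicate().getLocalName()));
        return stmts;
    }
}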

On Thu, Feb 19, 2015 at 8:01 AM, Kingsley Idehen kide...@openlinksw.com
wrote:

 On 2/19/15 4:52 AM, Michael Brunnbauer wrote:

 Hello Paul,

 an interesting aspect of such a system is the ordering of triples - even
 if you restrict editing to one subject. Either the order is predefined
 and the
 user will have to search for his new triple after doing an insert or the
 user
 determines the position of his new triple.

 In the latter case, the app developer will want to use something like
 reification - at least internally. This is the point when the app
 developer
 and the Semantic Web expert start to disagree ;-)


 Not really, in regards to Semantic Web expert starting to disagree per
 se. You can order by Predicate or use Reification.

 When designing our RDF Editor, we took the route of breaking things down
 as follows:

 Book (Named Graph Collection e.g. in a Quad Store or service that
 understands LDP Containers etc..)  -- (contains) -- Pages (Named Graphs)
 -- Paragraphs (RDF Sentence/Statement Collections).

 The Sentence/Statement Collections are the key item, you are honing into,
 and yes, it boils down to:

 1. Grouping sentences/statements by predicate per named graph to create a
 paragraph
 2. Grouping sentences by way of reification where each sentence is
 identified and described per named graph.

 Rather that pit one approach against the other, we simply adopted both, as
 options.

 Anyway, you raise a very important point that's generally overlooked.
 Ignoring this fundamental point is a shortcut to hell for any editor that's
 to be used in a multi-user setup, as you clearly understand :)


 Kingsley


 Maybe they can compromise on a system with a separate named graph per
 triple
 (BTW what is the status of blank nodes shared between named graphs?).

 Regards,

 Michael Brunnbauer

 On Wed, Feb 18, 2015 at 03:08:33PM -0500, Paul Houle wrote:

 I am looking at some cases where I have databases that are similar to
 Dbpedia and Freebase in character,  sometimes that big (ok,  those
 particular databases),   sometimes smaller.  Right now there are no blank
 nodes,  perhaps there are things like the compound value types from
 Freebase which are sorta like blank nodes but they have names,

 Sometimes I want to manually edit a few records.  Perhaps I want to
 delete
 a triple or add a few triples (possibly introducing a new subject.)

 It seems to me there could be some kind of system which points at a
 SPARQL
 protocol endpoint (so I can keep my data in my favorite triple store) and
 given an RDFS or OWL schema,  automatically generates the forms so I can
 easily edit the data.

 Is there something out there?

 --
 Paul Houle
 Expert on Freebase, DBpedia, Hadoop and RDF
 (607) 539 6254paul.houle on Skype   ontolo...@gmail.com
 http://legalentityidentifier.info/lei/lookup



 --
 Regards,

 Kingsley Idehen
 Founder  CEO
 OpenLink Software
 Company Web: http://www.openlinksw.com
 Personal Weblog 1: http://kidehen.blogspot.com
 Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
 Twitter Profile: https://twitter.com/kidehen
 Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
 LinkedIn Profile: http://www.linkedin.com/in/kidehen
 Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Microsoft Access for RDF?

2015-02-18 Thread Paul Houle
I am looking at some cases where I have databases that are similar to
DBpedia and Freebase in character, sometimes that big (ok, those
particular databases), sometimes smaller.  Right now there are no blank
nodes, though perhaps there are things like the compound value types from
Freebase which are sorta like blank nodes but have names.

Sometimes I want to manually edit a few records.  Perhaps I want to delete
a triple or add a few triples (possibly introducing a new subject.)

It seems to me there could be some kind of system which points at a SPARQL
protocol endpoint (so I can keep my data in my favorite triple store) and
given an RDFS or OWL schema,  automatically generates the forms so I can
easily edit the data.
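
To sketch the idea (this is just an illustration of the kind of query such
a tool would run, assuming the schema declares plain rdfs:domain and
rdfs:range and nothing fancier):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class FormFields {
    // List candidate form fields for a class: properties that declare the
    // class as their rdfs:domain, with the declared range when there is one.
    public static void printFields(String endpoint, String classUri) {
        String q =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> "
          + "SELECT ?p ?range WHERE { "
          + "  ?p rdfs:domain <" + classUri + "> . "
          + "  OPTIONAL { ?p rdfs:range ?range } }";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, q)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution s = rs.next();
                System.out.println(s.getResource("p") + " -> "
                    + (s.contains("range") ? s.get("range") : "any literal or resource"));
            }
        }
    }
}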

Is there something out there?

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Microsoft Access for RDF?

2015-02-18 Thread Paul Houle
Well here is my user story.

I am looking at a page that looks like this

http://dbpedia.org/page/Albert_Einstein

it drives me up the wall that the following facts are in there:

:Albert_Einstein
   dbpedia-owl:childOf :EinsteinFamily ;
   dbpedia-owl:parentOf :EinsteinFamily .

which is just awful in a whole lot of ways.  Case (1) is that I click an X
next to those two triples and they are gone;  Case (2) is that I can create
new records to fill in his family tree, and that will involve many other
user stories such as (3) "user creates a literal field" and so forth.

Freebase and Wikidata have good enough user interfaces that revolve
around entities,  see

https://www.freebase.com/m/0jcx  (while you still can)
http://www.wikidata.org/wiki/Q937

but neither of those is RDF-centric.  It seems to me that an alternative
semantics could be defined for RDF and OWL that would work like so:

* we say :Albert_Einstein is a :Person
* we then see some form with fields people can fill,  or alternately there
is a dropdown list with predicates that have this as a known domain;  the
range can also be used backwards so that we expect a langstring or integer
or link to another :Person

It's important that this be a tool with which somebody who knows little
about RDF can enter and edit data with a little bit of task-oriented (as
opposed to concept-oriented) training.

The idea here is that the structures and vocabulary are constrained so that
the structures are not complex;  both DBpedia and Freebase are so
constrained.  You might want to say things like

[ a   sp:Ask ;
rdfs:comment "must be at least 18 years old"^^xsd:string ;
sp:where ([ sp:object sp:_age ;
sp:predicate my:age ;
sp:subject spin:_this
  ] [ a   sp:Filter ;
sp:expression
[ sp:arg1 sp:_age ;
  sp:arg2 18 ;
  a sp:lt
]
  ])
]


and that is cool, but I have no idea how to make that simple for a muggle
to use, and I'm interested in these things that are similar in character to
a relational database, so I'd say that is out of scope for now.  I think
this tool could probably edit RDFS schemas (treating them as instance data)
but not be able to edit OWL schemas (if you need that, use an OWL editor).

Now if I was really trying to construct family trees I'd have to address
cycles like that with some algorithms and heuristics, because it would
probably take a long time to pluck them out by hand;  but some things
you'll want to edit by hand, and that process will be easier if you are
working with a smaller data set, which you can easily find.

If you have decent type data,  as does Freebase,  it is not hard to pick
out pieces of the WikiWorld,  such as ski areas or navy ships and the
project of improving that kind of database with hand tools is much more
tractable.

For small projects you don't need access controls, provenance and that
kind of thing, but if you were trying to run something like Freebase or
Wikidata, where you know what the algebra is, the obvious thing to do is
use RDF* and SPARQL*.


Re: Microsoft Access for RDF?

2015-02-18 Thread Paul Houle
Yes,  there is the general project of capturing 100% of critical
information in documents and that is a wider problem than the large amount
of Linked Data which is basically RelationalDatabase++.

Note that a lot of data in this world is in spreadsheets (like relational
tables but often with less discipline) and in formats like XML and JSON
that are object-relational in the sense that a column can contain either a
list or a set of rows.

Even before we tackle the problem of representing the meaning of written
language (particularly when it comes to policies, standards documents,
regulations, etc. as opposed to Finnegans Wake or "Mary had a little
lamb") there is the slightly easier problem of understanding all the
semi-structured data out there.

Frankly I think the Bible gets things all wrong at the very beginning with
"In the beginning there was the Word..." because in the beginning we were
animals and then we developed the language instinct, which is probably a
derangement in our ability to reason about uncertainty that reduces the
sample space for learning grammar.

Often people suggest that animals are morally inferior to humans, and I
think the Bible has a point in that at some level we screw it up, because
animals don't do the destructive behaviors that seem so characteristic of
our species and that are often tied up with our ability to use language to
construct counterfactuals, such as the very idea that there is some book
that has all the answers -- because once somebody does that, somebody else
can write a different book and say the same thing, and then bam, you are
living after the postmodern heat death, even a few thousand years before
the Greeks.

(Put simply:  you will find that horses,  goats,  cows,  dogs,  cats and
other domesticated animals consistently express pleasure when you come to
feed them,  which reinforces their feeding.  A human child might bitch that
they aren't getting what they want,  which does not reinforce parental
feeding behavior.  Call it moral or social reasoning or whatever but when
it comes to maximizing one's utility function,  animals do a good job of
it.  The only reason I'm afraid to help an injured raccoon is that it might
have rabies.)

Maybe the logical problems you get as you try to model language have
something to do with human nature,  but the language instinct is a
peripheral of an animal and it can't be modeled without modeling the animal.

There is a huge literature of first order logic, temporal logic, modal
logic and other systems that capture more of what is in language, and the
question of what comes after RDF is interesting;  the ISO Common Logic
idea that we go back to the predicate calculus and just let people make
statements with arity > 2 in a way that expands RDF is great, but we really
need ISO Common Logic* based on what we know now.  Also there is no
conceptual problem in introducing arity > 2 in SPARQL so we should just do
it -- why convert relational database tables to triples and add the
complexity when we can just treat them as tuples under the SPARQL algebra?

Anyway there is a long way to go in this direction and I have thought about
it deeply because I have stared into the commercialization valley of death
for so long, but I think an RDF editor for Linked Data as we know it,
which is more animal in its nature than human, is tractable and useful and
maybe a step towards the next thing.






On Wed, Feb 18, 2015 at 6:28 PM, Gannon Dick gannon_d...@yahoo.com wrote:

 Hi Paul,

 I'm detecting a snippy disturbance in the Linked Open Data Force :)

 The text edit problem resides in the nature of SQL type queries vs. SPARQL
 type queries.  It's not in the data exactly, but rather in the processing
 (name:value pairs).  To obtain RDF from data in columns you want to do a
 parity shift rather than a polarity shift.

 Given the statement:

 Mad Dogs and Englishmen go out in the midday sun

 (parity shift) Australians are Down Under Englishmen and just as crazy.
 (polarity shift) Australians are negative Englishmen, differently crazy.

 Mad Dogs ? Well, that's another Subject.

 The point is, editing triples is not really any easier than editing
 columns, but it sometimes looks dangerously easy.

 -Gannon

 [1]  'Air and water are good, and the people are devout enough, but the
 food is very bad,' Kim growled; 'and we walk as though we were mad--or
 English. It freezes at night, too.'
 --  Kim by Rudyard Kipling (Joseph Rudyard Kipling (1865-1936)), Chapter
 XIII, Copyright 1900,1901
 
 On Wed, 2/18/15, Paul Houle ontolo...@gmail.com wrote:

  Subject: Microsoft Access for RDF?
  To: Linked Data community public-lod@w3.org
  Date: Wednesday, February 18, 2015, 2:08 PM

  I am looking at some
  cases where I have databases that are similar to Dbpedia and
  Freebase in character,  sometimes that big (ok,  those
  particular databases),   sometimes smaller.  Right now
  there are no blank nodes,  perhaps

Re: How do you explore a SPARQL Endpoint?

2015-01-24 Thread Paul Houle
Just to give an example of what the situation is, consider the profile of
Dublin Core which is specified on page 32 of the following document

http://www.adobe.com/content/dam/Adobe/en/devnet/xmp/pdfs/XMPSpecificationPart1.pdf

which quotes

The XMP data modelling of these is consistent with apparent Dublin Core
intent, but specific to XMP.
As a corollary of the data modelling, the RDF serialization of Dublin Core
in XMP might not exactly match other RDF usage of Dublin Core element set.
XMP does not “include Dublin Core” in any fuller sense.

I'd agree with them that the DC vocabulary at that time was incomplete and
failed to address a number of critical issues, such as what to do when a
work has multiple authors (people care about that).  I can pick many nits,
but the XMP profile would do what most people would want to do.  It does
not compete with MARC for bibliographic supremacy, but would it be good for
the average dogpile of documents?  Yes it would.

Other people have their own answers to what is unclear in DC, so it's fair
to say that a triple store full of "Dublin Core vocabulary" could have a
lot of stuff in it.  For instance, some people might use strings and
others might use VIAF identifiers or links to DBpedia or whatever.

If you're aggregating at all you will run into this, and you really need a
systematic way of solving the puzzle one bit at a time.
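
The sort of probe I mean, sketched against a local endpoint (the endpoint
URL is an assumption):

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class CreatorProbe {
    public static void main(String[] args) {
        String endpoint = "http://localhost:3030/ds/sparql";   // assumed endpoint
        // For one predicate, count how often the object is a literal vs. a URI.
        String q =
            "PREFIX dc: <http://purl.org/dc/elements/1.1/> "
          + "SELECT ?isLiteral (COUNT(*) AS ?n) "
          + "WHERE { ?s dc:creator ?o . BIND(isLiteral(?o) AS ?isLiteral) } "
          + "GROUP BY ?isLiteral";
        try (QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, q)) {
            ResultSet rs = qe.execSelect();
            while (rs.hasNext()) {
                QuerySolution s = rs.next();
                System.out.println(s.get("isLiteral") + "\t" + s.get("n"));
            }
        }
    }
}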


Re: linked open data and PDF

2015-01-23 Thread Paul Houle
I don't think that is true;  I see this example in the XMP spec

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:xmp="http://ns.adobe.com/xap/1.0/">
  <rdf:Description rdf:about="">
    <xmp:BaseURL rdf:resource="http://www.adobe.com/"/>
  </rdf:Description>
</rdf:RDF>

Isn't this the same as

 <> xmp:BaseURL <http://www.adobe.com/> . ?

On Fri, Jan 23, 2015 at 4:45 AM, Michael Brunnbauer bru...@netestate.de
wrote:


 Hello Larry,

 On Tue, Jan 20, 2015 at 05:28:38PM +, Larry Masinter wrote:
  Image formats like JPEG and PNG (for which there
  is support for XMP) don't have a standard, uniform
  way of attaching other files, though, so allowing
  data (or a pointer to external data) in the XMP
  would broaden the applicability.

 I am right that such a pointer to external data would have to be a literal?

 The way I read the XMP standard is that only literals, blank nodes, rdf:Bag
 and rdf:Seq are allowed as object of a triple. This would rule out standard
 options like:

   this_document rdfs:seealso rdf_document_about_pdf_document
   this_document owl:sameAs pdf_document_uri

 Which would enable agents to find and regognize RDF data about the
 document.

 Regards,

 Michael Brunnbauer

 --
 ++  Michael Brunnbauer
 ++  netEstate GmbH
 ++  Geisenhausener Straße 11a
 ++  81379 München
 ++  Tel +49 89 32 19 77 80
 ++  Fax +49 89 32 19 77 89
 ++  E-Mail bru...@netestate.de
 ++  http://www.netestate.de/
 ++
 ++  Sitz: München, HRB Nr.142452 (Handelsregister B München)
 ++  USt-IdNr. DE221033342
 ++  Geschäftsführer: Michael Brunnbauer, Franz Brunnbauer
 ++  Prokurist: Dipl. Kfm. (Univ.) Markus Hendel




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: linked open data and PDF

2015-01-21 Thread Paul Houle
  I think the world needs a survey of XMP metadata in the field.  Only
by inspection of a large set of diverse files can we say how good or bad
the situation actually is.

  There ought to be a tool that gives XMP-annotated documents a point
score for metadata quality;  you ought to get a lot of points for having
the simple things that were missing in the document exported from Word,
like the title, author, copyright, etc.

  Note it is not just about PDF but many kinds of media files that are
tagged with this,  so it really is about XMP,  not just PDF.
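
The scoring itself would be trivial once the packet is in a model; a sketch
of the rubric (the choice of fields and the flat one-point weighting are
arbitrary):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.vocabulary.DC_11;

public class XmpScore {
    // One point per basic Dublin Core field that is actually present.
    public static int score(Model xmp) {
        Property[] wanted = { DC_11.title, DC_11.creator, DC_11.rights,
                              DC_11.description, DC_11.date };
        int points = 0;
        for (Property p : wanted) {
            if (xmp.listStatements(null, p, (RDFNode) null).hasNext()) {
                points++;
            }
        }
        return points;
    }
}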

On Wed, Jan 21, 2015 at 11:01 AM, Norman Gray nor...@astro.gla.ac.uk
wrote:


 Greetings.

  On 2015 Jan 20, at 14:42, Herbert Van de Sompel hvds...@gmail.com
 wrote:
 
  Larry
  How about HTTP Link headers (RFC 5988) to convey links and metadata
 expressed as links when serving PDFs? I can imagine an authoring tool
 embedding the info in XMP. But I have a harder time imagining a consumer
 application that would want to read the info via XMP.
 
  I don't: Bibliographic managers for PDFs could make use of XMP
 metadata. Imagine never typing another citation again!
 
  I thought bibliographic managers already did. If I remember correctly,
 some STM publishers did stuff metadata in XMP containers at one point. And
 I would have thought that bibliographic managers would then have used that.
 It would be telling if they didn't.

 And similarly I can imagine a PDF viewer -- or MP3 player, or web browser
 processing images -- looking in an XMP packet to find title and licence
 information, or a MP3-managing application using metadata found there to
 cluster things based on publisher, country, licence again, and so on; or
 find me all the content with a licence I can mash up  Metadata good!

 There's a vicious circle, though.  Developers aren't aware of RDF in
 general and XMP in particular, so don't look for information there.  If
 they do, they don't know how to parse it (XMP is a profile of RDF/XML, so
 it's easy to read but a bit of a pain in the neck to write (though librdf's
 Raptor can do it http://librdf.org/raptor/)).  If they look for
 parsers, then they're likely to find one of the bg SemWeb frameworks
 rather than a little parsing library they can drop into their application.
 So they give up, even if they get that far.  And because (approximately)
 no-one reads this stuff, no-one writes it, and because no-one writes it,
 no-one forces developers to push through the pain and read it.

 What would break that deadlock would be (i) a killer tool depending on
 XMP, which makes its users nag content producers to include the stuff, or
 (ii) content producers routinely making the stuff available, in the hope
 that (i) turns up.  Hmm: I'm not holding my breath.

 Your other proposal:

  But, anyhow, I admit my above sentence wasn't all that clear. I was
 really expressing my doubt that an on web Linked Data aggregating tool
 would start doing something special when encountering a PDF: pulling it
 across and then check whether anything interesting was in an XMP
 container.  In my below proposal, an HTTP HEAD (needed anyhow to figure out
 whether a resource is a PDF) would suffice to obtain the links if they were
 provided in the HTTP Link header. Using IANA-registered relation types,
 those links would end up being a bit generic but they would be readily
 available. And rather easily transformable into RDF.

 ...is interesting, but I don't think you necessarily need to involve the
 web to make an interesting scenario.  It's an odd thing to enthuse about on
 a semweb list, but the nice thing about embedded XMP is ... that it's
 embedded, so it can't get lost, and no ConNeg agony is involved in its
 extraction!

 All the best,

 Norman


 --
 Norman Gray  :  http://nxg.me.uk
 SUPA School of Physics and Astronomy, University of Glasgow, UK





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: linked open data and PDF

2015-01-21 Thread Paul Houle
You should be able to pipe the InputStream that comes out of a PDF file w/
PDFBox into Jena or some other RDF toolset.  A much more challenging issue
is developing a serializer which will take some triples from Jena (or
another toolset) and make sure 100% that it will validate as XMP.
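
Here is a sketch of the pipe I have in mind (PDFBox plus Jena; the crude
string slicing that strips the xpacket/x:xmpmeta wrapper is an assumption
to keep the sketch self-contained, not how a finished tool should do it):

import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.common.PDMetadata;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class XmpToJena {
    public static Model readXmp(File pdf) throws Exception {
        try (PDDocument doc = PDDocument.load(pdf)) {
            Model model = ModelFactory.createDefaultModel();
            PDMetadata meta = doc.getDocumentCatalog().getMetadata();
            if (meta == null) {
                return model;                        // no XMP packet at all
            }
            String packet = new String(readAll(meta.createInputStream()),
                                       StandardCharsets.UTF_8);
            // Cut the rdf:RDF element out of the xpacket wrapper before parsing.
            int start = packet.indexOf("<rdf:RDF");
            int end = packet.indexOf("</rdf:RDF>");
            if (start >= 0 && end > start) {
                String rdfXml = packet.substring(start, end + "</rdf:RDF>".length());
                model.read(new ByteArrayInputStream(
                        rdfXml.getBytes(StandardCharsets.UTF_8)), null);  // RDF/XML
            }
            return model;
        }
    }

    private static byte[] readAll(InputStream in) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }
}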

I'm thinking of a user story that I have a large number of PDF documents on
my personal computer and several mobile devices.  I read standards
documents the way I used to read science fiction as a kid.  I read great
classics of science fiction in PDF,  get bank statements in PDF,  send
marketing material to customers as PDF,  I want to be on top of my PDF.

I think the same is true for many foaf:Agents.

One issue I see is what it is all rdf:about -- that is, what the empty
rdf:about="" refers to.  In the context of a single document this is not a
problem, but a simple linking scenario would be to scan a collection of PDF
documents and get all of the RDF into a triple store.

If I look at something that's a bad smell to me, it is the use of a
non-standard date format, in particular something derived from

http://www.w3.org/TR/NOTE-datetime

which, like xsd:dateTime, is a derivative of the ISO 8601 standard, which
(like IEEE 754) has never been read by anybody.  Unlike xsd:dateTime this
allows dropping components off the RHS, but at least it doesn't allow the
fifth digit in the year that xsd:dateTime does.

XML Schema does have proper types for dates and times and years and
similar entities, but the SPARQL standard does not implement an algebra
that handles non-exact datetimes properly.  (It should, but there are lots
of issues, such as the fact that this is an algebra of intervals, so there
is not always a total ordering.)

Another thing that bugs me is the text lists in the specification for
values that are enumerated in the text files.  For instance, looking at the
descriptions of channel formats for sound, I see there is an "Other" in the
recent specification but no way to specify any of the three competing
formats that support a height channel, or that surround sound is likely to
migrate to being object-based rather than channel-based.  (Never mind the
4.0 quad mixes from the 1970s.)  It would be nice to see some way to keep
this up to date.

And about "ISO Standard":  Adobe's messaging ought to be very clear that
you can download these standards for free, because that's not usually true
of ISO standards.  We all need to get revenue, but when official standards
are not freely available to end users they don't end up getting used
properly.



On Mon, Jan 19, 2015 at 5:35 PM, Martynas Jusevičius marty...@graphity.org
wrote:

 PDFBox includes metadata API, but does not mention RDF:
 https://pdfbox.apache.org/1.8/cookbook/workingwithmetadata.html

 On Mon, Jan 19, 2015 at 11:31 PM, Martynas Jusevičius
 marty...@graphity.org wrote:
  Hey all,
 
  I think APIs for common languages like Java and C# to extract XMP RDF
  from PDF Files/Streams would be much more helpful than standalone
  tools such as Paul mentions.
 
  I've looked at Adobe PDF Library SDK but none of the features mention
 metadata:
  http://www.adobe.com/devnet/pdf/library.html
 
 
  Martynas
  graphityhq.com
 
  On Mon, Jan 19, 2015 at 11:24 PM, Paul Houle ontolo...@gmail.com
 wrote:
  I just used Acrobat Pro to look at the XMP metadata for a standards
 document
  (extra credit if you know which one) and saw something like this
 
 
 https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG
 
  in this particular case this is fine RDF,  just very little of it
 because
  nobody made an effort to fill it in.  The lack of a title is
 particularly
  annoying when I am reading this document at the gym because it gets
 lost in
  a maze of twisty filenames that all look the same,
 
  I looked at some financial statements and found that some were very well
  annotated and some not at all.  Acrobat Pro has a tool that outputs the
 data
  in RDF/XML;  I can't imagine it is hard to get this data out with third
  party tools in most cases.
 
 
  On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter masin...@adobe.com
 wrote:
 
  I just joined this list. I’m looking to help improve the story for
 Linked
  Open Data in PDF, to lift PDF (and other formats) from one-star to
 five,
  perhaps using XMP. I’ve found a few hints in the mailing list archive
 here.
  http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
  but I’m still looking. Any clues, problem statements, sample sites?
 
  Larry
  --
  http://larry.masinter.net
 
 
 
 
  --
  Paul Houle
  Expert on Freebase, DBpedia, Hadoop and RDF
  (607) 539 6254paul.houle on Skype   ontolo...@gmail.com
  http://legalentityidentifier.info/lei/lookup




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: linked open data and PDF

2015-01-20 Thread Paul Houle
I don't like the idea of unchecked expansion of formats because it puts an
open-ended burden on both Adobe and third party software developers.  If
there are specific problems with the current XMP format it makes sense to
transition to XMP2 but the more variation in formats the worse it will be
for the ecosystem.

On Mon, Jan 19, 2015 at 5:20 PM, Kingsley Idehen kide...@openlinksw.com
wrote:

 On 1/19/15 2:36 PM, Larry Masinter wrote:

 I just joined this list. I’m looking to help improve the story for Linked
 Open Data in PDF, to lift PDF (and other formats) from one-star to five,
 perhaps using XMP. I’ve found a few hints in the mailing list archive here.
 http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
 but I’m still looking. Any clues, problem statements, sample sites?

 Larry
 --
 http://larry.masinter.net


 Larry,

 Rather than only supporting an XML based notation (i.e., RDF/XML) for
 representing RDF triples, it would be nice if one could attain the same
 goal (embedded metadata in PDFs) using any RDF notation e.g., N-Triples,
 TURTLE, JSON-LD etc..

 The above would also include creating an XMP ontology that's represented
 in at least TURTLE and JSON-LD notations -- which do have strong usage
 across different RDF user  developer profiles.

 Naturally, Adobe apps that already process XMP simply need to leverage a
 transformation processor (built in-house or acquired) that converts
 metadata represented in TURTLE and JSON-LD to RDF/XML (which is what they
 currently supports).

 --
 Regards,

 Kingsley Idehen
 Founder  CEO
 OpenLink Software
 Company Web: http://www.openlinksw.com
 Personal Weblog 1: http://kidehen.blogspot.com
 Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
 Twitter Profile: https://twitter.com/kidehen
 Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
 LinkedIn Profile: http://www.linkedin.com/in/kidehen
 Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: linked open data and PDF

2015-01-19 Thread Paul Houle
I just used Acrobat Pro to look at the XMP metadata for a standards
document (extra credit if you know which one) and saw something like this

https://raw.githubusercontent.com/paulhoule/images/master/MetadataSample.PNG

in this particular case this is fine RDF, just very little of it because
nobody made an effort to fill it in.  The lack of a title is particularly
annoying when I am reading this document at the gym, because it gets lost
in a maze of twisty filenames that all look the same.

I looked at some financial statements and found that some were very well
annotated and some not at all.  Acrobat Pro has a tool that outputs the
data in RDF/XML;  I can't imagine it is hard to get this data out with
third party tools in most cases.


On Mon, Jan 19, 2015 at 2:36 PM, Larry Masinter masin...@adobe.com wrote:

 I just joined this list. I’m looking to help improve the story for Linked
 Open Data in PDF, to lift PDF (and other formats) from one-star to five,
 perhaps using XMP. I’ve found a few hints in the mailing list archive here.
 http://lists.w3.org/Archives/Public/public-lod/2014Oct/0169.html
 but I’m still looking. Any clues, problem statements, sample sites?

 Larry
 --
 http://larry.masinter.net




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Is SPIN still a valid direction?

2014-12-27 Thread Paul Houle
And speaking of which,  if there is a Jena implementation it also should be
mentioned at

http://spinrdf.org/

which currently shows three versions of the same product.  It's commonly
believed that you ought to quote three price points in your marketing
material;  however, the order on that site isn't even the correct order
(cheap to expensive), which leads people to think the medium product is
cheap.  Anyway, so long as this page shows only one vendor, there is a
dangerous perception that it is a single-vendor product, and I can tell you
some people just don't want to deal with such a thing.

On Thu, Dec 25, 2014 at 4:44 AM, lo...@pa.icar.cnr.it wrote:

 Great news Kingsley, are you allowed to share when this rule update will
 be public? Im using Virtuoso in my project and would be good to have SPIn
 support there instead than using Jena as wrapper...

 Merry xmas to the list!
  On 12/24/14 5:07 PM, Paul Houle wrote:
  Jerven,  I'd like to see the implementation list at
 
  http://spinrdf.org/
 
  updated to reflect the Allegrograph implementation and any others you
  can find.
 
  We are supporters, for the record. Implementation coming as part of a
  major rules related enhancement to Virtuoso.
 
  Seasons Greetings to All!
 
  --
  Regards,
 
  Kingsley Idehen
  Founder  CEO
  OpenLink Software
  Company Web: http://www.openlinksw.com
  Personal Weblog 1: http://kidehen.blogspot.com
  Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
  Twitter Profile: https://twitter.com/kidehen
  Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
  LinkedIn Profile: http://www.linkedin.com/in/kidehen
  Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this
 
 
 






-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Is SPIN still a valid direction?

2014-12-24 Thread Paul Houle
Jerven,  I'd like to see the implementation list at

http://spinrdf.org/

updated to reflect the Allegrograph implementation and any others you can
find.

On Tue, Dec 23, 2014 at 2:24 AM, Jerven Bolleman m...@jerven.eu wrote:

 Hi Antonino,

 SPIN is a very active and widely adopted (for a semantic web technology).
 Besides Topquadrant (the originators) and Alegrograph, I know of 3 more
 SPARQL vendors who have
 committed resources and started development work on implementing SPIN
 support, expect announcements
 around summer next year.
 And that is just among those I have contact with (which might be a biased
 sample ; but 75% of the sampled vendors)

 The W3C shapes workgroup is focussed on a sub part of the SPIN
 functionality.
 Validation and documentation, however, the SPIN supporters/inventors are
 very active in
 that workgroup and it is very likely SPIN (evolved) will be a large part
 of that standard.

 As you are interested in inferencing and open data workflows the W3C
 shapes workgroup work is
 not of interest to your core functionality. See e.g.
 http://www.w3.org/2014/data-shapes/wiki/Requirements

 In the end SPIN is translated before execution and can be used on SPARQL
 implementations that do
 not support SPIN yet. SPIN is not so widely used in research but has a
 significant number of paying
 clients and complicated projects behind it.

 I personally have used it on different projects and am very happy with the
 great and active support
 on the TopQuadrant mailing list. I would also love to work on a OpenRDF
 Sesame implementation, but just lack
 the time to do so. Once someone has implemented a SPARQL engine SPIN is a
 relatively simple technology. Just translate a RDF
 representation of the SPARQL query into the SPARQL algebra model inside
 your engine (if needed via a text representation).

 Hoping this is helpful,

 Regards,
 Jerven


 On 19 Dec 2014, at 18:33, Antonino Lo Bue lo...@pa.icar.cnr.it wrote:

  Hi everyone,
 
  I'm wondering if someone from the list could make a clear point on SPIN
  adoption and usage status. I'm planning to use it in my research work to
  model SPARQL inferencing on Open data-Linked open data workflows , but I
  have heard that something new is coming and would/could replace SPIN with
  a more flexible language.
  Is this the case and so I could risk to work with outdated and legacy
  stuff? Or do you encourage the adoption?
 
  Thanks and regards
 
  Antonino Lo Bue
  CNR-ICAR Palermo
  LinkedIn: http://www.linkedin.com/in/antoninolobue
 
 
 





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Microsoft OLE

2014-12-15 Thread Paul Houle
Most Windows programmers would instantiate OLE objects in the applications
and query them to get results;  commonly people write XML or JSON APIs,
 but writing RDF wouldn't be too different.

The next step up is to have a theory that converts OLE data structures to
and from RDF either in general or in a specific case with help from a
schema.  Microsoft invested a lot in making SOAP work well with OLE,  so
you might do best with a SOAP to RDF mapping.

This caught my eye though,  because I've been looking at the relationships
between RDF and OMG,  a distant outpost of standardization.  You can find
competitive products on the market,  one based on UML and another based on
RDF, OWL, SKOS and so forth.  The products do more or less the same thing,
 but described in such different language and vocabulary that it's hard to
believe that they compete for any sales.

There is lots of interesting stuff there, but the big theme is ISO Common
Logic, which adds higher-arity predicates and a foundation for inference
that people will actually want to use.
enterprise that first-order-logic is ready for the big time because banks
and larger corporations all use FOL-based systems on production rules to
automate decisions.



On Sat, Dec 13, 2014 at 7:30 AM, Hugh Glaser h...@glasers.org wrote:

 Anyone know of any work around exposing OLE linked objects as RDF?
 I could envisage a proxy that gave me URIs and metadata for embedded
 objects.

 Is that even a sensible question? :-)

 --
 Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
 Mobile: +44 75 9533 4155, Home: +44 23 8061 5652





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: type of http://dbpedia.org/page/Bachelor_of_Arts

2014-10-14 Thread Paul Houle
I think the real problem here is that you can't make one database that
satisfies everyone's requirements.

For instance if you want to build a system to do reasoning about academic
and professional credentials,  this is easiest to build on top of an
ontology (data structures) that is designed for the task and with data that
is curated for the task.

Like other databases,  there is some serious overlap with DBpedia (enough
that you could populate or enrich a credentials database from DBpedia),
but you're always going to fight with the notability requirement in
Wikipedia.  Sooner or later there will be a concept that is essential to
your scheme that is not there.

That's no reason not to hook up with DBpedia, but it is a reason to
recognize that it is not going to please everybody, that a big part of the
value is as an exchange language -- and that a DBpedia link is as good a
link to human documentation as it is to machine-readable data.


On Mon, Oct 13, 2014 at 10:33 AM, Valentina Presutti vpresu...@gmail.com
wrote:

 Hi Heiko,

 thanks for the prompt reply and the explanation.
 However, the interesting thing is that these entities are clearly used
 with more than one sense (at least in the US culture), so the issue comes
 from this fact originally in my opinion.
 I mentioned two cases here, but if you check you can see that all these
 types of entities (Degrees) have the same problem.

 My suggestion (if that can help) is to identify such metonym cases and
 have a special approach: having different entities as the number of senses.

 However, the Wikipedia page of such entities defines them as degrees…not
 sure if this can be useful to notice for you.

 Valentina

 On 13 Oct 2014, at 09:03, Heiko Paulheim he...@informatik.uni-mannheim.de
 wrote:

 Hi Valentina,

 (and CCing the DBpedia discussion list)

 this is an effect of the heuristic typing we employ in DBpedia [1]. It
 works correctly in many cases, and sometimes it fails - as for these
 examples (the classic tradeoff between coverage and precision).

 To briefly explain how the error comes into existence: we look at the
 distribution of types that occur for the ingoing properties of an untyped
 instance. For dbpedia:Bachelor_of_Arts, there are, among others, 208
 ingoing properties with the predicate dbpedia-owl:almaMater (which is
 already questionable). For that predicate, 87.6% of the objects are of type
 dbpedia-owl:University. So we have a strong pattern, with many supporting
 statements, and we conclude that dbpedia:Bachelor_of_Arts is a university.
 That mechanism, as I said, works reasonably well, but sometimes fails at
 single instances, like this one. For dbpedia:Academic_degree, you'll find
 similar questionable statements involving that instance, which mislead the
 heuristic typing algorithm.
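
 (A rough sketch of that distribution for the single predicate discussed,
 written as a query against the public DBpedia endpoint -- not the actual
 implementation, just an illustration:

   PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
   SELECT ?type (COUNT(*) AS ?n)
   WHERE {
     ?s dbpedia-owl:almaMater ?o .
     ?o a ?type .
   }
   GROUP BY ?type
   ORDER BY DESC(?n)

 The heavily dominant ?type among the objects is the one that gets assigned.)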

 With the 2014 release, we further tried to reduce errors like these by
 filtering common nouns using WordNet before assigning types to instances,
 but both Academic degree and Bachelor of Arts escaped our nets here :-(

 The public DBpedia endpoint loads both the infobox based types and the
 heuristic types. If you need a clean version, I advise you to set up a
 local endpoint and load only the infobox based types into it.

 Best,
 Heiko

 [1] http://www.heikopaulheim.com/documents/iswc2013.pdf




 Am 13.10.2014 02:42, schrieb Valentina Presutti:

 Dear all,

 I noticed that dbpedia:Bachelor_of_Arts
 (http://dbpedia.org/page/Bachelor_of_Arts), as well as other similar
 entities (dbpedia:Bachelor_of_Engineering, dbpedia:Bachelor_of_Science,
 etc.), is typed as dbpedia-owl:University.
 I would expect a type like “Academic Degree”, but if you look at
 dbpedia:Academic_Degree, its type is again dbpedia-owl:University.

 However, its definition is (according to dbpedia):

 An academic degree is a college or university diploma, often associated
 with a title and sometimes associated with an academic position, which is
 usually awarded in recognition of the recipient having either
 satisfactorily completed a prescribed course of study or having conducted a
 scholarly endeavour deemed worthy of his or her admission to the degree.
 The most common degrees awarded today are associate, bachelor's, master's,
 and doctoral degrees.”

 Showing that there are at least two different meanings associated with the
 term: college/university and title.
 I think that different meanings should be separated so as to allow
 applications to refer to the different entities: a university or a title.

 At least for me this causes errors in automatic relation extraction...

 Wdyt?

 Valentina


 --
 Prof. Dr. Heiko Paulheim
 Data and Web Science Group
 University of Mannheim
 Phone: +49 621 181 2646
 B6, 26, Room C1.08
 D-68159 Mannheim

 Mail: he...@informatik.uni-mannheim.de
 Web: www.heikopaulheim.com





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: scientific publishing process (was Re: Cost and access)

2014-10-06 Thread Paul Houle
 315, Carlsbad, CA. 92010*
 *Esperantolaan 4, Heverlee 3001, Belgium*
 http://www.atmire.com




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: Searching for references to a certain URI

2014-09-25 Thread Paul Houle
There is a hyperlink graph published here for the regular web

http://webdatacommons.org/hyperlinkgraph/index.html

it's a little big though.
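
For a single SPARQL endpoint you can at least ask for inbound links directly;
a minimal sketch, using Joachim's example URI:

  SELECT ?s ?p
  WHERE { ?s ?p <http://d-nb.info/gnd/120273152> }

but nothing short of a crawl or a Sindice-style index will answer that across
the whole LOD cloud.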

On Thu, Sep 25, 2014 at 4:59 AM, Neubert Joachim j.neub...@zbw.eu wrote:

  What strategies do you use to find all references to a certain URI, e.g.
 http://d-nb.info/gnd/120273152, on the (semantic) web?



 I used Sindice for this, but sadly the service is discontinued, and the
 data becomes more and more outdated. Google link:/info: prefixes don’t
 work, because highly relevant links on web pages (e.g. from
 https://en.wikipedia.org/wiki/Horst_Siebert) are excluded by rel=nofollow
 links, and pure RDF links (e.g. from dbpedia) don’t show up at all.



 Cheers, Joachim




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: A Distributed Economy -- A blog involving Linked Data

2014-09-20 Thread Paul Houle
Why don't we just reorganize RDF to look like the predicate calculus,  allow
arity > 2,  and then say it is something new so we can escape the RDF
name.

In fact,  let's just call it ISO Common Logic,

http://en.wikipedia.org/wiki/Common_logic

Most of the cool kids weren't around in the 1980s so they don't have the
bad taste in their mouth left by the predicate calculus.

I don't see any problem with extending SPARQL to arity > 2 facts,  and we can
let OWL dry up and blow away.
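
Today an arity-4 fact like married(Alice, Bob, 2001, 2009) has to be broken
into a node plus binary properties;  a sketch with a made-up ex: vocabulary:

  @prefix ex:  <http://example.org/> .
  @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

  ex:m1 a ex:Marriage ;
        ex:partner ex:Alice , ex:Bob ;
        ex:from "2001"^^xsd:gYear ;
        ex:to   "2009"^^xsd:gYear .

In Common Logic you would just write the 4-ary relation down directly.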

On Sat, Sep 20, 2014 at 10:47 AM, Gannon Dick gannon_d...@yahoo.com wrote:

 +1, nicely put Kingsley
 --Gannon
 
 On Fri, 9/19/14, Kingsley Idehen kide...@openlinksw.com wrote:

  Subject: Re: A Distributed Economy -- A blog involving Linked Data
  To: public-lod@w3.org
  Date: Friday, September 19, 2014, 5:48 PM


  On 9/19/14 4:38 PM, Brent Shambaugh wrote:

   Manu Sporny's post titled Building Linked Data into the Core of the Web
   [http://lists.w3.org/Archives/Public/public-webpayments/2014Sep/0063.html]
   led to the question: is linked data and semantic web tech useful? I think
   so. I can only speak from my own perspective and experience.

  Brent,

  Pasting my reply to Manu (with some editing) here, as I think it's important:

  Manu,

  It is misleading (albeit inadvertent in regards to your post above) to infer
  that Linked Data isn't already the core of the Web. The absolute fact of the
  matter is that Linked Data has been the core of the Web since it was an idea
  [1][2].

  The Web doesn't work at all if HTTP URIs aren't names for:

  [1] What exists on the Web
  [2] What exists, period.

  We just have the misfortune of poor communications mucking up proper
  comprehension of AWWW. For example, RDF should have been presented to the
  world as an effort by the W3C to standardize an existing aspect of the Web,
  i.e., the ability to leverage HTTP URIs as mechanisms for:

  1. entity identification & naming
  2. entity description using sentences or statements -- where (as is the case
  re., natural language) a sentence or statement is comprised of a subject,
  predicate, and object.

  Instead, we ended up with an incomprehensible, indefensible, and at best
  draconian narrative that has forever tainted the letters R-D-F. And to
  compound matters, HttpRange-14 has become a censorship tool (based on its
  ridiculous history) that blurs fixing this horrible state of affairs.

  Links:

  [1] http://bit.ly/10Y9FL1 -- Evidence that Linked Data was always at the core
  of the Web (excuse some instability on my personal data space instance, at
  this point in time, should you encounter issues looking up the document
  identified by this HTTP URI)

  [2] http://kidehen.blogspot.com/2014/03/world-wide-web-25-years-later.html
  -- World Wide Web, 25 years later

  [3] http://media-cache-ak0.pinimg.com/originals/04/b4/79/04b4794ccf2b6fd14ed3c822be26382f.jpg
  -- Illustrating identification (naming) on the Web (re., things that exist
  on or off the Web medium).

  --
  Regards,

  Kingsley Idehen
  Founder & CEO
  OpenLink Software
  Company Web: http://www.openlinksw.com
  Personal Weblog 1: http://kidehen.blogspot.com
  Personal Weblog 2: http://www.openlinksw.com/blog/~kidehen
  Twitter Profile: https://twitter.com/kidehen
  Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
  LinkedIn Profile: http://www.linkedin.com/in/kidehen
  Personal WebID: http://kingsley.idehen.net/dataspace/person/kidehen#this






-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Freebase Identifiers & New Browsing Interface for :BaseKB

2014-09-17 Thread Paul Houle
There is some more documentation on looking up Freebase identifiers in RDF

http://blog.databaseanimals.com/identifiers-in-freebase

Note also that there is now an unofficial browsing interface for :BaseKB
that works quite well,  and now at least something better than a 404
happens when you hit rdf.basekb.com

http://blog.databaseanimals.com/the-unofficial-basekb-browsing-interface
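
As a minimal sketch of what a lookup looks like once :BaseKB is loaded into a
SPARQL endpoint (ns:m.0g_2bv is just an example mid):

  PREFIX ns: <http://rdf.freebase.com/ns/>
  SELECT ?p ?o
  WHERE { ns:m.0g_2bv ?p ?o }
  LIMIT 50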

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com
http://legalentityidentifier.info/lei/lookup


Re: URIs within URIs

2014-08-25 Thread Paul Houle
One of the advantages of bNodes is that they don't have names so that
people can't add things to them.  This is useful in the case of RDF
Collections and in places of the OWL spec where you can use them to say
that 'these things are in the collection' and others can't add to them.
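
For example (a sketch with a made-up ex: namespace),  an owl:oneOf class is
closed precisely because its members sit in an RDF collection whose list
nodes are bNodes that nobody else can extend:

  @prefix ex:  <http://example.org/> .
  @prefix owl: <http://www.w3.org/2002/07/owl#> .

  ex:PrimaryColor a owl:Class ;
      owl:oneOf ( ex:Red ex:Green ex:Blue ) .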


On Mon, Aug 25, 2014 at 11:17 AM, Ruben Verborgh ruben.verbo...@ugent.be
wrote:

  bnodes are Semantic Web, but not Linked Data.
  If a node doesn't have a universal identifier, it cannot be addressed.
  I find this comment strange.
  If you mean that I can’t query using a bnode, then sure.
  If you mean that I never get any bnodes back as a result of a Linked
 Data URI GET, then I think not.

 Yes, you can get back bnodes.
 But the identifier of a bnode has only meaning in the document it is
 contained in.
 Hence, you cannot ask the server anything else about this bnode,
 because you don't have an identifier for it that exists outside of that
 one document.

 Therefore, it's maybe better to not get back bnodes at all;
 except if the server is sure the client cannot ask further meaningful
 questions about them
 (for instance, when all triples about a bnode were already in the response,
  as is the case with lists, and some other situations as well).

 Best,

 Ruben




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com


Re: Available approaches for keyword based querying RDF federations

2014-08-13 Thread Paul Houle
I would tend to stick up for a non-federated approach,  in the sense of
gathering a lot of federated data into a centralized knowledge base and
then querying that.  This is akin to how Google or Bing does web search by
crawling the web and forming a distributed index.

I can point to a number of reasons for this,  but some major ones are

* many of the better IR algorithms depend on corpus-wide statistics,  topic
modeling,  and other methods that need a global view (or at least a good
sample of a global view)
* even distributed search systems such as Solr (which in contrast to
federated search are well controlled because the machines are in the same
data center,  there is a deliberate approach to dealing with failures,
 etc.) are not terribly scalable for the following reason.  If you run
queries against N shards,  the query cannot complete before the slowest
shard answers,  so its latency is the maximum of the per-shard response
times.  As N gets bigger the probability that some glitch happens gets
bigger and bigger -- if each shard answers within your latency budget 99%
of the time,  all 20 shards do so only about 82% of the time.  Specifically,
when N > 10 it is pretty hard to maintain an acceptable response time for
interactive use.

I'd say practically centralized search engines like Google and Bing have
won the internet search war.  For various reasons, meta-search,  deep web
search and similar services haven't really caught on.



On Wed, Aug 13, 2014 at 8:12 AM, Thilini Cooray 
thilinicooray.u...@gmail.com wrote:

 Hi,

 I would like to know available approaches for  keyword based querying RDF
 federations.

 I found the following approach :
 FedSearch: Efficiently Combining Structured Queries and Full-Text Search
 in a SPARQL Federation by
 Andriy Nikolov
 http://link.springer.com/search?facet-author=%22Andriy+Nikolov%22,
 Andreas Schwarte
 http://link.springer.com/search?facet-author=%22Andreas+Schwarte%22,
 Christian Hütter
 http://link.springer.com/search?facet-author=%22Christian+H%C3%BCtter%22

 I would like to know whether there are any other approaches.

 Regards,
 Thilini Cooray




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com


Re: Call for Linked Research

2014-07-30 Thread Paul Houle
I think it's a little more than tax avoidance.  It's more that it seems much
easier for Elsevier to extort huge amounts of money than to get people to
pay a little bit of money for services that are inexpensive to provision.

If you take the amount that a commercial journal gets in subscription fees
and divide that by the number of papers it publishes you typically get a
number that is upwards of $10,000.

If you look at a well-run non-profit publisher,  such as the American
Physical Society,  it comes closer to $2000.

Neither of these figures counts the unpaid work of reviewers, the editorial
board,  etc.

When I worked at arXiv.org and divided the size of the budget by the number
of papers we handled,  we'd get a number more like $5 a paper.

arXiv could have been quite the sustainable business if it had managed to
get just 1/1000 the value per paper that commercial journal publishers get.

For a long time,  arXiv was able to run at Los Alamos labs but,  with the
Republicans in power (who tend to want to close LANL and move the weapons
work to LLNL) Paul Ginsparg decided it was time to get out and he brought
it to the Cornell Library.

When I was involved in the mid-00's arXiv represented perhaps 4% of the
budget of Cornell Library but probably delivered more value to end users
than the rest of the library put together -- back then,  50,000 scientists
got up every morning and looked at arXiv to see what was new in their
fields and now the numbers are certainly more than that.  The cost of
running arXiv was much smaller than the check that the library cut yearly
to Elsevier.

The short story is that CUL,  like most of the real jewels of Cornell,  was
seen as a cost center and not an opportunity center and faced intense
budget screw tightening and a lot of crazy stuff happened and one side
effect was that I left.  After about a decade of penury and confusion,
 arXiv finally got a sustainability plan that ensures it will continue in
penury

https://confluence.cornell.edu/display/culpublic/arXiv+five-year+member+pledges;jsessionid=9D9BC6ABCE2A76E4FDBC615553AA828B

This was all the more painful to endure because I saw so much larger amount
of funding going down various black holes.  For instance,  there was the
postdoc in the office next door who was supposed to use a supercomputer to
analyze the usage log of a project that cost $2 M a year to develop,
 except after extracting the robots he could have printed out the logs on a
line printer and done the analysis by hand (I'm not kidding about this!)
 Then there was the foundation that got a $20M endowment to make a handful
of journals available to a handful of institutions in a handful of 4th world
countries.  If arXiv had gotten that,  it would have been free papers for everyone
everywhere forever.



I've recently developed a system for scalable RDF publishing pretty much at
cost.  The first round of products includes

https://aws.amazon.com/marketplace/pp/B00KDO5IFA

that offer unlimited access with no throttling since users pay for their
own hardware.  (Practically this means:  try to use the DBpedia SPARQL
endpoint or Freebase MQL = PROJECT FAILURE,  use RDFeasy = IT JUST WORKS)

I'm not going to accept any objections that this product is too expensive
because at 45 cents an hour it is 5% of the cost of a minimum wage worker
in the U.S. and if you are using it for R&D you probably only need to run
it when that worker is working.
money rolling it on your own if you think people's labor is worth anything
at all,  because it doesn't take very much screwing around to waste $200 of
labor,  even at grad student rates.

And while I'm ranting,  I'll also call your attention to this

https://www.gittip.com/paulhoule/

This is a campaign where I collect money to pay my server bills.  If I get
more money,  I can offer more services and spend more time improving things
(HELPING *YOU* SUCCEED AT YOUR PROJECTS)  I'm grateful to the people who
are contributing,  but the way things now I am spending a lot of time
hustling up work (i.e. helping companies like Elsevier keep Lucene 3
installations running) rather than doing the work I can do best.  Money you
donate here does not go to university administration overhead,  owners of
San Francisco real estate,  or other leeches and rent seekers.

If you have an issue with :BaseKB you can talk with me and I can do
something about it.  If you have an issue with Freebase,  go talk to the
hand at the evil empire.




On Wed, Jul 30, 2014 at 9:50 AM, Gannon Dick gannon_d...@yahoo.com wrote:


 
 On Wed, 7/30/14, Giovanni Tummarello g.tummare...@gmail.com wrote:

   So Sarven let us be rational and pick Occam's razor style simplest
  explanation ...

  By lex parsimoniae (Occam's Razor) Tax Avoidance is magic.
 By Bell's Theorem, Tax Avoidance is theft (of services).
 Theft of Software As A Service is ... making me dizzy and diz-interested.








-- 
Paul

Re: Call for Linked Research

2014-07-29 Thread Paul Houle
 that goes into it is what you and others can get back. If you are content
 to not be able to discover interesting or relevant parts of others people's
 knowledge using the technologies and tools that's in front of you, there is
 nothing to debate about here.

 Like I said, it is mind-boggling to think that the SW/LD research
 community is stuck on 1-star Linked Data. Is that sinking in yet?

 -Sarven
 http://csarven.ca/#i




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com


More Freebase-in-RDF documentation

2014-07-29 Thread Paul Houle
http://blog.databaseanimals.com/how-to-introspect-the-freebase-schema-with-sparql
http://blog.databaseanimals.com/compound-value-types-in-rdf
http://blog.databaseanimals.com/how-to-write-sparql-queries-against-freebase-data

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com


Re: Call for Linked Research

2014-07-28 Thread Paul Houle
I'd add to all of this publishing the raw data,  source code,  and
industrialized procedures so that results are truly reproducible,  as
few results in science actually are.

On Mon, Jul 28, 2014 at 9:01 AM, Sarven Capadisli i...@csarven.ca wrote:
 Call for Linked Research
 

 Purpose: To encourage the do it yourself behaviour for sharing and reusing
 research knowledge.

 Deadline: As soon as you can.

 From http://csarven.ca/call-for-linked-research :


 Scientists and researchers who work in Web Science have to follow the rules
 that are set by the publisher; researchers need to have read and reuse
 access to other researchers work, and adopt archaic desktop-native
 publishing workflows. Publishers try to remain as the middleman for
 society’s knowledge acquisition.

 Nowadays, there is more machine-friendly data and documentation made
 available by the public sector than the Linked Data research community. The
 general public asks for open and machine-friendly data, and they are
 following up. Web research publishing on the other hand, is stuck on one ★
 (star) Linked Data deployment scheme. The community has difficulty eating
 its own dogfood for research publication, and fails to deliver its share of
 the promise.

 There is a social problem. Not a technical one. If you think that there is
 something fundamentally wrong with this picture, want to voice yourself, and
 willing to continue to contribute to the Semantic Web vision, then please
 consider the following before you write about your research:

 Linked Research: Do It Yourself

 1. Publish your research and findings at a Web space that you control.

 2. Publish your progress and work following the Linked Data design
 principles. Create a URI for everything that is of some value to you and may
 be to others e.g., hypothesis, workflow steps, variables, provenance,
 results etc.

 3. Reuse and link to other researchers URIs of value, so nothing goes to
 waste or reinvented without good reason.

 4. Provide screen and print stylesheets, so that it is legible on screen
 devices and can be printed to paper or output to desktop-native document
 formats. Create a copy of a view for the research community to fulfil
 organisational requirements.

 5. Announce your work publicly so that people and machines can discover it.

 6. Have an open comment system policy for your document so that any person
 (or even machines) can give feedback.

 7. Help and encourage others to do the same.

 There is no central authority to make a judgement on the value of your
 contributions. You do not need anyone’s permission to share your work, you
 can do it yourself, meanwhile others can learn and give feedback.

 -Sarven
 http://csarven.ca/#i





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Call for Linked Research

2014-07-28 Thread Paul Houle
Industrialization is about simplifying procedures and documenting them so
that you don't need grad students and postdocs to do everything.  To
understand more of what I mean,  look at

http://www.amazon.com/The-E-Myth-Revisited-Small-Businesses/dp/0887307280
http://www.amazon.com/Quality-Free-Certain-Becomes-Business/dp/0070145121/
http://www.amazon.com/Checklist-Manifesto-How-Things-Right/dp/031243/
http://www.amazon.com/Out-Crisis-W-Edwards-Deming/dp/0262541157/

If you're interested in the question of work life balance I may suggest

http://www.amazon.com/Power-Full-Engagement-Managing-Performance/dp/0743226755/

On Mon, Jul 28, 2014 at 1:00 PM, Gannon Dick gannon_d...@yahoo.com wrote:
 Hi Paul,

 Could you elaborate on what you mean by industrialized procedures?  I've been
 working on Work-Life Balance issues for a long time.  Rest is a man-made
 phenomenon, yet a Physician has no easy way to prescribe (relatively) exact
 doses.

 The math is straightforward if you know how the magic FFT Black Boxes work,
 but the consensus on the standards for industrialized procedures is
 lacking.  Employers are likely to perceive comment on their Social Conscience
 in an Employee exhaustion diagnosis, and that might be the best result one
 could hope for.

 --Gannon
 
 On Mon, 7/28/14, Paul Houle ontolo...@gmail.com wrote:

  Subject: Re: Call for Linked Research
  To: Sarven Capadisli i...@csarven.ca
  Cc: Linking Open Data public-lod@w3.org, SW-forum semantic-...@w3.org
  Date: Monday, July 28, 2014, 9:16 AM

   I'd add to all of this publishing the raw data,  source code,  and
   industrialized procedures so that results are truly reproducible,  as
   few results in science actually are.

   On Mon, Jul 28, 2014 at 9:01 AM, Sarven Capadisli i...@csarven.ca wrote:
    Call for Linked Research

   --
   Paul Houle
   Expert on Freebase, DBpedia, Hadoop and RDF
   (607) 539 6254   paul.houle on Skype   ontolo...@gmail.com




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Call for Linked Research

2014-07-28 Thread Paul Houle
A publishing venue I'll suggest to all of you is

http://arxiv.org/

which actually does get some papers relevant to the semantic web,  see

http://arxiv.org/find/all/1/abs:+rdf/0/1/0/all/0/1

arXiv is something that fits the traditional paper-based model,  but
has open access.  It is the biggest physics publisher on the planet,
and whatever bad blood there could be with journals has been decided
in favor of arXiv largely because the APS had a PMA about it back
in the 1990s,  when I was a grad student.

I can't speak for the ACM,  but if you are interested in the politics
the czar for computer science at arXiv is Joseph Halpern at Cornell
and you should talk with him: halp...@cs.cornell.edu

On Mon, Jul 28, 2014 at 12:22 PM, Sarven Capadisli i...@csarven.ca wrote:
 On 2014-07-28 16:16, Paul Houle wrote:

 I'd add to all of this publishing the raw data,  source code,  and
 industrialized procedures so that results are truly reproducible,  as
 few results in science actually are.


 On Mon, Jul 28, 2014 at 9:01 AM, Sarven Capadisli i...@csarven.ca wrote:

 2. Publish your progress and work following the Linked Data design
 principles. Create a URI for everything that is of some value to you and
 may
 be to others e.g., hypothesis, workflow steps, variables, provenance,
 results etc.



 Agreed, but I think point 2 covers that. It was not my intention to give a
 complete coverage of the scientific method. Covering reproducibility is a
 given. It also goes for making sure that all of the publicly funded research
 material is accessible and free. And, one should not have to go through a
 3rd party service (gatekeepers) to get a hold of someone else's knowledge.

 If we can not have open and free access to someone else's research, or
 reproduce (within reasonable amount of effort), IMO, that research *does
 not exist*. That may not be a popular opinion out there, but I fail to see
 how such inaccessible work would qualify as scientific. Having to create an
 account on a publisher's site, and pay for the material, is not what I
 consider accessible. Whether that payment is withdrawn directly from my
 account or indirectly from the institution I'm with (which still comes out
 of my pocket).

 Any way, this is discussed in great detail elsewhere by a lot of smart
 folks. Like I said, I had different intentions in my proposal i.e., DIY.
 Control your own publishing on the Web. If you must, hand out a copy e.g.,
 PDF, to fulfil your h-index high-score.

 -Sarven
 http://csarven.ca/#i




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Updated LOD Cloud Diagram -freebase and :baseKB

2014-07-25 Thread Paul Houle
 to describe your dataset in the catalog are found
 here:

 https://www.w3.org/wiki/TaskForces/CommunityProjects/LinkingOpenData/
 DataSets/CKANmetainformation

 Please make sure that you include information about the RDF links
 pointing from your dataset into other datasets (field links: ) as well as a
 tag indicating the topical category of your dataset, so that we know how to
 include it into the diagram.
 Please also include an example URI from your dataset into the catalog.

 We will start to review the new datasets and to draw the updated version
 of the LOD cloud diagram after August 8th.
 So please point us at datasets to be included before this date.

 Cheers,

 Max, Heiko, and Chris


 --
 Prof. Dr. Christian Bizer
 Data and Web Science Research Group
 Universität Mannheim, Germany
 ch...@informatik.uni-mannheim.de
 www.bizer.de




 --
 Hugh Glaser
20 Portchester Rise
Eastleigh
SO50 4QS
 Mobile: +44 75 9533 4155, Home: +44 23 8061 5652








-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



How to write SPARQL queries against Freebase data

2014-07-18 Thread Paul Houle
See

http://blog.databaseanimals.com/how-to-write-sparql-queries-against-freebase-data

and also

http://blog.databaseanimals.com/the-trouble-with-dbpedia
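
The short version,  as a sketch (assuming the Freebase RDF dump or :BaseKB
loaded into your own endpoint;  the mid is just an example):

  PREFIX ns: <http://rdf.freebase.com/ns/>
  SELECT ?name
  WHERE {
    ns:m.0g_2bv ns:type.object.name ?name .
    FILTER(lang(?name) = "en")
  }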

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Linked Data and Semantic Web CoolURIs, 303 redirects and Page Rank.

2014-07-18 Thread Paul Houle
  and it does allow a certain amount of google page
 rank to flow.
 301 Moved Permanently is a poor fit for the Cool URI pattern, but passes
 on the full page rank of the links.
 rewriting all URIs to URLs would also work, but would break the Cool URI
 pattern.

 The pragmatist in me feels that if we are going to make a change for the
 purposes of SEO, it might as well be the one with best return, i.e. 301
 redirect.

 Note: Indexing is not the problem here, content is indexed.  The issue
 relates to page rank not flowing through a 303 redirect.

 I have tested and can confirm that 303 redirects are an issue for a number
 of reasons:

 page rank does not flow through a 303 redirect
 page rank can not be assigned from a url to a uri with a rel=canonical tag
 if URI does a 303 redirect (preventing aggregation of pagerank from external
 links to URL)
 URI and URL are indexed separately
 rdfa schema.org representations of URIs do not translate to URL (ie.
 representation described at URL A, talking about URI B, does not get
 connected to representation described at URL B)
 url parameters are not passed by a 303 redirect.
 impact on functionality of google analytics tracking e.g. traversing the site
 is seen as a series of direct page visits.

 Essentially - as far as search engines are concerned - every URL and URI is
 an island, with no connections between them.  At best a URL can express a
 rel=canonical back to its corresponding URI; no pagerank will flow through
 links.


 Any guidance you can provide would be appreciated.

 --

 o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
 | Mark Fallu
 | Manager, Research Data (Acting)
 | Office for Research
 | Bray Centre (N54) 0.10E
 | Griffith University, Nathan Campus
 | Queensland 4111 AUSTRALIA
 |
 | E-mail: m.fa...@griffith.edu.au
 | Mobile:  04177 69778
 | Phone:  +61 (07) 373 52069
 o-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Real-world concept URIs

2014-07-17 Thread Paul Houle
I can't speak for other countries in North, South and Central America,
 but I can say that the United States does not have an official
language,  even though people who hate immigrants wish it did.

On Thu, Jul 17, 2014 at 3:30 PM, Gannon Dick gannon_d...@yahoo.com wrote:


    If we want to differentiate between I like the zebra; I don't like
   the document about the zebra.

   But why do they need to be on the same domain? Several parties on
   different domains can represent information about the animal zebra.
   They just seem like different things to me.

 ===
 There is a what's the problem again ? component to the problem (rinse, 
 repeat).

 As evidence, I offer two factoids:
 a) The EU has 24 Official languages (http://europa.eu/)
 b) Americans speak 100+ languages at home 
 (http://www.census.gov/hhes/socdemo/language/) and have one Official 
 language.

 It seems to me those are two solutions to the problem.
 What's the problem again ? :-)

 --Gannon






-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Attribute or Property Ontology?

2014-07-13 Thread Paul Houle
I would say take a look at the Freebase Metaschema

https://developers.google.com/freebase/v1/search-metaschema

The documentation is atrocious,  but this information is available in
standards-compliant (i.e. won't crash your tools) format

http://basekb.com/

One really cool thing is that it can map property paths to properties,
 which is essential given the heavy use of compound value types in
Freebase.
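
As a sketch of why that path-to-property mapping matters (assuming :BaseKB in
a SPARQL 1.1 endpoint;  check the people.person.spouse_s /
people.marriage.spouse names against your copy of the schema):

  PREFIX ns: <http://rdf.freebase.com/ns/>
  SELECT ?person ?spouse
  WHERE { ?person ns:people.person.spouse_s/ns:people.marriage.spouse ?spouse }
  LIMIT 10

The marriage CVT disappears into the property path,  which is roughly the
single 'spouse' property the metaschema hands you.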

On Thu, Jul 10, 2014 at 10:42 PM, Mike Bergman m...@mkbergman.com wrote:
 Hi All,

 I have been looking for an ontology that organizes and describes possible
 characteristics or attributes for common entity types, such as what might be
 found in a key-value pair in Wikipedia infoboxes and such.

 I have had no luck finding such a vocabulary or ontology. The closest
 representation I found was one related to sensors and the Internet of Things
 (IoT) [1]. The Wolfram Language also has an interesting structure around
 units [2]. Biperpedia has recently been discussed by Google [3], but no
 actual ontology or structure yet appears available for inspection.

 Does anyone know of a general ontology for capturing record/entity
 attributes or characteristics (properties)? I know some domains like
 biomedical may have partial approaches to this, but I'm seeking something
 that has as its intent being a general-purpose attribute reference.

 Suggestions or pointers would be greatly appreciated.

 Thanks, Mike

 [1]
 http://eprints.eemcs.utwente.nl/23734/01/CICARE2013_-_Brandt_et_al_-_Semantic_interoperability_in_sensor_applications_-_final_version.pdf
 [2] http://reference.wolfram.com/language/guide/Units.html
 [3] http://infolab.stanford.edu/~euijong/biperpedia.pdf




-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Encoding an incomplete date as xsd:dateTime

2014-06-25 Thread Paul Houle
I don't think you want to go down the road of comparing MARC and the
semantic web.  Young librarians resent all the numeric codes,  but
people walk into libraries all over the world and use MARC records and
never even know they are doing it.  The designers of MARC made good
decisions that have kept the oldest standard format with variable
length fields going strong.

You're free to publish whatever vocabulary you want and that's why the
person who wants to merge the mess you published with somebody else's
mess needs to have one ring to rule them all somewhere in their
system.  They might not even want to use it all the time,  but as the
scale of what you ingest grows,  you need to be able to tie a bow on
to a solution for this one and move on.

I think what's great about RDF and SPARQL is that there is a clear
path as to how to add new data types and functions on those types,  so
figuring out the algebra of EDTF and making it available as SPARQL
functions ought to be doable.
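
Purely as an illustration of what that could look like (hypothetical edtf:
function namespace and ex: data,  nothing standardized):

  PREFIX ex:   <http://example.org/>
  PREFIX edtf: <http://example.org/edtf-fn#>
  SELECT ?person
  WHERE {
    ?person ex:born ?when .                   # ?when might be "1997" or "1997-06"
    FILTER(edtf:intersects(?when, "1997"))    # true if the two EDTF intervals overlap
  }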



On Wed, Jun 25, 2014 at 10:42 AM, Jerven Bolleman m...@jerven.eu wrote:
 Hi Bernard,

 Please do not go down the stuff everything in a single literal format
 that EDTF is. That is not a semweb solution that is a MARC solution.
 If that level of detail is really needed modeling it using the
 TimeOntology plus extra information for seasons etc.. is the correct
 way to go.

 Problems for EDTF is that OWL reasoners don't understand it. You can't
 SPARQL with it. And in general is just not appropriate in actually
 describing the time ranges.
 In other words I do not think this is appropriate for a standard
 ontology like vCard.

 Regards,
 Jerven

 On Wed, Jun 25, 2014 at 4:26 PM, Bernard Vatant
 bernard.vat...@mondeca.com wrote:
 Hi all

 Are you aware of the Library of Congress Extended Date/Time Format (EDTF)?
 There was an interesting presentation at DC 2013 about its implementation in
 real world
 http://dcevents.dublincore.org/IntConf/dc-2013/paper/view/183

 Bernard


 2014-06-25 16:00 GMT+02:00 Paul Houle ontolo...@gmail.com:

 I've been thinking about date representations a lot lately.  Even if
 you're going to cobble something together out of the various XSD
 types,  it still helps to have a theory.

 A better underlying data type for dates is a time interval or set of
 time intervals.

 This represents the fact that many events happen over a time
 interval (such as a meeting or movie show time),  that we often only
 know a year or a day,  that things are measured on idiosyncratic time
 basis such as the fiscal years of various organizations,  that there
 are both practical and theoretical limits on both the precision and
 accuracy of time measurements.

 Intervals have their charms,  but if you include interval sets you can
 also represent concepts such as Monday, June 25 and the third
 Tuesday of the month.

 Of course,  it creates trouble that there is no total ordering over
 intervals/interval sets,  but that's a fundamental problem to any
 flexible time representation.

 On Mon, Feb 10, 2014 at 9:37 AM, Heiko Paulheim
 he...@informatik.uni-mannheim.de wrote:
  Hi all,
 
  xsd:dateTime and xsd:date are used frequently for encoding dates in RDF,
  e.g., for birthdays in the vcard ontology [1]. Is there any best
  practice to
  encode incomplete date information, e.g., if only the birth *year* of a
  person is known?
 
  As far as I can see, the XSD spec enforces the provision of all date
  components [2], but 1997-01-01 seems like a semantically wrong way of
  expressing that someone is born in 1997, but the author does not know
  exactly when.
 
  Thanks,
  Heiko
 
  [1] http://www.w3.org/2006/vcard/ns
  [2] http://www.w3.org/TR/xmlschema-2/#dateTime
  [3] http://www.w3.org/TR/xmlschema-2/#date
 
  --
  Dr. Heiko Paulheim
  Research Group Data and Web Science
  University of Mannheim
  Phone: +49 621 181 2646
  B6, 26, Room C1.08
  D-68159 Mannheim
 
  Mail: he...@informatik.uni-mannheim.de
  Web: www.heikopaulheim.com
 
 



 --
 Paul Houle
 Expert on Freebase, DBpedia, Hadoop and RDF
 (607) 539 6254paul.houle on Skype   ontolo...@gmail.com




 --
 Bernard Vatant
 Vocabularies  Data Engineering
 Tel :  + 33 (0)9 71 48 84 59
 Skype : bernard.vatant
 http://google.com/+BernardVatant
 
 Mondeca
 35 boulevard de Strasbourg 75010 Paris
 www.mondeca.com
 Follow us on Twitter : @mondecanews
 --



 --
 Jerven Bolleman
 m...@jerven.eu



-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



RDFeasy DBpedia Experience

2014-06-11 Thread Paul Houle
I'm proud to announce that the RDFeasy DBpedia Experience is now
available on the AWS Marketplace

https://github.com/paulhoule/RDFeasy/wiki/RDFeasy-DBpedia-Experience
https://aws.amazon.com/marketplace/pp/B00KQPGYYA

Experience SPARQL 1.1 queries with the Virtuoso 7 column store,  SSD
storage,  and the Intel Xeon E5-2670 v2 processor with hardware
virtualization support.  Although its 399,800,349 facts include many
not found in the public SPARQL endpoint,  including the Wikipedia
pagelinks dataset,  query performance is similar to the public
endpoint,  with greatly raised timeouts and query limits to support
demanding training and R&D workloads.

This data distribution was built using RDFEasy Zero

https://github.com/paulhoule/RDFeasy/wiki/RDFeasy-Zero
https://aws.amazon.com/marketplace/pp/B00KRI3DWW

an Amazon Marketplace AMI that contains tools and protocols to package
RDF data into an AMI that meets the requirements of the Amazon
Marketplace.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



New RDFeasy AMIs help you build your own knowledge graph

2014-06-06 Thread Paul Houle
I just got the Complete Edition of :BaseKB approved at the AWS marketplace

https://github.com/paulhoule/RDFeasy/wiki/RDFeasy-BaseKB-Gold-Complete
https://aws.amazon.com/marketplace/pp/B00KRKRYW0

Containing all valid and relevant facts from Freebase,  this product
contains about twice as much data as the Compact Edition

https://github.com/paulhoule/RDFeasy/wiki/RDFeasy-Zero
https://aws.amazon.com/marketplace/pp/B00KDO5IFA

and thus requires a machine that is twice the size.  The RDFeasy
distributions are the only RDF data products that meet the standards
of the Amazon Marketplace and are particularly economical because they
use SSD storage that comes free with the machine instead of,  like
some other distribution,  being dependent on expensive provisioned EBS
I/O which costs an additional $120 or so per month even when you
aren't running the instance.

People are used to RDF processing of billion triple files being
difficult and expensive and are often skeptical about RDFeasy but when
people try it,  they can feel the difference with their legacy
solutions right away.

RDFeasy is open source software,  with documented operation protocols,
 but you can use the following Base AMI

https://github.com/paulhoule/RDFeasy/wiki/RDFeasy-Zero
https://aws.amazon.com/marketplace/pp/B00KRI3DWW

to get a bundle of hardware and software into which you can load your
own RDF data and package it as an AMI which can also be distributed in
the AWS Marketplace.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



RDFEasy: scalable Freebase SPARQL queries in the AWS cloud

2014-05-22 Thread Paul Houle
The first RDFEasy product is ready.

RDFEasy BaseKB Gold Compact Edition is a SPARQL database based on the
Compact Edition of :BaseKB

https://github.com/paulhoule/RDFeasy/wiki/Compact-Edition

which is a good set of facts to work from if you are interested in
Freebase.  You can experience it in the most popular cloud environment
by going to

https://aws.amazon.com/marketplace/pp/B00KDO5IFA

and making a single click.  Once the instance is provisioned you can
follow the instructions here to log in and start making queries

https://github.com/paulhoule/RDFeasy/wiki/Basic-Usage

Hardware and software inclusive,  this product costs 45 cents an
hour with the default configuration,  which can host the data set on
an internal SSD and handle the bruising query workloads associated
with knowledge base development.  Thus,  anyone who wants to try
powerful SPARQL 1.1 queries with Virtuoso 7 against Freebase data can
get started with very little time and money.

Some of you will want the whole enchilada,  and that is in the
pipeline too:  a complete copy of :BaseKB including text descriptions
and notability information.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Alternative Linked Data principles

2014-04-30 Thread Paul Houle

Practically,  though,  I think of dereferencing as a special case of a
graph fragment server.  For some of the work I do,  the most
attractive algorithms use a key-value store where the key is a subject and
the value is an RDF graph.  For a site like this

http://ookaboo.com/o/pictures/

I think you'd want to present me a Linked Data URI from DBpedia,
Freebase or anywhere and have me tell you my opinion about it,  which in this
case is all about what pictures illustrate the topic.  Then you might
want to go to a lot of other authorities and ask what they think about
it.

A least-cost dereferencing implementation based on cloud technology
would be great,  but the next step is ask ?x about ?y.
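
In SPARQL terms,  ask ?x about ?y is roughly a query scoped to one
authority's graph;  a sketch (the graph name is hypothetical):

  SELECT ?p ?o
  WHERE {
    GRAPH <http://ookaboo.com/> {
      <http://dbpedia.org/resource/Central_Air_Force_Museum> ?p ?o
    }
  }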

Anyway,  if you like high quality data files that are easy to use,  a
small contribution you make to this

https://www.gittip.com/paulhoule/

will pay my operating costs and for expansion of a data processing
operation which is an order of magnitude more efficient than the
typical Master Data Management operation.














On Wed, Apr 30, 2014 at 6:37 AM, Melvin Carvalho
melvincarva...@gmail.com wrote:



 On 28 April 2014 17:23, Luca Matteis lmatt...@gmail.com wrote:

 The current Linked Data principles rely on specific standards and
 protocols such as HTTP, URIs and RDF/SPARQL. Because I think it's
 healthy to look at things from a different prospective, I was
 wondering whether the same idea of a global interlinked database (LOD
 cloud) was portrayed using other principles, perhaps based on
 different protocols and mechanisms.


 If you look at the design principles behind Linked Data (timbl's or even
 Brian Carpenter's) you'll find something called the TOII -- Test of
 Independent Invention.

 What that means is that if there were another system that had the same properties
 as the web, i.e. Universality, Tolerance, Modularity etc., using URIs, it would
 be guaranteed to be interoperable with Linked Data.  Linked data is and isn't
 special.  It isn't special in that it could be independently invented by an
 equally powerful system.  It is special in that as a first mover (just as
 the web) it has the advantage of a wide network effect.

 Have a look at timbl's presentation on the TOII or at design issues axioms
 and principles.

 http://www.w3.org/Talks/1998/0415-Evolvability/slide12-1.htm



 Thanks,
 Luca





-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



Re: Semantic Web culture vs Startup culture

2014-03-31 Thread Paul Houle
 What makes me laff is that the same people who think RDF sucks
think Neo4J is the bee's knees.  (Even if they've never quite shipped
an actual product with it,  or if they did a demo it performs worse
than the same demo I did with MySQL in 2002)

Somehow,  SPARQL has never been seen as a NoSQL and I don't know why.


On Mon, Mar 31, 2014 at 1:07 PM, Gannon Dick gannon_d...@yahoo.com wrote:
 I agree, Kingsley.

 Problems with SKOS (Lists) and RDF (Lists) are implementation problems, not 
 processing problems.  It is very difficult to prevent people from perceiving 
 a first, rest, nil sequence as a Monte Carlo integration of 
 probability.  From a young age we see that, if it is summer, winter is half a 
 year forward or back and vice-versa.  What good is SKOS or RDF if the graphs 
 do not show/(provide a visualization of) that seasonal straight line 
 depreciation accounting ?

 Dilemma Answer: make up a virtuous bookkeeper's scale and divide it by 4 
 (always possible) and call it a Quarterly Conference Calls and the last one 
 an Annual Report. Profits ? Sorry, absolutely no telling when 
 gravity=(1/1)=(2pi/2pi)=(360 Degrees/360 Degrees)=(Thing/sameAs), etc.). A 
 bookkeeper is always virtuous, maybe because they are exactly congruent to 
 virtue and maybe because they fear what a psychopathic authority might do to 
 them if they fail to tell them the truth scaled to what they want to hear.  
 That is not a probability either, it protects accomplices and keeps you and 
 your friends safe. foaf:Person does not always make that my team-other team 
 relation all present and accounted for.

 http://www.rustprivacy.org/2014/balance/eCommerceVision.jpg
 http://www.rustprivacy.org/2014/balance/CulturalHeritageVision.jpg

 Superstitious, bigoted Scientists are virtuous bookkeepers who often have to 
 decide if icebergs float because they are Witches or float because they are 
 Queer.  You can't resolve that culture war by calling Alan Turing dirty 
 names, and Implementers simply can not assume that an audience who knows what 
 recursion is also knows what recursion does.  That is a semantic mistake.
 --Gannon

 
 On Sun, 3/30/14, Kingsley Idehen kide...@openlinksw.com wrote:

  Subject: Re: Semantic Web culture vs Startup culture
  To: public-lod@w3.org
  Date: Sunday, March 30, 2014, 1:00 PM

  On 3/29/14 1:41 PM, Luca Matteis wrote:
   Started a sort of Semantic Web vs Startup culture war on Hacker News:
   https://news.ycombinator.com/item?id=7491925

   Maybe you all can help me with some of the comments ;-)

  My comments, posted to the list:

  RDF is unpopular because it is generally misunderstood. This
  problem arises (primarily) from how RDF has been presented
  to the market in general.
  To understand RDF you have to first understand what Data
  actually is [1]; once you cross that hurdle two things [2][3]
  will become obvious:

  1. RDF is extremely useful in regards to all issues relating
  to Data
  2. RDF has been poorly promoted.

  Links:
  [1] http://slidesha.re/1epEyZ1 -- Understanding Data
  [2] http://bit.ly/1fluti1 -- What is RDF, Really?
  [3] http://bit.ly/1cqm7Hs -- RDF Relation (RDF should
  really stand for: Relations Description Framework) .

  --
  Regards,

  Kingsley Idehen
  Founder & CEO
  OpenLink Software
  Company Web: http://www.openlinksw.com
  Personal Weblog: http://www.openlinksw.com/blog/~kidehen
  Twitter Profile: https://twitter.com/kidehen
  Google+ Profile: https://plus.google.com/+KingsleyIdehen/about
  LinkedIn Profile: http://www.linkedin.com/in/kidehen










-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontolo...@gmail.com



parallel rdfDiff

2013-12-05 Thread Paul Houle
I just released a version of Infovore that can do scalable
differencing of RDF data sets,  producing output in the RDF Patch
format

http://afs.github.io/rdf-patch/

The tool is written up here

https://github.com/paulhoule/infovore/wiki/rdfDiff

I ran this against two different weeks of Freebase data,  filtered
through my tools,  and produced the following output in a
requester-paid bucket

s3://basekb-now/weekly-diff-2013-12-01/

Nothing is specific to Freebase though,  and this tool ought to be
adaptable to any RDF data out there.

-- 
Paul Houle
Expert on Freebase, DBpedia, Hadoop and RDF
(607) 539 6254paul.houle on Skype   ontol...@gmail.com



On the horizontal decomposition of Freebase

2013-09-18 Thread Paul Houle
I am reporting the first really useful product from the work on Infovore,
 an open source framework for processing large RDF data sets with Hadoop.
 Even if you have no experience with Hadoop,  you can run Infovore in the
AWS cloud by simply providing your AWS credentials.

As of

https://github.com/paulhoule/infovore/releases/tag/t20130917

there is a first draft of 'sieve3',  which splits up an RDF data set into
mutually exclusive parts.  There is a list of rules that is applied to each
triple:  matching a rule diverts a triple to a particular output,  and
triples that fail to match any rule fall into a default output.

The horizontal subdivision looks like this

http://www.slideshare.net/paulahoule/horizontal-decomposition-of-freebase

Here are the segments

`a` -- rdf:type
`description`
`key` -- keys represented as expanded strings
`keyNs` -- keys represented in the key namespace
`label` -- rdfs:label
`name` -- type.object.name entries that are probably duplicative of
rdfs:label
`text` -- additional large text blobs
`web` -- links to external web sites
`links` -- all other triples where the ?o is a URI
`other` -- all other triples where the ?o is not a Literal

Overall this segmentation isn't all that different from how DBpedia is
broken down.

Last night I downloaded 4.5 GB worth of data from `links` and `other` out
of the 20 GB dump supplied by Freebase and I expect to be able to write
interesting SPARQL queries against this.  This process is fast,  completing
in about 0.5 hrs with a smallAwsCluster.  I think all of these data sets
could be of interest to people who are working with triple stores and with
Hadoop since the physical separation can speed most operations up
considerably.
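
For instance (a sketch,  assuming the `links` segment loaded into a local
endpoint),  the out-degree of the most heavily linked subjects is a
one-liner:

  SELECT ?s (COUNT(?o) AS ?outdegree)
  WHERE { ?s ?p ?o }
  GROUP BY ?s
  ORDER BY DESC(?outdegree)
  LIMIT 20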

The future plan for firming up sieve3 is to get Spring configuration
working inside Hadoop (I probably won't put Spring in charge of Hadoop
at first) so that it will be easy to create new rule sets by writing
either Java or XML.

This data can be download from the requester paid bucket

s3n://basekb-lime/freebase-rdf-2013-09-15-00/sieved/


Infovore 1.0 released

2013-01-07 Thread Paul Houle
I’m proud to announce the 1.0 release of Infovore,  a complete RDF
processing system

* a Map/Reduce framework for processing RDF and related data
* an application that converts a Freebase quad dump into standard-compliant RDF
* an application which creates consistent subsets of Freebase,
including compound value types,  about subsets of topics that can be
selected with SPARQL-based rules
* a query-rewriting system that enforces a u.n.a. (unique name assumption)
island while allowing the use of multiple memorable and foreign-keyed names
in queries,  as the MQL query engine does
* a whole-system test suite that confirms correct operation of the
above with any triple store supporting the SPARQL protocol

See the documentation here

https://github.com/paulhoule/infovore/wiki

to get started.

The chief advantage of Infovore is that it uses memory-efficient
streaming processing,  making it possible,  even easy,  to handle
billion-triple data sets on a computer with as little as  4GB of
memory.

Future work will likely focus on the validation and processing of the
official Freebase RDF dump as well as other large RDF data sets.



Does this patent application affect named graphs?

2012-06-14 Thread Paul Houle
  Hi,  I was checking out a rather good patent search engine and
was surprised to see this

http://ip.com/pat/US20100223224

  A few years ago I did a patent search for things that could have
come out of the Cyc project and was a bit surprised to find that I
couldn't find any.  It looks like somebody is trying to patent
microtheories about 16 years after Guha's thesis presented
microtheories as an extension of FOL.

 The language in the patent application looks pretty broad to me,
but I've got no idea if it will hold up.



Re: Fwd: Knowledge Graph links to Freebase

2012-06-09 Thread Paul Houle
My guess is that the 300M entities could be hot air for now.
Maybe they've got a second true graph with 300M entities in it,  but
it's probably not powering the production system.

Right now recall is low for the Google Knowledge graph because
they don't want to take the chance of showing spurious results.  Most
Freebase topics aren't showing up and they shouldn't.  Freebase is
full of twisty little objects that all look alike.  For instance,
there are 20 or so objects in Freebase named Sweet Home Alabama.
Almost all of the probability weight for this is on the radio edit,
but most of these are covers,  re-releases on greatest hits albums,
etc.  That's all very great data because it corresponds to real
observations of music in the wild,  but in the commonsense domain
these get squashed.

 Oddly,  Google loses the classic rock song entirely and turns up
a mediocre but commercially successful movie...

https://www.google.com/#hl=engs_nf=1tok=V0cZbCtNDVsjrfKATbImzwcp=7gs_id=7xxhr=tq=sweet+home+alabamapf=poutput=searchsclient=psy-aboq=sweet+haq=0aqi=g4aql=gs_l=pbx=1bav=on.2,or.r_gc.r_pw.r_qf.,cf.osbfp=f9be4f0b957a8550biw=1600bih=775

 The real value of the GKG may be in what gets deleted instead of
what gets added.

 Anyhow,  some things that ~could~ be in Freebase and aren't are

(1) Consumer Products,
(2) Local Businesses (think of what's in Foursquare or Factual),  and
(3) Google data about books

 #3 is the real sore spot.  We know Google has great metadata for
books,  but Freebase has loaded only a percentage of books from
OpenLibrary.  When I found that a number of books I was thinking about
weren't there they suggested that I finish the Open Library load
myself...

 Of course,  Google's book project is under a legal cloud and
their lawyers might feel that they aren't free to release the
metadata.



Ookaboo does #it

2010-11-16 Thread Paul Houle
Ookaboo has recently grown to contain about 480,000 pictures of
265,000 topics,  and we're adding 8,000 images a day,  so if you don't find
what you're looking for today,  just wait a few months.  People and
places are particularly well represented,  but you'll also find
miscellaneous things like

http://ookaboo.com/o/pictures/topic/12461285/Fingerpaint

We've just done a major upgrade of our RDFa output at Ookaboo.  For
one thing,  we've clarified the relationship between a page

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum

and the thing that a page describes

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it

We can then say,

ookaboo:topic/2021903/Central_Air_Force_Museum sioc:topic
ookaboo:topic/2021903/Central_Air_Force_Museum#it .

Generally,  we make SIOC assertions about web pages and other sorts of
assertions about #it topics.  The primary FOAF assertion we're making
right now is foaf:depicts,  but we're planning to make as many FOAF
assertions as possible about #it topics,  particularly about people.
The other class of page that is heavily marked up now is the
individual picture page.

http://ookaboo.com/o/pictures/picture/2022007/Kamov_Ka25_Hormon

Our strategy for dealing with multiple subject terminologies is to use
what we call a reference set,  which in this case is

http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it
http://dbpedia.org/resource/Central_Air_Force_Museum
http://rdf.freebase.com/ns/m.0g_2bv

If we want to assert foaf:depicts we assert foaf:depicts against all
of these.  The idea is that not all clients are going to have the
inferencing capabilities that I wish they'd have,  so I'm trying to
assert terms in the most core databases of the LOD cloud.
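
In Turtle that comes out roughly like this (the picture URI is a made-up
placeholder,  and the exact output Ookaboo emits may differ):

@prefix foaf: <http://xmlns.com/foaf/0.1/> .

<http://ookaboo.com/o/pictures/picture/9999999/Example_Picture#it>
    foaf:depicts
        <http://ookaboo.com/o/pictures/topic/2021903/Central_Air_Force_Museum#it> ,
        <http://dbpedia.org/resource/Central_Air_Force_Museum> ,
        <http://rdf.freebase.com/ns/m.0g_2bv> .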

In a case like this we may have YAGO,  OpenCyc,  UMBEL and other terms
available.  Relationships like this are expressed as

:Whatever ontology2:aka
<http://mpii.de/yago/resource/Central_Air_Force_Museum> .

ontology2:aka,  not dereferenceable yet,  means (roughly) that some
people use term X to refer to substantially the same thing as term Y.
 It's my own answer to the owl:sameAs problem and deliberately
leaves the exact semantics to the reader.  (It's a lossy expression of
the data structures that I use for entity management.)

Any thoughts or suggestions about what I should be doing differently or better?



Re: overstock.com adds GoodRelations in RDFa to 900,000 item pages

2010-10-07 Thread Paul Houle
On Wed, Oct 6, 2010 at 1:49 PM, Martin Hepp martin.h...@ebusiness-unibw.org
 wrote:



 It is too expensive to expect data owners to lift their existing data to
 academic expectations. You must empower them to preserve as much data
 semantics and data structure as they can provide ad hoc. Lifting and
 augmenting the data can be added later.


 Don't get the idea that academic expectations are better than
commercial expectations;  they're just different.

 The whole point of Ontology2 is to commercialize information extraction
with a philosophy very much like what these folks are doing:

http://rtw.ml.cmu.edu/papers/carlson-aaai10.pdf

  Now in some ways they've got something way more advanced than what
I've got:  however,  they say that their ontology is populated with 242,453
new facts with an estimated precision of 74%.

  For me,  I can't get away with an estimated precision of 74%;  I'd
look like a total fool publishing data that dirty on the web,  unless I can
find some way to conceal the dirt.  Talking with people who are interested
in semantic technology for e-commerce,  I find a common desire is to not
only reduce the cost of human labor but also to build systems that attain
superhuman accuracy in describing and categorizing products (at least better
accuracy than the people who are doing this job today.)

  [Note also that the rate of fact extraction these guys are doing isn't
so hot either... You can get 10^7-10^8 facts out of dbpedia+freebase
covering a similar domain]

  From a commercial viewpoint,  imperfect data is an opportunity.  If I
didn't have other projects ahead of it in the queue,  I'd seriously be
thinking about building a shopping aggregator that cleans up GoodRelations
and other data,  reconciles product identities,  categorizes products,
creates good product descriptions,  and makes something that improves on
current affiliate marketing and comparison shopping systems.

  Note that the beauty of an ontology is in the eyes of a user.  One
user might want to have a broad but vague ontology of products;  they are
happy to say that a digital camera is a :DigitalCamera.  Other people might
want to just cover the photography domain,  but do it in great detail --
describing not only the differences between digital cameras manufactured
today but also lenses,  and even covering,  in great detail,  vintage
cameras that you might find on eBay.

  You can't say that one of these ontologies is better than the other.
The best thing is to have all of these ontologies available [populated with
data!] and to pick and choose the ones that fit your needs.


Re: overstock.com adds GoodRelations in RDFa to 900,000 item pages

2010-10-07 Thread Paul Houle
On Wed, Oct 6, 2010 at 5:09 PM, Martin Hepp martin.h...@ebusiness-unibw.org
 wrote:
I've got mixed feelings about snippets vs fully embedded RDFa.  For
the most part I think systems that use snippets will be more maintainable,
but I've seen cases where fully embedded RDFa fits very well into a system
and there may be cases where the size of the HTML can be reduced by using it
-- and HTML size is a big deal in the real world where loading time matters
and we're increasingly targeting mobile devices.

The RDFa issue that really bugs me is that a linked data URI can be read
to signify a number of different things.  Consider,  for instance,

http://dbpedia.org/resource/Rainbow_Bridge_(Tokyo)

(i) This is a string.  It has a length.  It uses a particular subset of
available characters
(ii) This is a URI.  It has a scheme,  it has a host,  path,  might have a #
in it,  query strings,  all that;  a number of assertions can be made about
it as a URI
(iii) This is a document.  We can assert the content-type of this document
(or at least one version we've negotiated),  we can assert its charset,
length in bytes,  length in characters,  particular subset of available
characters used,  number of triples asserted directly in the document,  the
number of triples we can infer by applying certain rules to this in
connection with a certain knowledgebase,  and on and on
(iv) This signifies a Wikipedia article (some Wikipedia articles don't map
cleanly to a named entity)
(v) This signifies a named entity

The more I think about it,  the more it bugs me,  and it's all the worse
when you're using RDFa and you've got HTML documents.

For instance,  you could clearly see

http://ookaboo.com/o/pictures/topic/28999/Beijing

as a signifier for a city.  Some people would make the assertion that

dbpedia:Beijing owl:sameAs ookaboo:topic/28999/Beijing.

and that's not entirely stupid.  On the other hand,  it's definitely true
that

ookaboo:topic/28999/Beijing a sioc:ImageGallery .

Put something true together with a practice that's common and you get the
absurd result that

dbpedia:Beijing a sioc:ImageGallery .


How should I link a predicate?

2010-08-19 Thread Paul Houle
 I'm planning to define a few predicates because I think existing
predicates don't exactly express what I'm trying to say.

 Since a predicate is a URI,  there's the question of what should be
served up at the URI if somebody (a) types it into the browser,  or (b)
looks at it with a semweb client.

 What's the best thing to do here?  It might be lame,  but I'm thinking
about making the predicate URL do a 301 redirect to a CMS page that has a
human-readable description of the predicate.

 I suppose that a predicate URL page could also have some RDF assertions
on it about the predicate;  for instance,  a collection of OWL assertions
about it could be useful...  However,  beyond that,  I don't think the state
of the art in upper ontologies is good enough that we can really make a
machine-readable definition of what a predicate means at this time.
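
For what it's worth,  the RDF served at the predicate URI might be no more
than this (a sketch;  the predicate,  label,  comment and documentation URL
are all made-up placeholders):

@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

<http://example.org/ns#relatedTo>
    a rdf:Property ;
    rdfs:label "related to"@en ;
    rdfs:comment "A deliberately loose relationship;  see the human-readable page for the intended reading."@en ;
    rdfs:isDefinedBy <http://example.org/docs/relatedTo> .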

For the predicate that I need most immediately,  there's the issue that
there are optional OWL statements that could be asserted about it that would
provide an interpretation that some people would accept some of the time --
however,  I wouldn't be coining this predicate if I thought this
interpretation was 100% correct.  In this case,  I think the best I can do
is make a human-readable assertion that

You could put this assertion about my predicate in your OWL engine if you
wish

and leave it at that.

Any thoughts?


Re: Linked Data and IRI dereferencing (scale limits?)

2010-08-05 Thread Paul Houle
If you want to get something done with dbpedia,  you should (i) work from
the data dumps,  or (ii) give up and use Freebase instead.

I used to spend weeks figuring out how to clean up the mess in dbpedia until
the day I wised up and realized I could do in 15 minutes w/ Freebase what
takes 2 weeks to do w/ dbpedia,  because w/ dbpedia you need to do a huge
amount of data cleaning to get anything that makes sense.

The issue here isn't primarily RDF vs Freebase;  it's really a matter of
the business model (or lack thereof) behind dbpedia.  Frankly,  nobody gets
excited when dbpedia doesn't work,  and that's the problem.  For instance,
nobody at dbpedia seems to give a damn that dbpedia contains 3000
countries,  whereas there are more like 200 actual active countries in the
world...  Sure,  it's great to have a category for things like
Austria-Hungary and The Teutonic Knights,  but an awful lot of people
give up on dbpedia when they see they can't easily get a list of very basic
things,  like a list of countries.

Now,  I was able to,  more-or-less,  define "active country" as a
restriction type:  anything that has an ISO country code in Freebase is an
active country,  or is pretty close.  The ISO codes aren't in dbpedia
(because they're not in wikipedia infoboxes) so this can't be done with
dbpedia:  I'd probably need to code some complex rules that try to guess at
this based on category memberships and what facts are available in the
infobox.
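
Against a Freebase RDF dump the restriction is a one-pattern query,  roughly
(the property id is from memory of the Freebase schema and may not be spelled
exactly like this in a given dump):

PREFIX fb: <http://rdf.freebase.com/ns/>
SELECT DISTINCT ?country WHERE {
  ?country fb:location.country.iso3166_1_alpha2 ?code .
}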

I complained on both dbpedia and freebase discussion lists,  and found
that:  (i) nobody at dbpedia wants to do anything about this,  and (ii) the
people at freebase have investigated this and they are going to do something
about it.



In my mind,  anyway,  the semantic web is a set of structured boxes. It's
not like there's one T Box and one A Box but there are nested boxes of
increasing specificity.  In the systems I'm building,  a Freebase-dbpedia
merge is used as a sort of T' Box that helps to structure and interpret
information that comes from other sources.  With a little thinking about
data structures,  it's efficient to have a local copy of this data and use
it as a skeleton that gets fleshed out with other stuff.  Closed-world
reasoning about this taxonomic core is useful in a number of ways,
particularly in the detection of key integrity problems,  data holes,
inconsistencies,  junk data,  etc.  I think the dereference and merge
paradigm is useful once you've got the taxocore and you're merging little
bits of high-quality data,  but w/o control of the taxocore you're just
doomed.


RDF and its discontents

2010-07-02 Thread Paul Houle
Here are some of my thoughts

(1) The global namespace in RDF plus the concept that most knowledge can be
efficiently represented with triples are brilliant;  in the long term we're
going to see these two concepts diffuse into non-RDF systems because they
are so powerful.  I appreciate the way multiple languages are implemented in
RDF -- although imperfect,  it's a big improvement over what I've had to do
to implement multi-lingual digital libraries on relational systems.

(2) Yet,  the big graph and triple paradigms run into big problems when we
try to build real systems.  There are two paradigms I work in:  (i) storing
'facts' in a database,  and (ii) processing 'facts' through pipelines that
effectively do one or more full scans of the data;  type (ii) processes can
be highly scalable when they can be parallelized.

Now,  if hardware cost was no object,  I suppose I could keep triples in a
huge distributed main-memory database.  Right now,  I can't afford that.
 (If I get richer and if hardware gets cheaper,  I'll probably want to
handle more data,  putting me back where I started...)

Today I can get 100x performance increases by physically partitioning data
in ways that reflect the way I'm going to use it.  Relational databases are
highly mature at this,  but RDF systems barely recognize that there's an
issue.  Named graphs are a step forward in this direction,  but to make
something that's really useful we'd need both (a) the ability to do graph
algebra,  and (b) the ability to automatically partition 'facts' into
graphs.  That 'automatic' could be something similar to RDBMS practice (put
this kind of predicate in that graph,  put triples with this sort of
subject in that graph) or it could be something really 'intelligent',  that
can infer likely use patterns by reasoning over the schema and/or by
adaptive profiling of actual use (as Salesforce.com does to build a pretty
awesome OLTP system on top of what's a triple store at the core.)

Practically,  I deal with this by building hybrid systems that combine both
relational and RDF ideas.  If you're really trying to get things done in
this space,  however,  it's amazing how precarious the tools are.  For
instance,  I looked at a large number of data stores and wound up choosing
MySQL based on two fairly accidental facts:  (i) I couldn't get VARCHAR() or
TEXT() fields in other RDBMS systems to handle the full length of Freebase
text fields in an indexable way,  and (ii) mongodb crashes and corrupts
data.

As for the linear pipelines,  the big issue I have is that I want to process
facts as complete chunks;  everything needed for one particular bit of
processing needs to get routed to the right pipeline.  If it takes four
triples involving a bnode to represent a 'fact',  these all need to go to
the same physical node.  As in the database case,  partitioning of data
becomes a critical issue,  but it becomes even more acute here because the
partition a particular triple falls in might be determined by the graph that
surrounds it,  which kind of points to a representation where we (a) develop
some mechanism for efficiently representing subgraphs of related triples,
 or (b) just give up on the whole triple thing and use something like a
relational or JSON model to represent facts.
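
For a sense of what a multi-triple 'fact' looks like,  here is a sketch in
Turtle (the vocabulary is made up):

@prefix ex:  <http://example.org/ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:NewYorkCity ex:population [
    ex:value  8175133 ;
    ex:asOf   "2010"^^xsd:gYear ;
    ex:source ex:USCensus2010
] .

That's four triples,  three of them hanging off a bnode,  and they're only
useful if they all land on the same node of the pipeline.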

(3) I've spoken to entrepreneurs and potential customers of semantic
technology and found that,  right now,  people want things that are beyond
the state of the art.  Often when I consult with people,  I come to the
conclusion that they haven't found the boundaries of what they could
accomplish through plain old bag of words and that it's not so clear
they'll do better with NLP/semantic tech.  Commonly,  these people have
fundamental flaws in their business model (thinking that they can pay some
Turk $0.20 to do $2000 worth of marketing work.)  The most common
theme in semantic product companies is that they build complex systems
out of components that just barely work.

I'll single out Zemanta for this,  although this is true of many other
companies.  Let's just estimate that Zemanta's service has 5 components and
each of these is 85% accurate;  put those together (0.85^5 is roughly 44%
end to end),  and you've got a system that's just an embarrassment.  There
are multiple routes to solving this problem (either a widening of the scope
or a narrowing of the scope could help a lot),  but the fact is that a lot
of companies are aiming for that sour spot,  which has the paradoxical dual
effects that:  (i) some others imitate them,  and (ii) others write off the
whole semantic space.  Success in semantic technology is going to come from
companies that find fortuitous matches between what's possible and what can
be sold.

Another spectre that haunts the space is legacy information services
companies.  I've talked with many people who think they're going to make big
money selling a crappy product to undiscriminating customers with deep
pockets (U.S. Government,  Finance,  Pharma, ...)  I think the actual

Re: RDF and its discontents

2010-07-02 Thread Paul Houle
On Fri, Jul 2, 2010 at 11:20 AM, Henry Story henry.st...@gmail.com wrote:



 So similarly with RDF stores. Is it not feasible that one may come up with
 just in time
 storage mechanisms, where the triple store could start analysing how the
 data was used in
 order then to optimise the layout of the data on disk?  Perhaps it could
 end up being a
 lot more efficient than what a human DB engineer could do in that case.


That's a nice research project and it could be a very nice project if it's
perfected.  Salesforce.com has a patent on something that's pretty similar:

http://www.faqs.org/patents/app/20090276395

I attended a talk at Dreamforce last year where they described how their
system works.

To a developer,  salesforce.com offers something that looks a lot like a
relational database.

Their customers are spread out on about 10 distinct Oracle 10g clusters;
 each of these has a central fact table which is essentially a triple/quad
store.  Rows seen from the customer's perspective are actually atomized
into individual triples...  the core table,  however,  has additional tags
which identify each triple as belonging to a particular Salesforce.com
customer.  This way there might be 10,000-100,000 customers that share an
'instance' of the Salesforce.com system.

Now,  to supplement this,  Salesforce.com creates additional relational
tables in Oracle that speed up particular queries.  It uses automatic
profiling to decide when it's going to create these tables,  create indexes,
 etc.

It's pretty amazing to watch.  I've built a system that communicates with
Salesforce.com via the API.  The first time I run it against a salesforce
instance,  one of the queries it runs times out.  If I run it again
immediately,  it times out again.  If I come back in ten minutes,  it works
O.K.  because the system has analyzed my query and built the structures to
make the query efficient.

That said,  Salesforce.com is designed for OLTP applications and sucks for
analytical work.  You're only allowed to get information in limited size
chunks; until very recently there wasn't anything like GROUP BY.  More to
the point,  Salesforce.com charges about $1500/month/GB of storage.  This is
affordable for OLTP work,  but the semantic work I do involves so much data
that I couldn't possibly afford that.


NOT(:A owl:sameAs :A)

2010-05-13 Thread Paul Houle
  I've been doing a reconciliation project between DBpedia,  Freebase,
Open Street Maps and a few other sources and that's gotten me thinking about
the practical and philosophic implications of terms.  In particular,  I've
been concerned with the construction of 'Vernacular' meanings that may not
be perfectly precise but would seem normal to people.

  One thing I've realized is that no term is atomic,  and a corollary of
that is that A != A in commonsense reasoning.

  Openmind Commonsense,  for instance,  contains a number of English
sentences that make assertions about the term Oxygen,  such as

The earth's atmosphere contains Oxygen (obviously true)
The ocean contains Oxygen (true in two different senses:   there is
dissolved diatomic oxygen in the ocean that fish can breathe;  however,
most of the mass of the ocean is oxygen atoms that are part of water
molecules.  Still,  we can't breathe the oxygen in the ocean,  so I
couldn't berate somebody for answering 'no' to this question.)

   I felt uncomfortable confirming the truth of the above sentences,  but
it gets worse

The earth's crust contains Oxygen (certainly true in the sense of the
atom,  much of the mass of rock is oxides)
The moon contains Oxygen (same,  but 'everybody knows' the moon is a place
that's inhospitable for life because there's no 'Oxygen' there)

  Now the reason I'm uncomfortable with all this is that the term
'Oxygen' isn't atomic;  if we split it into 'Diatomic Oxygen',  'The Element
Oxygen',  and finer terms,  we can actually make assertions that aren't so
problematic.

  If we go to work seriously splitting up 'Oxygen',  there are the
senses of

'Oxygen as something essential to life' (How do you understand the Sweet
song Love is Like Oxygen?)
'Oxygen as a medical treatment' (It's got some unique identifier that's
provided to your health insurer to identify it as such)
A number of allotropes of Oxygen (I'd think that any worldly person ought to
know about O2 and O3,  but O2 actually has 'singlet' and 'triplet' forms,
and there are a few bizarre allotropes that have been made in the
laboratory.)
A number of isotopes of Oxygen

 If we stopped somewhere near there,  we'd have covered most of it,  but
scientists can split terms further.  For instance,  you could have O2 in the
triplet state with one 16O atom and one 18O atom.  I suppose physicists
could ask 'what if' the binding energies and masses of quarks were a bit
different and how that would affect stellar nucleosynthesis,  so somebody
could make assertions about the nuclear properties of oxygen isotopes.

Note that the specification of terms like that is a process of division
(creating a finer vocabulary) and then composition (using combinations of
terms to specify new terms.)
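
A sketch of what that division-then-composition might look like as triples
(every term here is hypothetical,  not from a published vocabulary):

@prefix ex: <http://example.org/ns#> .

ex:Oxygen_element  a ex:ChemicalElement .
ex:O2              ex:allotropeOf ex:Oxygen_element .
ex:O3              ex:allotropeOf ex:Oxygen_element .
ex:O2_singlet      ex:spinStateOf ex:O2 .
ex:Oxygen_18       ex:isotopeOf   ex:Oxygen_element .

The finer the terms get,  the less problematic the commonsense assertions
about them become.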

Thinking this way,  it's clear that owl:sameAs is something that ought
to be taken with a grain of salt.  In fact,  it's the least of the problems
that we face trying to 'make sense' of terms.



 Getting practical,  I'm working on a vernacular vocabulary for human
settlements.  You might think this is something you could get in a can,  but
you can't quite.  The real requirement here is that end users see terms that
match the vernacular language they use.  For instance,  you can't tell
people that 'Wisconsin' or 'Kanagawa' are '2nd level administrative
subdivisions',  but rather,  that 'Wisconsin' is a 'State' and 'Kanagawa' is
a 'Prefecture'.

 Just about anybody will tell you that

London is a city and Tokyo is a city

 but neither of those is legally true;  both of those are administrative
divisions larger than a city.  However,  if you make a list of major global
cities people are going to think you're nuts if you don't put them on the
list.  Lately I've had a lot of attraction for the Germanic concepts of
'Stadt' (avoids the need to make the arbitrary division between 'City' and
'Town') and 'Dorf';  I live in something that definitely isn't a 'Stadt',
maybe not even an Anglo-Saxon 'Town' by technical terms.  (It feels more
like an 'administrative division' that happens to maybe have 3-6 'Dorfs' in
it and a lot of public land,  scattered houses and farms.)  However,  I
write a tax check every year to the 'Town of Caroline' so I've got to say
it's a 'Town'.

 I'm finding a need to create a specific 'Vernacular' vocabulary layer
so my systems make sense to people:  so they can understand what they see on
the screen and so full-text search works right.  The main
operational definition is What do people commonly call it?  If I can't
find a specific term,  I fall back on defaults.
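
Concretely,  the vernacular layer is just another set of triples sitting next
to the precise one,  something like this (vern: is a hypothetical vocabulary):

@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix vern:    <http://example.org/vernacular#> .

dbpedia:Wisconsin            vern:commonlyCalled "State"@en .
dbpedia:Kanagawa_Prefecture  vern:commonlyCalled "Prefecture"@en .
dbpedia:Tokyo                vern:commonlyCalled "City"@en .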

 Systems like this suck at reasoning,  however,  because the terms are
all imprecise,  so for a lot of applications,  vernacular layers are going
to need to coexist with more specific terminology layers that have the right
technical properties for solving certain kinds of problems.


Re: Web Linking and @rev

2010-05-13 Thread Paul Houle
I love @rev...  It's a handy thing to have if you're trying to make
statements about the current document in the head area.


Re: What would you build with a web of data?

2010-04-09 Thread Paul Houle
On Fri, Apr 9, 2010 at 4:48 AM, Georgi Kobilarov georgi.kobila...@gmx.de wrote:


 Let's be creative about stuff we'd build with the web of data. Assume the
 Linked Data Web would be there already, what would build?




Lots of things:

(1) A 'smart' encyclopedia that can reformat Wikipedia (and other content)
for specific audiences/context.  For instance,  a children's encyclopedia,
 an encyclopedia about what people in Heian Japan could have plausibly
known about,  etc.

(2) Systems that add 'Xanalogical' (sorry T. Nelson) structure to text based
on text understanding.  For instance,  if I'm reading a text,  I want
something that can infer intertextuality and add 'footnotes' that clarify
any issue that I want clarified.  The text involved could be anything:
 tweets,  cheat sheets for video games (how exactly can I get that item?),
 scientific papers,  Shakespeare,  even parallel texts.  (I'd love to line
up an English translation of the Kojiki w/ the archaic original Japanese
and have tools that let somebody who barely understands Japanese [me] get a
lot out of it)

(3) A site like boxedup.com without all the stupid web 2.0 features that
never really worked...  I want to be able to just bookmark an item and have
the system extract good data about it...  and NOT ask me to fill out tags.

The book Pull makes a good case for how semantic technology enables
ambient computing...  I had a talk with an MIT Media Lab graduate years
ago who was fundamentally skeptical about the ability of computers to
understand context and do the kind of things that a good butler does.
 Ultimately a big 'commonsense' knowledge base is going to make 'impossible'
things happen.

http://thepowerofpull.com/pull/


Notes on RDFa For Turtles

2010-03-12 Thread Paul Houle
I’ve gotten some great feedback and spent some time looking at specs, and
here are some thoughts.

“RDFa For Turtles” is an RDFa subset (profile?) that makes it easy to
specify triples in the HEAD of an HTML document.   I intend to distill it
into a short “HOWTO” document that any webmaster can understand and
correctly apply.  Towards that goal, it fixes certain choice to improve
interoperability.  I'm interested in any feedback that can further improve
interoperability.

---

Now, it’s tempting to say that RDFa documents can be ‘duck typed’; that is,
try to extract some RDFa triples, and if you get some, it’s an RDFa
document.

I see one problem with that.  Superficially, it looks like RDFa should be
able to extract old-school


<link rel="{p}" href="{o}" />

elements from the headers of legacy HTML and XHTML documents.   (Similarly,
people are being told to add a <link rel="license" href="{o}" /> to today's
HTML documents.)

These constructions have the desired effect when the base of the document
is not specified;  however,  the use of <base href="{base}"> causes the RDFa
interpretation of these constructs to be different from that in legacy
systems.

To avoid this and other problems,  “RDFa For Turtles” documents use the
@about attribute of the html element to  explicitly specify the URL of the
current document.   “RDFa For Turtles” documents are not allowed to assert
triples about the current document unless the URI of that document is
specified explicitly.

It would be nice to have a reliable and simple way to make statements about
the current document for documents that are being written by hand,  but I
don’t think there is any,  at least not if the base element is in use.

---

When used in an XHTML document,  “RDFa For Turtles” uses the standard
DOCTYPE for XHTML+RDFa documents and is completely conformant with the
XHTML+RDFa specification.

“RDFa For Turtles” embedding in HTML will be based on the HTML 5 + RDFa
specification,

http://dev.w3.org/html5/rdfa/rdfa-module.html

but will not require the use of a specific DOCTYPE.   “RDFa For Turtles”
will follow changes in HTML5 + RDFa as the standard matures.

Conformant “RDFa for Turtles” HTML documents have the following
characteristics:
(i) One or more RDFa statements can be extracted from the head using the
HTML 5 + RDFa rules
(ii) No statements are made without an explicit @about;  the head element
cannot contain an @about attribute;  the html element can contain only an
@about attribute that points to the current document URI
(iii) RDFa statements specified in the head must use the restricted “RDFa
For Turtles” vocabulary,   however
(iv) Arbitrary RDFa statements are allowed elsewhere in the document,
subject only to rule (ii)

“RDFa For Turtles” uses xmlns notation to define CURIE prefixes, as do
current HTML+RDF specifications,  but will track the change if a new
mechanism is introduced.

---

Here’s the “RDFa For Turtles” vocabulary:  all of these go into the head

If @about is specified explicitly in the html element, we can write

<meta property="{predicate}" content="{object}" />
<meta property="{predicate}" content="{object}" />

to assert triples about the current document.  We can also use @datatype
here,  @lang in HTML documents,  and @xml:lang in XHTML documents.

We can also write

<link rel="{reserved_value}" href="{object}" />
<link rev="{reserved_value}" href="{subject}" />

And

<link rel="{curie_predicate}" resource="{object}" />
<link rev="{curie_predicate}" resource="{subject}" />

I think the differential use of href and resource here maximizes backwards
and forwards compatibility.  It is non-conformant to use legacy
link/@rel elements if @about is not set in the html element.  Also,

<link about="{any_subject}" rel="{reserved_value}" href="{object}" />
<link about="{any_subject}" rel="{reserved_value}" href="{object}" />

is disallowed because legacy clients could misinterpret it.  If we want to
assert predicates such as “alternate”, “cite” about documents that are not
the present document, we need to add a namespace declaration like

xmlns:xhv="http://www.w3.org/1999/xhtml/vocab#"

and then write the predicate as a CURIE:

<link about="{any_subject}" rel="xhv:{reserved_value}" resource="{object}" />

Of course, to assert triples about other things, we may add @about to link
and meta.  The shorthand form

<link about="{subject}" typeof="{object}" />

is equivalent to

<link about="{subject}" rel="rdf:type" resource="{object}" />

@about is required when using @typeof.

RDFa For Turtles allows (but discourages) the creation of blank nodes with
the safe CURIE syntax [_:suffix]

RDFa For Turtles supports the full range of possible syntax in @about,
@resource,  @rel, @rev and @property attributes (other than the
compatibility restrictions above.)  I’m planning,  however,  to split the
“RDFa For Turtles” spec to have a basic half that leaves out certain
features (@typeof,  use of Safe CURIEs, and space-separated CURIE/URI lists)
and an advanced half that adds a few dashes of syntactic sugar.


Re: RDF Serializations

2010-03-12 Thread Paul Houle
On Fri, Mar 12, 2010 at 5:23 AM, Nathan nat...@webr3.org wrote:


 What I'm gunning for in the end, is to only expose all linked data / rdf
 as static RDF+XML documents within my application - would this in any
 way make the data less linked because some clients don't support
 RDF+XML or could I take it for granted that everybody (for instance
 everybody on this list) could handle this serialization.

  At some point you've got to put your foot down and stop supporting new
output formats.  If I invent one tomorrow,  that doesn't mean you have to
support it.

 There are three standards that I see in use:

(i) RDF/XML for relatively small triple sets (triples about a subject) that
are published in the typical linked data style and are not embedded in
other documents.  RDF/XML is particularly used for bnode-heavy applications
such as OWL schemas.
(ii) RDFa for embedding triples in other documents,  again,  in the linked
data context where any individual document contains just a small fraction
of the data in the system
(iii) Turtle-family serializations for large whole system dumps,  such as
the dbpedia dumps

 RDF/XML isn't my favorite serialization,  but it really does seem to be
the most widespread;  I think all linked data systems are going to support
it for input,  and ought to support it for output,  unless they are going
the 'publish RDFa embedded in documents' route.

 I think the software complexity argument against RDF/XML is weak these
days,  because we've had a decade to get good RDF/XML parsers.  If you're
working in some mainstream language,  it's just something you can download
and run.

 My more serious beef with RDF/XML is pedagogical:  it's not a good way
to teach people RDF because it's not immediately obvious to the beginner
where exactly the triples are.  RDF data modelling is actually incredibly
simple,  but you wouldn't know that if you started with RDF/XML.  Turtle,
however,  helps you understand RDF at a triple-by-triple level...  Once
you've gotten some experience with that,  RDF/XML makes a lot more sense.


RDFa for Turtles

2010-03-10 Thread Paul Houle
I'm working on linked data output for

http://ny-pictures.com/

so I'm trying to come up with an RDFa Profile that's appropriate for my
application;  goals are (i) separation of presentation and content (I can
move my divs around and never end up losing an @about) and (ii) high
compatibility with existing web browsers and applications.

Last night I was researching the WHAT-WG HTML 5 standards for metadata and
was pretty terrified,  and came out with the conclusion that we really need
a simple way to assert RDF triples that anybody can use and that has little
impact on current server and client toolchains.

-

Restriction:  All RDFa statements will be asserted in the head of the
document,  using certain patterns.

The patterns that will be used are

<meta rel="{predicate}" content="{object}" />
<meta rel="{predicate}" content="{object}" datatype="{type_of_object}" />

when we want to assert non-link properties about the current document

<link rel="{predicate}" href="{object}" />

for asserting link properties.  Now,  if we want to assert a property with a
different subject,  we can just write

<meta about="{subject}" rel="{predicate}" content="{object}" />
<link about="{subject}" rel="{predicate}" href="{object}" />

There's also a nice bit of syntactic sugar

<link rev="{predicate}" href="{subject}" />

when we want to make an assertion about which the current document is the
object.

My understanding is that this is a correct subset of RDFa that makes it
possible to specify arbitrary triples.  Am I missing anything?

I was initially worried about the lack of the name attribute on the meta
tags,  but a close look at the HTML 4 spec shows that name is not a
required attribute.  I don't believe that this syntax is going to cause any
problems for mainstream software.

--

So far I've been sticking to the letter of HTML and RDFa specs,  or at least
trying to.  Now I'm going to add

Requirement: Ability to embed RDFa in a document that is not,  globally,  a
valid XHTML document

There are two major reasons for this.  (A) The ability to create the web
experiences we want with standards-compliant markup is something that web
developers have long suffered over.  Finally,  if you get your DOCTYPEs
right,  you can get fairly consistent rendering across the five major web
browsers.  If you use XHTML,  you get less consistent behavior from current
web browsers.

(B) Most conventional web toolchains (PHP,  JSP,  Cold Fusion,  Ruby on
Rails,  etc.) produce HTML by concatenating fragments of text.  In that
context,  it's almost impossible to verify that a dynamic web site is going
to produce valid XHTML under all circumstances.  Now,  you can build your
documents with real XML tools,  but that leads to a serious intermingling of
presentation and content:  it's hard to have a designer design a template in
Dreamweaver,  and it could require a rookie programmer hours to figure out
how to change a simple bit of text that is output by the application.
ASP.NET actually does have a reasonable answer,  but it's pretty hard to get
anything else right if you use ASP.NET.


RDFa for Turtles 2: HTML embedding

2010-03-10 Thread Paul Houle
Specific proposal for RDFa embedding in HTML


Ok,  here's a strategy for embedding RDFa metadata in HTML document heads
-- make the head of the document be a valid XHTML fragment.

Here,  now,  I'm going to write something like

<head xmlns="http://www.w3.org/1999/xhtml"
      xmlns:dcterms="http://purl.org/dc/terms/">
  <meta rel="dcterms:creator" content="Ataru Morobishi" />
</head>

Because the content of the meta area is so simple,  compared to other
parts of an html document,  I feel comfortable publishing a valid XHTML
fragment for the head.  My understanding is that the namespace declarations
will just be ignored by ordinary HTML tools (as they are in
backwards-compatible XHTML documents) so there's really no problem here.
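
For what it's worth,  the single triple that head is intended to carry comes
out as (the subject is whatever the document's URL happens to be,  shown here
with a placeholder;  strictly speaking a literal object wants @property rather
than @rel,  which the next message gets into):

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.com/some-page> dcterms:creator "Ataru Morobishi" .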

This does bend the XHTML/RDFa standard and also HTML a little (those
namespace declarations aren't technically valid) but I think we get a big
gain (even a Turtle-head can embed triples in an HTML document) for very
little pain.

Any thoughts?


Re: RDFa for Turtles 2: HTML embedding

2010-03-10 Thread Paul Houle
Melvin Carvalho:

  This seems sensible.  I did have the idea of dumping a whole bunch of RDFa
 triples in the footer, and setting visibility to zero, but if you can do it
 safely in the head, problem solved!   We're back to the old days of putting
 meta data in the head of a document.

 This works for the rel attribute, but what about for property?

 One nice thing about RDF is that it's a set, so if you put a full dump of
 triples in one area, even if there's a dup somewhere in your markup, a
 parser, should remove duplicates.


Here's my take.  RDFa gives us a lot of choices.  Embedding in the HEAD,
embedding in a display:none DIV and closer embedding to the document text
are all choices somebody might take.  I want to develop a profile that works
with the widest range of RDFa tools,  that fits certain engineering
decisions,  and that is easy for document authors to understand.  I think
RDFa Lite formats will be a good stepping stone towards building the RDFa
ecology.

Practically,  my systems have long had a separate module that manages the
HEAD,  so that reusable modules with visual appearances can add style sheets
and Javascript includes to the document.  I've recently added a little
triple store to the HEAD manager so that it accumulates triples as the
document is rendered.  I'm sure there are other great ways to add RDF output
to an app,  but this approach separates presentation and content.

Anyway,  looking at your comment and at the docs,  I'd conclude that it
should be

#1 <meta property="{predicate}" content="{object}" />

I'm seriously conflicted about another possibility,

#2 <link rel="{predicate}" resource="{object}" />

I like @resource better than I like @href here,  because @resource lets
us use a CURIE instead of a regular URI if we like.  This is very sweet
syntactic sugar,  but the cost is that existing HTML clients already know
what @href means...  Use of this isn't too heavy,  but I'd hate to
break RDF autodiscovery and the apps that use it.
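
The practical difference between #1 and #2 is whether the object of the triple
ends up as a literal or a resource;  roughly (the URLs are placeholders):

@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.com/page> dcterms:creator "Ataru Morobishi" .                   # from #1 (a literal)
<http://example.com/page> dcterms:creator <http://example.com/people/ataru> .   # from #2 (a resource)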

Towards that goal,  I think I'd just tell people to use href and write out
the whole URLs;  I've got a strong feeling that I want to give people one
way to do it,  because I think this improves the odds that (i) people read
the docs,  (ii) follow the instructions,  and (iii) do so successfully.


DC-HTML as a linked data format

2010-03-08 Thread Paul Houle
 Hello,  I was talking to Kingsley the other day about lightweight
mechanisms for adding linked data to existing HTML pages.  I started adding
some geotags and some statements from the Dublin Core vocabulary,  using the
DC-HTML format:

http://dublincore.org/documents/dc-html/

 I was pretty impressed with the ease of adding simple statements,  as seen
in a page for a topic:

http://ny-pictures.com/nyc/photo/topic/3461/Empire_State_Building

or in a page for an individual photograph

http://ny-pictures.com/nyc/photo/picture/29218/empire_seen_top_rockefeller_smoggy

 I see that DC-HTML has some serious drawbacks (you can only make
statements about the page that you're on) but it seems much more practical
than RDFa on today's web.  RDFa is certainly more capable,  but I've got a
number of issues with it:  most seriously,  I don't want to commit to XHTML
output,  because it's technically difficult to build web systems that always
output valid XHTML.  (I suppose I could generate everything from the XML
DOM,  but then I can't ever hire a web designer to make my templates;  for
that matter,  just changing something on one of the pages could go from
being a 5 minute job to a 5 hour job for a programmer.  ASP.NET offers a
pretty good answer to this problem,  but it comes with a lot of baggage that
I'd rather avoid.)  XHTML has rendering issues on today's web browsers,  and
also,  I'm pretty concerned about interoperability:  because the RDFa spec
offers web developers so many choices,  I think they'll often make the wrong
ones.

Here are some questions I'd like to see answered.

(i) Are there useful online validators for DC-HTML and RDFa?  [Validation
can mean different things:  on one hand,  there's "view the triples asserted
here",  but I'd also like some help knowing that I'm using standard (or well
known) predicates and that I'm formatting my data correctly.  (For
instance,  is my XSD#Date formatted the right way,  are my geo coordinates
correctly formatted,  etc.)]
(ii) What's the 'standards' basis for DC-HTML?  Pretty clearly it's
recommended by the DCMI,  but because it's about asserting RDF triples,  I
could see it coming under the scope of the W3C.
(iii) What are the practical statuses of DC-HTML and RDFa support in the
wild?


Re: Context Tags, Context Sets and Beyond Named Graphs...

2010-01-19 Thread Paul Houle
On Mon, Jan 18, 2010 at 2:20 PM, Leigh Dodds leigh.do...@talis.com wrote:


 Looks to me like you need Named Graphs plus a mechanism to describe
 combinations of graphs.


Exactly!

That for me was what I liked about the idea:  having a mechanism to do the
things I want that builds on all the work people are doing w/ Named Graphs.



 ...and these as more Named Graphs, or at least graphs that are derived
 from those in the underlying data store. I tend to refer to these as
 synthetic graphs. Most SPARQL implementations have the concept of at
 least one synthetic graph: the union of all Named Graphs in the
 system. But as I alluded to in a recent posting [1], there are many
 other ways that these graphs could be derived. Rather than building
 them into the implementation, they could be described and using a
 simple domain specific language. So I think Named Graphs plus graph
 algebra gives you much of what you want.

 Cheers,

 L.

 [1]. http://www.ldodds.com/blog/2009/11/managing-rdf-using-named-graphs/


That's a nice link.

I like the term graph algebra,  because that really is what I'm talking
about.  It's pretty clear that an almost unlimited number of synthetic
graphs are possible:  for instance,  if there's a SPARQL query that
generates a graph,  that could define a named graph which is a lot like a
view in SQL.  In fact,  I could see this being computed on the fly,  or
being materialized,  like a temporary table in SQL.

Specifically,  however,  I need the ability to stick named graph tags
cheaply on items in a local RDF store (specify that a triple is in 10 named
graphs w/o copying it 10 times),  and to be able to efficiently do graph
algebra involving unions and intersections of graphs defined by those tags.
 I'm thinking about using this on triple stores with between 1 billion and
100 billion triples.  On the low end I expect to be able to do it with a
single computer and commodity hardware,  but I'll accept having to use some
kind of cluster to handle stuff on the high end of this range.  Optimization
and good index structures would be essential to this.
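
To make that concrete,  the intersection of two tag-graphs is already
expressible as plain SPARQL over named graphs (the graph URIs are
placeholders for tag graphs):

SELECT ?s ?p ?o WHERE {
  GRAPH <http://example.org/tag/UsedInProjectX>   { ?s ?p ?o }
  GRAPH <http://example.org/tag/VerifiedToBeTrue> { ?s ?p ?o }
}

The union is the same two GRAPH patterns joined with UNION;  the point of the
tag machinery is to make queries like this cheap at billion-triple scale,
not to add expressive power.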

Beyond that,  it's pretty exciting to explore what's possible with
synthetic graphs (on the side of the software stack facing the user) and
with named graph tags on the inside. For instance,  synthetic graphs
could specify what sort of inference is used to extend the graph:  much as
early versions of Cyc had multiple get() functions,  we could have some
synthetic graphs with practically no inference capability,  and other ones
that go to extremes (analogical reasoning,  CWA) to answer questions.

 For instance,  I think physical partitioning is going to become extremely
important for large-scale RDF systems:  named graph tags would be an
effective mechanism to route triples to specialized storage mechanisms:  for
instance,  I might want to route 20,000 upper ontology triples to a
specialized in-RAM engine that does expensive inference operations,  route a
web link graph with 5 billion triples to a specialized storage engine that
does extreme compression,  etc.


Context Tags, Context Sets and Beyond Named Graphs...

2010-01-18 Thread Paul Houle
For a while I've been struggling with a number of practical problems working
in RDF.  Some of these are addressed by Named Graphs as they currently exist,
but others aren't.

Over the weekend I had an idea for something that I think is highly
expressive but also can be implemented efficiently.

The idea is that the context of a triple can be,  not a name,  but a
collection of tags that work like tags on delicious,  flickr,  etc.  Tags
are going to be namespaced like RDF properties,  of course,  but they could
have meanings like:

#ImportedFromDBpedia3.3
#StoredInPhysicalPartition7
#ConfidentialSecurityLevel
#NotTrue
#InTheStarTrekUniverse
#UsedInProjectX
#UsedInProjectY
#VerifiedToBeTrue
#HypothesisToBeTested

Individually I call these Context Tags,  and the set of them that is
associated with a triple is a Context Set.

Now,  named graphs can be composed from boolean combinations of tags,  such
as

AND(#ImportedFromDbPedia3.3,#InTheStarTrekUniverse)

NOT(#NotTrue)

AND(NOT(#ConfidentialSecurityLevel),OR(#UsedInProjectX,#UsedInProjectY))

===

Note that this is a feature of the underlying storage layer that can be
exploited by layers of the RDF store that are above:  for instance,
 something just above the storage layer could hash the subject URL and then
pass a physical partition tag to the actual storage layer.  Similarly,
 security rules could be applied automatically.

===

There are many details to be filled in and features that could be added.
 For instance,  I could imagine that it might be useful to allow multiple
context sets to be attached to a triple,  for instance,  if you had the
triple

:WarpDrive :isA :SpacePropulsionDevice

it might be desirable to assert this as

#RealWorld #NotTrue

and to also assert that it is

#InTheStarTrekUniverse

I've thought through the implications of this less,  however.


The immediate use for this system that I see is that existing query
mechanisms for named graphs can be applied to the computed graphs.
 What inference mechanisms could do with context tags beyond that is a wide
open question.

===

Anyhow,  I want this pretty bad.  (i) If you're selling this,  I'm buying;
 (ii) if you want to build this,  contact me.


Re: Ontology Wars? Concerned

2009-11-23 Thread Paul Houle
On Wed, Nov 18, 2009 at 2:23 PM, Nathan nat...@webr3.org wrote:


 I'm finding the path to entry in to the linked open data world rather
 difficult and confusing, and only for one specific reason - ontologies;
 it /feels/ like there are some kind of ontology wars going on and I can
 never get a definitive clear answer.


An ontology war is preferable to the alternative:  the one ring that rules
them all.

If you're trying to develop an ontology for topic X,  it's usually easy to
make one that's good but obviously not perfect:  let's say, 95% correct.

You need to cross an uncanny valley in the attempt to go from 95% to 100%,
 and often things get worse rather than better.  This is one of the reasons
why Cyc is perceived as a failure:  although it was trying to model the
common sense knowledge that we all share,  the actual structures in Cyc
that try to represent everything in a consistent way are bizarre,
 counterintuitive and certainly not representative of how people think,  no
matter how correct they may be.

People don't have a completely consistent taxonomy of the world either;
 they have models of different parts of reality that they'll mesh when they
need to mesh them.  My 94% correct version of topic X might be great for
what I'm doing w/ topic X and your 96% version is great for what you're
doing.  Trying to build one system that's perfect might result in something
that's not as good for what we're doing...  But in the long term we do need
tools that let us mesh these easily.

SPARQL + OWL can take us part of the way in that direction,  but really,  we
need something better,  largely because of the many
"almost the same as" relationships that are out there...


Re: Alternatives to OWL for linked data?

2009-07-24 Thread Paul Houle
On Fri, Jul 24, 2009 at 9:30 AM, Axel Rauschmayer a...@rauschma.de wrote:


 While it's not necessarily easier to understand for end users, I've always
 found Prolog easy to understand, where OWL is more of a challenge.

 So what solutions are out there? I would prefer description logic
 programming to OWL. Does Prolog-like backward-chaining make sense for RDF?
 If so, how would it be combined with SPARQL; or would it replace it? Or
 maybe something frame-based?

 Am I making sense? I would appreciate any pointers, hints and insights.



 I've got some projects in the pipe that are primarily based on Dbpedia
and Freebase,  but I'm incorporating data from other sources as well.  The
core of this is a system called Isidore which is a specialized system for
handling generic databases.
 My viewpoint is that there are certain kinds of reasoning that are best
done in a specialized way;  for instance,  the handling of identities,
 names and categories (category here includes the Dbpedia ontology and
Freebase types as well as internally generated categories).  For instance,  a common task
is looking up an object by name.  Last time I looked,  there were about 10k
Wikipedia articles that had names that differed only by capitalization;
 most of the time you want name-lookups to be case-insensitive,  but you
still want addressability for the strange cases.

Wikipedia also has a treasure trove of information about disambiguation.
 The projects I do are about specific problem domains,  say animals,  cars,
 or video games:  I can easily qualify a search for Jaguar against a
problem domain and get the right dbpedia resource.
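
A sketch of what qualifying the lookup against a problem domain looks like
(ex:inProblemDomain and ex:Cars are made-up stand-ins for however the system
actually marks domain membership):

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ex:   <http://example.org/ns#>
SELECT ?thing WHERE {
  ?thing rdfs:label ?label .
  ?thing ex:inProblemDomain ex:Cars .
  FILTER(regex(str(?label), "^jaguar$", "i"))
}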

The core of identity,  naming and category information is small:  it's
easy to handle and easy to construct from Dbpedia and Freebase dumps.  From
the core it's possible to identify a problem domain and import data from
Dbpedia,  Freebase and other sources to construct a working database.

---

You might say that this is too specialized,  but this is the way the
brain works.  It's got specific modules for understanding particular problem
domains (faces,  people,  space,  etc.)  It's not so bad because the number
of modules that you need is finite.  Persons and Places represent a large
fraction of Dbpedia,  so reasoning about people and GIS can get you a lot of
mileage.  Freebase has a particularly rich collection of data about musical
recordings and

I'm not sure if systems like OWL,  etc. are really the answer -- we might
need something more like Cyc (or our own brain) that has a lot of specialized
knowledge about the world embedded in it.



I see reification as an absolute requirement.  Underlying this is the
fact that generic databases are full of junk.  I'm attracted to Prolog-like
systems (Datalog?) but conventional logic systems are easily killed by
contradictory information.  This becomes a scalability limitation unless
you've got a system that is naturally robust to junk data.  You've also got
to be able to do conventional filtering:  you've got to be able to say
"Triple A is wrong",  "I don't trust triples from source B",  "Source C uses
predicate D incorrectly",  "Don't believe anything that E says about subject
F".  To deal with the (existing and emerging) semspam threat,  we'll also
need the same kind of probabilistic filtering that's used for e-mail and
blog comments.  (Take a look at the external links table in Dbpedia if you
don't believe me.)

The biggest challenge I see in generic databases is fiction.  Wikipedia
has a shocking amount of information about fiction:  this is both an
opportunity and a danger.  For one thing,  people love fiction -- a G.P.A.I.
certainly needs to be able to appreciate fiction in order to appreciate the
human experience.  On the other hand,  any system that does reasoning about
physics needs to tell the difference between

http://en.wikipedia.org/wiki/Minkowski_space

and

http://en.wikipedia.org/wiki/Minovsky_Physics#Minovsky_Physics

Also,  really it's all fiction when it comes down to it.  When a robocop
shows up at the scene of a fight,  it's going to hear contradictory stories
about who punched who first.  It's got to be able to listen to contradictory
stories and keep them apart,  and not fall apart like a computer from a bad
sci-fi movie.

---

Microtheories?  Nonmonotonic logic?

Perhaps.

You can go ahead and write standards and write papers about systems that
ignore the problems above,  but you're not going to make systems that work,
 on an engineering basis,  unless you confront them.