Guys,
the Web of Data cannot rely on mass crawling of the whole Web; it must combine cached data with federated on-demand queries. Structured data requires much faster update cycles than typical text-based Web indices. For example, Google and Yahoo can rely on the fact that "http://www.cnn.com" is relevant for "news". That will not change within minutes, and both Google and Yahoo may need up to several weeks to visit your page again.

When it comes to structured price and availability information, your data may become outdated within hours, if not seconds. Think of eBay auctions, hotel or flight availability, etc.

So it will boil down to technology that combines (1) crawling and caching rather stable data sets with (2) distributing queries and parts of queries among the right SPARQL endpoints (whatever actual DB technology they expose).
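
To make the combination of (1) and (2) concrete, here is a minimal Python sketch. Everything in it is an assumption for illustration: the class name HybridStore, the two volatility classes, the lifetimes, and the fetch_live callable, which stands in for a query sent to the source endpoint. Stable data is served from a cache until it ages out; volatile data is always fetched on demand.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple

# Hypothetical volatility classes with illustrative lifetimes in seconds.
# "stable" data may be cached for a week; "volatile" data is never cached.
MAX_AGE = {"stable": 7 * 24 * 3600, "volatile": 0}


@dataclass
class HybridStore:
    # Stand-in for an on-demand query against the endpoint that owns the data.
    fetch_live: Callable[[str], str]
    # Injectable clock so the behaviour can be checked without waiting.
    clock: Callable[[], float] = time.time
    _cache: Dict[str, Tuple[float, str]] = field(default_factory=dict)

    def get(self, key: str, volatility: str) -> str:
        max_age = MAX_AGE[volatility]
        hit = self._cache.get(key)
        # Serve from the cache only while the entry is younger than max_age.
        if hit is not None and self.clock() - hit[0] < max_age:
            return hit[1]
        value = self.fetch_live(key)
        # Only data that is allowed to age gets stored at all.
        if max_age > 0:
            self._cache[key] = (self.clock(), value)
        return value
```

With such a split, a product catalogue would be requested as "stable" and an auction price as "volatile"; only the latter costs a round trip on every request.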

You can keep a text index of the whole Web if crawling cycles on the order of weeks are fine. For structured, linked data that exposes dynamic database content, "dumb" crawling and caching will not scale.

If the DB technology is able to involve the right set of endpoints for parts of the query, why would you need a complete replication of all databases in the world inside one huge repository?

That repository will be a million-node cluster anyway. Why not directly use the millions of nodes that provide the data and cache just the endpoint meta-data?
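
As a rough illustration of caching only endpoint meta-data, the Python sketch below keeps a small table of which vocabulary namespaces each endpoint can answer and assigns every triple pattern of a query to the matching endpoints. The two example.org endpoint URLs, the table itself, and the function name route_patterns are hypothetical; real source selection would need richer statistics than a namespace prefix.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

GR = "http://purl.org/goodrelations/v1#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hypothetical cached meta-data: endpoint URL -> namespaces it holds data for.
ENDPOINT_METADATA: Dict[str, List[str]] = {
    "http://shop.example.org/sparql": [GR],
    "http://people.example.org/sparql": [FOAF],
}


def route_patterns(
    patterns: List[Triple],
    metadata: Dict[str, List[str]] = ENDPOINT_METADATA,
) -> Dict[str, List[Triple]]:
    """Group triple patterns by the endpoints whose namespaces match the predicate."""
    plan: Dict[str, List[Triple]] = defaultdict(list)
    for pattern in patterns:
        predicate = pattern[1]
        targets = [
            endpoint
            for endpoint, namespaces in metadata.items()
            if any(predicate.startswith(ns) for ns in namespaces)
        ]
        if not targets:
            raise LookupError(f"no known endpoint for predicate {predicate}")
        for endpoint in targets:
            plan[endpoint].append(pattern)
    return dict(plan)
```

The resulting plan only says where each part of the query should go; sending the sub-queries and joining their results is left out of this sketch.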

Martin



Giovanni Tummarello wrote:
With respect to crawling and "scraping" or "sponging" or ... "trying to
guess" based on partial fragments of structured information, I can say
three things:

a) No, we're not doing it at the moment; we are only covering those
who chose to publish structured semantics. Some book data shows up in
Sig.ma, e.g. http://sig.ma/search?q=frank+van+harmelen&sources=100
(bookfinder, our jerome digital library installation), but the triples
they provide are scarce and don't contribute much. It would take so
little for this to improve on their side, I believe.

b) No, we are not religious about this. We have talked about it
several times; it might make sense to try to understand as much of the
web as possible and index it. Maybe we'll do it in the future for
selected fractions of the web to show how it looks.

c) Crawling should be just one means of acquiring the semantic web. In
the case of BestBuy or other large retailers, where prices may change
every day, crawling as a means to emulate a simple call to a web
service really does not seem the smart thing to do. Will data providers
really support this with data dumps?

cheers
Giovanni


On Sat, Oct 17, 2009 at 3:32 PM, Juan Sequeda <[email protected]> wrote:
But Sindice could at least crawl Amazon.
It would be great to use sig.ma to create a "meshup" with the amazon data.


Juan Sequeda, Ph.D Student
Dept. of Computer Sciences
The University of Texas at Austin
www.juansequeda.com
www.semanticwebaustin.org


On Sat, Oct 17, 2009 at 9:28 AM, Martin Hepp (UniBW)
<[email protected]> wrote:
I don't think so, because this would require that Sindice crawled the
whole regular web and checked the Spongers for each URL (sic!).

Juan Sequeda wrote:

Does Sindice (or any other semantic web search engine) crawl this?


On Sat, Oct 17, 2009 at 4:24 AM, Martin Hepp (UniBW) <
[email protected]> wrote:



Dear all:

I just found out that the Virtuoso Sponger technology is even more
powerful than I thought.

Briefly: "Spongers" create rich GoodRelations (and other RDF) meta-data
for existing Web pages on the fly. Unlike traditional
screen-scraping approaches, Spongers reuse public APIs and other
techniques, so the data has an unprecedented degree of structure.

Now, this can be directly used in arbitrary queries... by simply using
the URI of the *existing* HTML Web page in the FROM clause of a SPARQL
query.

Example:

   http://www.amazon.com/Semantic-Web-Real-World-Applications-Industry/dp/0387485309

is a Web page in plain HTML offering a book. Amazon does not yet produce
GoodRelations meta-data on their pages.

If you go to

   http://uriburner.com/sparql

and paste the URI in the "Default Graph URI" field and select "Retrieve
remote RDF for all missing source graphs", then a query like

  "SELECT * WHERE {?s ?p ?o} LIMIT 50"

returns a fully-fledged GoodRelations description for that page, as if
Amazon were already supporting GoodRelations for each of its more than
4 million items!
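
For those who prefer a script over the Web form, here is a minimal Python sketch that puts the page URI into the FROM clause and encodes the query as a GET request URL using the standard "query" parameter of the SPARQL protocol. The function names are made up for illustration, and nothing is sent over the network. The form option "Retrieve remote RDF for all missing source graphs" is not reproduced here; whether the endpoint sponges the page without it is an open assumption.

```python
from urllib.parse import urlencode

# Endpoint and page URI as given in this message.
ENDPOINT = "http://uriburner.com/sparql"
PAGE = "http://www.amazon.com/Semantic-Web-Real-World-Applications-Industry/dp/0387485309"


def build_query(page_uri: str, limit: int = 50) -> str:
    """Return a SPARQL query that names the HTML page as the graph in FROM."""
    return f"SELECT * FROM <{page_uri}> WHERE {{ ?s ?p ?o }} LIMIT {limit}"


def build_request_url(endpoint: str, page_uri: str, limit: int = 50) -> str:
    """Encode the query into a GET URL via the protocol's 'query' parameter."""
    return endpoint + "?" + urlencode({"query": build_query(page_uri, limit)})


if __name__ == "__main__":
    # Prints the URL only; fetching it is left to the reader.
    print(build_request_url(ENDPOINT, PAGE))
```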

There are spongers for BestBuy, eBay, Zillow, and many other types of
resources.

Wow!

Congrats to Kingsley and his team!

Best wishes

Martin Hepp

--
--------------------------------------------------------------
martin hepp
e-business & web science research group
universitaet der bundeswehr muenchen

e-mail:  [email protected]
phone:   +49-(0)89-6004-4217
fax:     +49-(0)89-6004-4620
www:     http://www.unibw.de/ebusiness/ (group)
        http://www.heppnetz.de/ (personal)
skype:   mfhepp
twitter: mfhepp

Check out GoodRelations for E-Commerce on the Web of Linked Data!
=================================================================

Webcast:
http://www.heppnetz.de/projects/goodrelations/webcast/

Recipe for Yahoo SearchMonkey:
http://www.ebusiness-unibw.org/wiki/GoodRelations_and_Yahoo_SearchMonkey

Talk at the Semantic Technology Conference 2009:
"Semantic Web-based E-Commerce: The GoodRelations Ontology"
http://www.slideshare.net/mhepp/semantic-webbased-ecommerce-the-goodrelations-ontology-1535287

Overview article on Semantic Universe:
http://www.semanticuniverse.com/articles-semantic-web-based-e-commerce-webmasters-get-ready.html

Project page:
http://purl.org/goodrelations/

Resources for developers:
http://www.ebusiness-unibw.org/wiki/GoodRelations

Tutorial materials:
CEC'09 2009 Tutorial: The Web of Data for E-Commerce: A Hands-on
Introduction to the GoodRelations Ontology, RDFa, and Yahoo! SearchMonkey
http://www.ebusiness-unibw.org/wiki/Web_of_Data_for_E-Commerce_Tutorial_IEEE_CEC%2709









