Summary:

I believe that there are use cases for RDFa - and that they are precisely the sort of thing that Yahoo, Google, Ask, and their ilk are not going to be interested in, since they are based on solving problems that those search engines do not solve efficiently, such as (among others) using private data, or dealing with trustworthy data to answer very specific questions automatically.

If Ian needs to understand the Semantic Web industry and why people have invested in the RDFa proposal, then it is important to identify the right questions; having him alone identify the sub-questions, when he doesn't understand the issue, isn't going to help him make a well-informed decision.

Some of Ian's questions are discussed here. I cut the mail "short" since I think it is already too long for many people, which means that the debate will simply pass without their reading or input.

On Wed, 31 Dec 2008 20:46:01 +1100, Ian Hickson <i...@hixie.ch> wrote:

One of the outstanding issues for HTML5 is the question of whether HTML5
should solve the problem that RDFa solves, e.g. by embedding RDFa
...
Before I can determine whether we should solve this problem, and before I
can evaluate proposals for solving this problem, I need to learn what the
problem is.

Earlier this year, there was a thread on RDFa on the WHATWG list. Very
little of the thread focused on describing the problem. This e-mail is an
attempt to work out what the problem is based on that feedback, on
discussions at the recent TPAC, and on other research I have done.


On Mon, 25 Aug 2008, Manu Sporny wrote:
Ian Hickson wrote:
> I have no idea what problem RDFa is trying to solve. I have no idea
> what the requirements are.

Web browsers currently do not understand the meaning behind human
statements or concepts on a web page. If web browsers could understand
that a particular page was describing a piece of music, a movie, an
event, a person or a product, the browser could then help the user find
more information about the particular item in question. It would help
automate the browsing experience. Not only would the browsing experience
be improved, but search engine indexing quality would be better due to a
spider's ability to understand the data on the page with more accuracy.

Let's see if I can rephrase that in terms of requirements.

* Web browsers should be able to help users find information related to
  the items that the page they are looking at discusses.

* Search engines should be able to determine the contents of pages with
  more accuracy than today.

Is that right?

Are those the only requirements/problems that RDFa is attempting to
address? If not, what other requirements are there?

I don't think so. I think there are some other requirements:

A standard way to include arbitrary data in a web page and extract it for machine processing, without publishers and consumers having to pre-coordinate their data models.

Since many people use RDF as an interchange, storage and processing format for this kind of data (because it provides for automated mapping of data from one schema to many others, without requiring anyone to touch the original schemata or agree in advance how they should be created), I believe there is a requirement for a method that allows third parties to include RDF data in an HTML page, and to extract it from the information encoded there.
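As a rough illustration of why an explicit triple model makes this possible (a minimal sketch in pure Python, with hypothetical URIs and data; real deployments would use an RDF toolkit rather than raw tuples):

```python
# Each fact is a (subject, predicate, object) triple. Predicates are
# URIs, so two sites can publish data independently and a consumer can
# merge the sets without any prior agreement on a shared schema.

site_a = {
    ("http://a.example/events/1", "http://purl.org/dc/terms/title", "Concert"),
    ("http://a.example/events/1", "http://a.example/vocab#city", "Oslo"),
}

site_b = {
    ("http://a.example/events/1", "http://b.example/vocab#price", "150 NOK"),
}

# Merging is just set union: no column renaming, no schema migration.
merged = site_a | site_b

facts_about_event = {(p, o) for (s, p, o) in merged
                     if s == "http://a.example/events/1"}
print(len(facts_about_event))  # prints 3: all statements describe one resource
```

The point of the sketch is only that the merge step requires no knowledge of either site's vocabulary; the identifiers themselves carry the coordination.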

The Microformats community has done a remarkable job of working on the
web semantics problem, creating several different methods of expressing
common human concepts (contact information (hCard), events (hCalendar),
and audio recordings (hAudio)).

Right; with Microformats, each Microformat has its own problem space and
thus each one can be evaluated separately. It is much harder to evaluate
something when the problem space is as generic as it appears RDFa's is.

The point is that there is a very large set of very small problem spaces, each relevant to a small group at a time. Like RDF itself, RDFa meets the problem of allowing these people to share machine-processable data without previously coordinating their approach.

The results of the first set of Microformats efforts were some pretty
cool applications, like the following one demonstrating how a web
browser could forward event information from your PC web browser to your
phone via Bluetooth:

http://www.youtube.com/watch?v=azoNnLoJi-4

It's a technically very interesting application. What has the adoption
rate been like? How does it compare to other solutions to the problem,
like CalDav, iCal, or Microsoft Exchange? Do people publish calendar
events much? There are a lot of Web-based calendar systems, like MobileMe
or WebCalendar. Do people expose data on their Web page that can be used
to import calendar data to these systems?

In some cases this data is indeed exposed on Web pages. However, anecdotal evidence (which unfortunately is all that is available when trying to study the enormous collections of data in private intranets) suggests that this is significantly more valuable when it can be done within a restricted-access website.

...
In short, RDFa addresses the problem of a lack of a standardized
semantics expression mechanism in HTML family languages.

A standardized semantics expression mechanism is a solution. The lack of a solution isn't a problem description. What's the problem that a
standardized semantics expression mechanism solves?

There are many many small problems involving encoding arbitrary data in pages - apparently at least enough to convince you that the data-* attributes are worth incorporating.

There are many cases where being able to extract that data with a simple toolkit from someone else's content, or using someone else's toolkit without having to tell them about your data model, solves a local problem. The data-* attributes, because they do not represent a formal model that can be manipulated, are insufficient to enable sharing of tools which can extract arbitrary modelled data.
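A tiny contrast (pure Python, invented attribute names) of why data-* alone does not support shared tooling: the attribute names carry no model, so a generic extractor can collect the values but cannot relate them across sites.

```python
# Site A and Site B both publish a price, but under ad-hoc data-* names.
site_a_attrs = {"data-price": "150", "data-currency": "NOK"}
site_b_attrs = {"data-cost": "150", "data-curr": "NOK"}

# A generic tool can harvest the key/value pairs...
harvested = {**site_a_attrs, **site_b_attrs}

# ...but nothing in the data says data-price and data-cost mean the same
# thing; that equivalence has to be hard-coded per site, which is exactly
# the pre-coordination an explicit, shared model is meant to avoid.
print(sorted(harvested))
# prints ['data-cost', 'data-curr', 'data-currency', 'data-price']
```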

RDF, in particular, also provides established ways of merging data encoded in different existing schemata.

There are many cases where people build their own dataset and queries to solve a local problem. As an example, Opera is not interested in asking Google to index data related to internal developer documents and use it to produce further documentation we need. However, we do automatically extract various kinds of data from internal documents and re-use it. While Opera does not in fact use the RDF toolstack for that process, many other large companies and organisations do, and they would benefit from being able to use RDFa in that process.

RDFa not only enables the use cases described in the videos listed
above, but all use cases that struggle with enabling web browsers and
web spiders to understand the context of the current page.

It would be helpful if we could list these use cases clearly and in detail so that we could evaluate the solutions proposed against them.

Here's a list of the use cases and requirements so far in this e-mail:

* Web browsers should be able to help users find information related to
  the items that the page they are looking at discusses.

* Search engines should be able to determine the contents of pages with
  more accuracy than today.

* Exposing calendar events so that users can add those events to their
  calendaring systems.

* Exposing music samples on a page so that a user can listen to all the
  samples.

* Getting data out of poorly written Web pages, so that the user can find
  more information about the page's contents.

* Finding more information about a movie when looking at a page about the
  movie, when the page contains detailed data about the movie.

Can we list some more use cases?


Here are some other questions that I would like the answers to so that I
can better understand what is being proposed here:

Does it make sense to solve all these problems with the same syntax?

That depends on the answers to your next two questions.

Moreover, that is not actually a very good question in this case. I think the judgement call should be whether a syntax that allows people to solve the identified problem set consistently is sufficiently valuable (measured in terms of the advantages weighed against the disadvantages) to justify being part of HTML5.

What are the disadvantages of doing so?

I am not sure.

What are the advantages?

Many people will be able to use standard tools which are part of their existing infrastructure to manipulate important data. They will be able to store that data in a visible form, in web pages. They will also be able to present the data easily in a form that does not force them to lose important semantics.

People will be able to build toolkits that allow for processing of data from webpages without knowing, a priori, the data model used for that information.

What is the
opportunity cost of encouraging everyone to expose data in the same way?

I don't know. I don't see much of an opportunity cost.

What is the cost of having different data use specialised formats?

If the data model, or part of it, is not explicit as in RDF but is implicit in the code written to process it (as is the case with scripts that read arbitrarily named data-* attributes, and with undocumented or semi-documented XML formats), then people must understand the code as well as the data model in order to use the data. In a corporate situation where hundreds or tens of thousands of people are required to work with the same data, this makes the data model very fragile.

Such considerations also apply to larger communities, for example those dealing with complex scientific information.

Do publishers actually want to use a common data format?

It would appear so - even in cases where they don't want to publish their data in such an easy-to-use format for commercial reasons.

How have past efforts in creating data formats fared?

Some have been pretty successful. Dublin Core is a general format for labelling content that is widely used. MARC records have been very successful.

Are enough data providers actually willing to expose their data in a
machine readable manner for this to be truly useful?

To make this truly useful it doesn't need to be exposed to the public. It would appear that organisations are prepared to make large investments in RDF data whether they expose them or not (and some very large ones do expose data), which suggests that this data is truly useful.

If data providers
will be willing to expose their data as RDFa, why are they not already
exposing their data in machine-readable form today?

 - For example, why doesn't Amazon expose a CSV file of your usage
   history, or an Atom feed of the comments for each product, or an
   hProduct annotated form of their product data? (Or do they? And if so,
   do we know if users use this data?)

Why would they need to?

 - As another example, why doesn't Craigslist like their data being
   reused in mashups? Would they be willing to allow their users to reuse
   their data in these new and exciting ways, or would they go out of
   their way to prevent the data from being accessible as soon as a
   critical mass of users started using it?

This is a key question. Why *should* a data provider be required to offer their product (data) for other people to use, in order to demonstrate that the data is useful? Google, a large provider of data, insists on certain conditions being met before it makes its services available, and that seems perfectly reasonable to me.

Whether Craigslist actively attempts to make their data easier to aggregate, or actively avoids facilitating that process, strikes me as irrelevant to the question of whether there is value in enabling them to do so. Large organisations specialising in gathering people's data, from Flickr and Google to Facebook and government taxation departments, are not the only consumers and producers of data that determine value for users.

It would seem important that the Web easily enable small-time users of data to communicate efficiently with one another, without needing one of the giants as an intermediary. When libraries in the Dominican Republic want to share data, and librarians in Léon want to use that data, the Web should facilitate that without resorting to intermediaries like Amazon or Yahoo!. Since we already have the technology to do so in a way that enables very powerful data models to be used without requiring coordination, it seems odd that you don't even understand how this could be valuable.

What will the licensing situation be like for this data? Will the licenses allow for the reuse being proposed to solve the problems and
use cases listed above?

In some cases yes, and in some cases no. In other words, making such data available does not distort natural market conditions one way or another.

How are Web browsers going to expose user interfaces to answer user
questions?

I am glad to see that you think user interface behaviour is in fact important to the process of specifying HTML (I had been under the impression that you believed the spec should not touch on it). There are various query systems already available in browsers, from the search engine in Opera that lets you do a free-text search on pages stored in your history to Tabulator - a substantial RDF browser available as a Widget for Opera or as an extension to Firefox, that allows for a variety of pre-configured questions as well as free-form questions.

Can only previously configured, hard-coded questions be asked,
or will Web browsers be able to answer arbitrary free-form questions from
users using the data exposed by RDFa?

Both of these are possible. The value of RDFa is that it actually supports free-form questions, by using a data model that is sufficiently well specified to enable the construction of tools that do not need to be preconfigured to recognise the exact type of data being queried (unlike, say, microformats, which require an intermediate agreement before people can extract the data, and don't provide for merging data of different types for rich queries).
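The difference can be sketched with a toy triple matcher (pure Python, invented data; a real system would use SPARQL over an RDF store, but the principle is the same):

```python
# A free-form query is just a triple pattern where None acts as a
# wildcard. Because the data model is uniform, the matcher needs no
# knowledge of any particular vocabulary.

def match(triples, s=None, p=None, o=None):
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

data = [
    ("movie:blade-runner", "dc:creator", "Ridley Scott"),
    ("movie:blade-runner", "dc:date", "1982"),
    ("movie:alien", "dc:creator", "Ridley Scott"),
]

# "What did Ridley Scott make?" -- a question nobody hard-coded in advance.
results = match(data, p="dc:creator", o="Ridley Scott")
print(sorted(t[0] for t in results))
# prints ['movie:alien', 'movie:blade-runner']
```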

How are Web browsers that expose this data going to handle data that is
not exposed in the same format? For example, if a site exposes data in
JSON or CSV format rather than RDFa, will that data be available to the
user in the same way?

Who cares? But for those who do, this is up to Web browsers. They can choose to implement transformations between some particular CSV data and RDFa. The difficulty here (and therefore an illustration of the value of RDFa) is that with CSV, important details of the meaning of the data are only available out of band, by looking at how the data is recorded, while RDF allows for automating the process of merging data originally encoded in different vocabularies.
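A small sketch of that CSV problem (Python stdlib, invented column names): the mapping from columns to meanings lives outside the file, so every consumer must be told it separately, whereas a triple carries its predicate with it.

```python
import csv
import io

# The CSV file alone does not say what "d" means: release date?
# date indexed? date of last edit? That knowledge is out of band.
raw = "title,d\nBlade Runner,1982\n"

# A consumer must supply the column-to-predicate mapping itself;
# a different consumer with a different guess produces different data.
mapping = {"title": "dc:title", "d": "dc:date"}  # assumed, not in the file

triples = []
for row in csv.DictReader(io.StringIO(raw)):
    for col, value in row.items():
        triples.append(("row:1", mapping[col], value))

print(triples)
# prints [('row:1', 'dc:title', 'Blade Runner'), ('row:1', 'dc:date', '1982')]
```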

...

What is the expected strategy to fight spam in these systems? Is it
expected that user agents will just collect data in the background? If so, how are user agents expected to distinguish between pages that have
reliable data and pages that expose data that is misleading or wrong?

Aggregating data in real-time is relatively expensive, so is a strategy more suited to dealing with asking new questions. Typical systems so far have aggregated data in the background to deal with known queries (one example is Google, which crawls pages in advance, anticipating searches that match terms against the content of those pages), and use live querying for cases where the result cannot reliably be stored (e.g. airline reservation systems like TravelJungle or LastMinute which determine price and availability based on constantly changing data).

Different use cases will imply different strategies for fighting spam. Some obvious ones are to rely on trusted sites and on secured and signed data, to use reputation managers, and to follow the "shape" of data over time so that anomalies can be highlighted and checked more carefully (in the manner of Bayesian filters for email). Some use cases don't care much about spam, or are not very interesting to spammers. Some involve private data anyway.

- Systems like Yahoo! Search and Live Search expend extraordinary amounts of resources on spam fighting technology; such technology
   would not be accessible to Web browsers unless they interacted with
   anti-spam services much like browsers today interact with
   anti-phishing services.

Actually, at least Opera already incorporates anti-spam technology in its mail client. Where browsers are the primary consumers of data there is nothing at all to suggest that they cannot incorporate anti-spam technology directly. (Indeed, the POWDER specification is designed in part to make that easy - and it is exactly the sort of data that might sometimes be usefully encoded in RDFa since it is based on an RDF model).

   Yet anti-phishing services have been controversial, since they involve
   exposing the user's browsing history to third parties; anti-spam
   services would be a significantly greater problem due to the vastly
   greater level of spamming compared to phishing. What is the solution
   proposed to tackle this problem?

It is not clear that this problem is any different in the context of RDFa to the general problem already faced by the Web. In general, the solutions proposed are the same as those already used on the Web, and of course those in development.

 - Even with a mechanism to distinguish trusted sites from spammy sites,
   how would Web browsers deal with trusted sites that have been subject
   to spamming attacks? This is common, for instance, on blogs or wikis.

Right. But that doesn't mean we question whether browsers should enable blogs or wikis. Why would RDFa data be different enough to make this question relevant?

These are not rhetorical questions, and I don't know the answers to them.

Some of them seem to be poorly phrased, although if you don't understand why people have been working on this technology and why they think it would be valuable to have it available in HTML I guess that is almost inevitable.

We need detailed answers to all those questions before we can really
evaluate the various proposals that have been made here.

No, we apparently need you to personally understand the Semantic Web industry. Determining answers to the questions which are important is probably helpful, but so is explaining when your questions are irrelevant because they are based on a lack of understanding. This is not intended as a slight, but to clarify the process required to have something as large as the "Semantic Web" (capital letters, implying the whole W3C activity, the industry based around RDF, and so on) evaluated for potential inclusion in the HTML5 specification.

I presume the same would apply if the "Web Services" people came and asked to have all of their things included in HTML, and offered a specification that could be used to achieve their desires.
...

[not clear what the context was here, so citing as it was]
> I don't think more metadata is going to improve search engines. In
> practice, metadata is so highly gamed that it cannot be relied upon.
> In fact, search engines probably already "understand" pages with far
> more accuracy than most authors will ever be able to express.

You are correct, more erroneous metadata is not going to improve search
engines. More /accurate/ metadata, however, IS going to improve search
engines. Nobody is going to argue that the system could not be gamed. I
can guarantee that it will be gamed.

However, that's the reality that we have to live with when introducing
any new web-based technology. It will be mis-used, abused and corrupted.
The question is, will it do more good than harm? In the case of RDFa
/and/ Microformats, we do think it will do more good than harm.

For search engines, I am not convinced. Google's experience is that
natural language processing of the actual information seen by the actual
end user is far, far more reliable than any source of metadata. Thus from
Google's perspective, investing in RDFa seems like a poorer investment
than investing in natural language processing.

Indeed. But Google is something of an edge case, since it can afford to run a huge organisation with massive computing power and many engineers to address a problem where a "near-enough" solution brings them the users who are in turn the product Google sells to advertisers. There are many other use cases where a small group of people want a way to reliably search trusted data.

From global virtual library systems to single websites, there are many others who find that processing structured data is more efficient for their needs than doing free-text analysis of web pages (something that they effectively contract out to Google, Ask, Yahoo! and the many competitors who specialise in it). Some of these are the people who have decided that investing in RDFa is a far more valuable exercise than trying to out-invest Google in natural language processing.

This email is already too long for most people to get through it :( I believe that this discussion is going to last for some time (I cannot imagine why, given the HTML timeline, it would need to be resolved before June), so there will be time for others to discuss more fully the many points Ian raises as ones he would like to understand.

cheers

Chaals

--
Charles McCathieNevile  Opera Software, Standards Group
    je parle français -- hablo español -- jeg lærer norsk
http://my.opera.com/chaals       Try Opera: http://www.opera.com
