Hi Hugh,

Comments below.

On 19/07/10 08:22, Hugh Glaser wrote:
To answer your question, Sindice will accept the document, perform
reasoning and index it as it is. However, Sindice is fairly robust to
this kind of "poisonous" data. Sindice performs a particular kind
of reasoning that we call "context-dependent" reasoning [1], in which
inference is performed in the "context of the document". The inference
will only be true in the context of this document, and will not have a
global impact, i.e., will not alter the inference on other documents.
Therefore, Sindice avoids undesirable assertions. In fact, we do not
restrict the freedom of expression of data publishers, unlike other
approaches such as SAOR [2], where certain statements are considered
invalid and ignored. Data publishers are allowed to reuse and extend
ontologies or existing entities in any manner, but the consequences of
their modifications will be confined to their own context, and will not alter
the intended semantics of the other RDF models published on the Web.
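
To illustrate the scoping idea only, here is a minimal sketch in Python. It is not Sindice's reasoner (see [1] for that); the document URLs, the ex:dave term and the restriction to owl:sameAs are assumptions made for the example. Each document gets its own closure, and nothing inferred in one document is visible in another.

```python
from collections import defaultdict

SAME_AS = "owl:sameAs"


def same_as_closure(triples):
    """Map each term to its owl:sameAs equivalents within ONE document."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for s, p, o in triples:
        if p == SAME_AS:
            parent[find(s)] = find(o)

    groups = defaultdict(set)
    for term in list(parent):
        groups[find(term)].add(term)
    return {term: members for members in groups.values() for term in members}


def infer_per_context(documents):
    """documents: dict of URL -> list of (s, p, o) triples.

    Inference runs once per document; results are never merged globally.
    """
    return {url: same_as_closure(triples) for url, triples in documents.items()}


# Hypothetical documents, for illustration only.
documents = {
    "http://example.org/poison.rdf": [
        ("ex:dave", "owl:sameAs", "dbpedia:Darby_Riordan"),
    ],
    "http://example.org/clean.rdf": [
        ("dbpedia:Darby_Riordan", "rdfs:label", "Darby Riordan"),
    ],
}
contexts = infer_per_context(documents)
```

Here the equivalence between ex:dave and dbpedia:Darby_Riordan exists only in the context of poison.rdf; the context of clean.rdf contains no such inference.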
Cool.
Sounds really good that the inference part of Sindice is robust to this.
Although I guess if I use Sindice to find relevant documents for
dbpedia:Darby_Riordan and load them into my store, I am likely to end up
with a pretty poisoned store.
As you say, you are looking for relevant documents about dbpedia:Darby_Riordan. In this case, with an appropriate ranking, it is unlikely that poisonous/spamming documents will appear in the top-k results.
However, if somebody requests all documents stating <?s, owl:sameAs,
dbpedia:Darby_Riordan>, Sindice will return the document
http://data.totl.net/dave.rdf. But such a problem can be tackled with
appropriate ranking methodologies (based on link analysis methods such
as [3]).
Poisonous documents published on the web are unlikely to have
any incoming links (or only from other poisonous documents, but this can
be detected), and therefore will be ranked very low and will never
appear in the top-k search results.
Not sure of this.
Poisonous documents may well have many links to them (saying they are
poisonous?).
Good point, but that would mean people agree on a certain vocabulary for flagging poisonous documents. In that case, this information (the meaning of the link) can be integrated into the ranking function. If a document has many incoming links of a type such as isPoisonous, then we can rank it lower. Finding the right ranking function is a separate (and interesting) problem, but it is possible.
This seems to me to be comparable to the citation problem, where a paper
gets very high citations because everyone cites it as being wrong.
Of course, sentiment analysis etc. may help (and may be easier in the
semantic web), but pure reference count is dangerous.
The ranking should not be based purely on reference counts; it should also take the meaning of the links into consideration. Relying only on the meaning of the links is dangerous too. For example, if I create a link to a DBpedia document saying it is poisonous, why should people trust me? However, if a multitude of links say that the document is poisonous, then we can have more confidence that it really is.
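
As a rough sketch of what such a ranking function could look like (in Python; the ex:isPoisonous link type, the threshold of three reports and the scoring formula are all assumptions for illustration, not something Sindice implements):

```python
def rank_score(incoming_links, poison_type="ex:isPoisonous", min_reports=3):
    """Score a document from its incoming links.

    incoming_links: list of (source_document, link_type) pairs.
    Only distinct sources are counted, so one publisher cannot
    inflate either the endorsements or the poison reports.
    """
    reporters = {src for src, link_type in incoming_links
                 if link_type == poison_type}
    endorsers = {src for src, link_type in incoming_links
                 if link_type != poison_type} - reporters

    score = float(len(endorsers))
    # A single accusation is not trusted; only a multitude of
    # independent reports lowers the score.
    if len(reporters) >= min_reports:
        score /= 1 + len(reporters)
    return score


links = [("a", "ex:cites"), ("b", "ex:cites")]
one_report = links + [("c", "ex:isPoisonous")]
many_reports = links + [(s, "ex:isPoisonous") for s in ("c", "d", "e")]

print(rank_score(links))         # 2.0
print(rank_score(one_report))    # 2.0
print(rank_score(many_reports))  # 0.5
```

A lone isPoisonous link changes nothing, while three independent ones cut the score; how to weight the reporters themselves is the harder, open part.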

Regards,
--
Renaud Delbru
