Hans Brende created ANY23-339:
---------------------------------
Summary: Microdata extractor can sometimes merge two different
itemscopes into one
Key: ANY23-339
URL: https://issues.apache.org/jira/browse/ANY23-339
Project: Apache Any23
Issue Type: Bug
Components: extractors
Affects Versions: 2.2
Reporter: Hans Brende
Fix For: 2.3
The microdata extractor calculates the *subject* of a triple as the
*hashCode()* of the itemscope.
Java's hashCode() method returns a 32-bit integer and is not guaranteed to be
collision-free, so two distinct itemscopes can produce the same value. (The risk
is higher in this case, since the ItemScope.hashCode() method is not written
very well.)
This means that two microdata items can accidentally be merged into one.
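To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use Any23 classes: {{HashCollisionDemo}} and {{labelFor}} are hypothetical names, and plain strings stand in for itemscopes. It only shows that two unequal objects can share a hashCode() and therefore end up with the same label.

```java
class HashCollisionDemo {
    // Mimics the current scheme: the label is the decimal form of hashCode().
    // Hypothetical helper, not part of Any23.
    static String labelFor(Object item) {
        return Integer.toString(item.hashCode());
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known pair of unequal strings with equal hash codes.
        String first = "Aa";
        String second = "BB";

        System.out.println(first.equals(second));                      // false
        System.out.println(labelFor(first).equals(labelFor(second)));  // true
    }
}
```

With labels derived this way, any statements about the two objects would be attached to the same blank node.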
Here's the line that needs to be changed:
[https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]
I recommend changing
{code}
subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
{code}
to
{code}
subject = RDFUtils.bnode();
{code}
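One thing to watch with fresh blank nodes: if the same itemscope object is visited more than once, it presumably needs to keep the same subject (that requirement is my assumption, not something verified against the extractor). A rough sketch of one way to do that, using only the JDK, is to hand out labels from a counter and remember them per object identity. {{FreshLabels}} and {{labelFor}} are hypothetical names, not Any23 API.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class FreshLabels {
    private final AtomicLong counter = new AtomicLong();
    // Keyed by object identity, so equals()/hashCode() of the item play no role.
    private final Map<Object, String> assigned = new IdentityHashMap<>();

    // Returns the label already assigned to this exact instance,
    // or assigns the next unused one.
    String labelFor(Object item) {
        return assigned.computeIfAbsent(item, k -> "b" + counter.incrementAndGet());
    }

    public static void main(String[] args) {
        FreshLabels labels = new FreshLabels();
        // Two distinct instances that are equal and share a hashCode().
        String first = new String("same");
        String second = new String("same");

        System.out.println(labels.labelFor(first));   // b1
        System.out.println(labels.labelFor(second));  // b2
        System.out.println(labels.labelFor(first));   // b1
    }
}
```

This sketch is not thread-safe; it is only meant to show that identity-based labels cannot collide for distinct instances.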
We could also use {{itemScope.getItemId()}} if it's not null, even if it's not
a URL. One example of such an id is:
{code}
urn:isbn:0-330-34032-8
{code}
This information is currently not recorded anywhere in the triples, which is an
issue in its own right!
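A possible shape for the subject selection, as a sketch only: use the item id when it parses as an absolute URI (RDF subjects must be IRIs or blank nodes, so a relative or malformed id would not qualify), and fall back to a fresh blank node otherwise. {{SubjectChooser}} and {{chooseSubject}} are hypothetical names, the subject is shown as an N-Triples-style string rather than an RDF4J value, and the "absolute URI" check is my assumption about what should count as usable.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.concurrent.atomic.AtomicLong;

class SubjectChooser {
    private static final AtomicLong COUNTER = new AtomicLong();

    // Returns "<id>" when itemId is an absolute URI (a URN counts),
    // otherwise a fresh blank-node label such as "_:b1".
    static String chooseSubject(String itemId) {
        if (itemId != null) {
            try {
                URI uri = new URI(itemId.trim());
                if (uri.isAbsolute()) {
                    return "<" + uri + ">";
                }
            } catch (URISyntaxException e) {
                // Not a usable identifier; fall through to a blank node.
            }
        }
        return "_:b" + COUNTER.incrementAndGet();
    }

    public static void main(String[] args) {
        System.out.println(chooseSubject("urn:isbn:0-330-34032-8")); // <urn:isbn:0-330-34032-8>
        System.out.println(chooseSubject(null));                     // a blank node label
        System.out.println(chooseSubject("not a uri"));              // a different blank node label
    }
}
```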
Another possible id is {{itemScope.getId()}}, which is the DOM id. We will have
to check the spec to see whether this id is a valid subject for the triple.