Hans Brende created ANY23-339:
---------------------------------

             Summary: Microdata extractor can sometimes merge two different 
itemscopes into one
                 Key: ANY23-339
                 URL: https://issues.apache.org/jira/browse/ANY23-339
             Project: Apache Any23
          Issue Type: Bug
          Components: extractors
    Affects Versions: 2.2
            Reporter: Hans Brende
             Fix For: 2.3


The microdata extractor calculates the *subject* of a triple as the 
*hashCode()* of the itemscope.

Java's hashCode() method returns a 32-bit integer and is not guaranteed to be 
collision-free. Collisions are especially likely in this case, since the 
ItemScope.hashCode() method is not written very well.

This means that two microdata items can accidentally be merged into one.
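
To illustrate the failure mode outside of Any23, here is a minimal standalone sketch. It uses the well-known String pair "Aa" and "BB", which share a hashCode(), as a stand-in for two different itemscopes. The {{labelFor}} helper is hypothetical and only mimics deriving a blank node label from the hash; it is not the extractor's actual code.

```java
import java.util.HashMap;
import java.util.Map;

class HashCollisionDemo {

    // Hypothetical stand-in for a label derived only from hashCode().
    static String labelFor(Object item) {
        return Integer.toString(item.hashCode());
    }

    public static void main(String[] args) {
        String first = "Aa";   // stands in for one itemscope
        String second = "BB";  // stands in for a different itemscope

        // Map from subject label to the item it is supposed to identify.
        Map<String, Object> subjects = new HashMap<>();
        subjects.put(labelFor(first), first);
        subjects.put(labelFor(second), second); // overwrites the first entry

        System.out.println("items equal:  " + first.equals(second));
        System.out.println("labels equal: " + labelFor(first).equals(labelFor(second)));
        System.out.println("distinct subjects: " + subjects.size());
    }
}
```

Both items end up under a single label, which is the same effect as two microdata items being merged into one subject.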

Here's the line that needs to be changed: 

[https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]

I recommend changing 
{code}
subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
{code}
to
{code}
subject = RDFUtils.bnode();
{code}

We could also use {{itemScope.getItemId()}} if it's not null, even if it's not 
a URL. One example of such an id is:
{code}
urn:isbn:0-330-34032-8
{code}
This information is currently not recorded anywhere in the triples, which is an 
issue in its own right!
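
A rough sketch of how the subject could be chosen under this idea, using only the JDK rather than the Any23/RDF4J API. The class and method names are hypothetical, the check that the itemid parses as an absolute URI is my assumption, and the UUID-based label only stands in for whatever {{RDFUtils.bnode()}} generates:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.UUID;

class SubjectChooser {

    // Returns the itemid if it parses as an absolute URI (assumption),
    // otherwise a fresh label that does not depend on hashCode().
    static String chooseSubject(String itemId) {
        if (itemId != null) {
            try {
                URI uri = new URI(itemId.trim());
                if (uri.isAbsolute()) {
                    return uri.toString();
                }
            } catch (URISyntaxException e) {
                // Not usable as an identifier; fall through to a fresh label.
            }
        }
        // Stand-in for a freshly generated blank node label.
        return "_:" + UUID.randomUUID().toString().replace("-", "");
    }

    public static void main(String[] args) {
        System.out.println(chooseSubject("urn:isbn:0-330-34032-8"));
        System.out.println(chooseSubject(null));
    }
}
```

With this approach the itemid is preserved as the subject when present, and items without one still get distinct subjects.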

Another possible id is {{itemScope.getId()}}, which is the DOM id. I will have to 
check the spec to see whether this id is a valid subject for the triple.






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
