[ 
https://issues.apache.org/jira/browse/ANY23-339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-339:
------------------------------
    Description: 
The microdata extractor calculates the *subject* of a triple as the 
*hashCode()* of the itemscope.

Java's hashCode() method returns a 32-bit integer and is not guaranteed to be 
collision-free. Collisions are especially likely in this case, since the 
ItemScope.hashCode() method is not written very well.

This means that two microdata items can accidentally be merged into one.
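
To illustrate the failure mode, here is a minimal, self-contained sketch. It does not use Any23 at all: {{idFromHash}} is a hypothetical stand-in for deriving a blank node id from {{hashCode()}}, and the {{"node"}} prefix is an assumption made only for the example.

```java
final class HashCollisionDemo {
    // Hypothetical stand-in: derives a blank node id from hashCode(),
    // in the same spirit as the extractor line discussed in this issue.
    static String idFromHash(Object item) {
        return "node" + Integer.toString(item.hashCode());
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are distinct strings that share the hash code 2112.
        String first = "Aa";
        String second = "BB";

        System.out.println(first.equals(second));                          // false
        System.out.println(idFromHash(first).equals(idFromHash(second)));  // true
    }
}
```

Two unequal objects end up with the same id, so any triples written against that id would be attributed to a single subject.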

Here's the line that needs to be changed: 

[https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]

I recommend changing 
{code}
subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
{code}
to
{code}
subject = RDFUtils.bnode();
{code}
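
One thing to watch with fresh blank nodes: if the extractor can ask for the subject of the same item scope more than once (an assumption, not something verified against the code), each call would now return a different node. A possible way around that is to remember the id per object identity. The sketch below is hypothetical; {{SubjectAllocator}} and {{subjectFor}} are illustrative names, not Any23 API.

```java
import java.util.IdentityHashMap;
import java.util.Map;
import java.util.UUID;

final class SubjectAllocator {
    // Keyed by object identity rather than equals()/hashCode(), so two
    // distinct item scopes never share an id, even if they compare equal.
    private final Map<Object, String> ids = new IdentityHashMap<>();

    // Returns the same freshly generated id every time for the same object.
    String subjectFor(Object itemScope) {
        return ids.computeIfAbsent(
                itemScope,
                key -> "node" + UUID.randomUUID().toString().replace("-", ""));
    }

    public static void main(String[] args) {
        SubjectAllocator allocator = new SubjectAllocator();
        Object scopeA = new String("same content");
        Object scopeB = new String("same content"); // equal to scopeA, but a different object

        System.out.println(allocator.subjectFor(scopeA).equals(allocator.subjectFor(scopeA))); // true
        System.out.println(allocator.subjectFor(scopeA).equals(allocator.subjectFor(scopeB))); // false
    }
}
```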

We could also use {{itemScope.getItemId()}} if it's not null, even if it's not 
a URL. One example of such an id is:
{code}
urn:isbn:0-330-34032-8
{code}

Edit: according to the [microdata 
spec|https://www.w3.org/TR/microdata-rdf/#dfn-absolute-url], 
{{urn:isbn:0-330-34032-8}} *is* an absolute URL. Since their definition of URL 
seems to correspond more closely to our definition of URI, we should be 
checking for absolute URLs with {{URI.isAbsolute()}} rather than with 
{{URL.getProtocol() != null}}.
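
The difference between the two checks can be seen with the JDK alone. This is a small sketch, not the extractor's code: {{isAbsoluteUri}} and {{parsesAsUrl}} are hypothetical helper names, and the {{java.net.URL}} result assumes a default JVM with no stream handler registered for the {{urn}} scheme.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class AbsoluteIdCheck {
    // True when the id parses as a URI that has a scheme component.
    static boolean isAbsoluteUri(String id) {
        try {
            return new URI(id).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // True only when java.net.URL accepts the id; URL rejects schemes
    // it has no protocol handler for by throwing MalformedURLException.
    static boolean parsesAsUrl(String id) {
        try {
            return new URL(id).getProtocol() != null;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String isbn = "urn:isbn:0-330-34032-8";
        System.out.println(isAbsoluteUri(isbn));      // true
        System.out.println(parsesAsUrl(isbn));        // false
        System.out.println(isAbsoluteUri("books/1")); // false (relative reference)
    }
}
```

With the URI-based check, the ISBN itemid is recognized as absolute and could be used as the subject directly.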




  was:
The microdata extractor calculates the *subject* of a triple as the 
*hashCode()* of the itemscope.

Java's hashCode() method (returning a 32-bit integer) is not guaranteed to be 
collision-free. (Especially so in this case, since the ItemScope.hashCode() 
method is not written very well).

This means that two microdata items can accidentally be merged into one.

Here's the line that needs to be changed: 

[https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]

I recommend changing 
{code}
subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
{code}
to
{code}
subject = RDFUtils.bnode();
{code}

We could also use {{itemScope.getItemId()}} if it's not null, even if it's not 
a URL. One example of such an id is:
{code}
urn:isbn:0-330-34032-8
{code}
This information is currently not recorded anywhere in the triples, which is an 
issue in its own right!

Another possible id is {{itemScope.getId()}}, which is the DOM id. I will have to 
check the spec to see whether this id is a valid subject for the triple.





> Microdata extractor can sometimes merge two different itemscopes into one
> -------------------------------------------------------------------------
>
>                 Key: ANY23-339
>                 URL: https://issues.apache.org/jira/browse/ANY23-339
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
