[ 
https://issues.apache.org/jira/browse/ANY23-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420756#comment-16420756
 ] 

ASF GitHub Bot commented on ANY23-339:
--------------------------------------

Github user lewismc commented on the issue:

    https://github.com/apache/any23/pull/67
  
    I haven't read The Reality Dysfunction... but the PR looks good :)


> Microdata extractor can sometimes merge two different itemscopes into one
> -------------------------------------------------------------------------
>
>                 Key: ANY23-339
>                 URL: https://issues.apache.org/jira/browse/ANY23-339
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> The microdata extractor calculates the *subject* of a triple as the 
> *hashCode()* of the itemscope.
> Java's hashCode() method (returning a 32-bit integer) is not guaranteed to be 
> collision-free. (Especially so in this case, since the ItemScope.hashCode() 
> method is not written very well).
> This means that two microdata items can accidentally be merged into one.
> Here's the line that needs to be changed: 
> [https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]
> I recommend changing 
> {code}
> subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
> {code}
> to
> {code}
> subject = RDFUtils.bnode();
> {code}
> We could also use {{itemScope.getItemId()}} if it's not null, even if it's 
> not a URL. One possible example of such an id is:
> {code}
> urn:isbn:0-330-34032-8
> {code}
> Edit: according to the [microdata 
> spec|https://www.w3.org/TR/microdata-rdf/#dfn-absolute-url], 
> {{urn:isbn:0-330-34032-8}} *is* an absolute URL. Since their definition of 
> URL seems to correspond more closely to our definition of URI, we should be 
> checking for absolute URLs with {{URI.isAbsolute()}} rather than with 
> {{URL.getProtocol() != null}}.
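
For illustration only, here is a minimal standalone sketch of the two points made in the description above: that distinct values can share a 32-bit hashCode(), and that java.net.URI and java.net.URL treat an id such as urn:isbn:0-330-34032-8 differently. The class and method names (ItemIdCheck, hashCodesCollide, isAbsoluteUri, isParsableUrl) are hypothetical and are not part of Any23. The URL behaviour shown assumes a stock JDK with no custom URLStreamHandlerFactory installed.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class for illustration; not part of Any23.
public class ItemIdCheck {

    // True when two different strings share the same 32-bit hashCode().
    static boolean hashCodesCollide(String a, String b) {
        return !a.equals(b) && a.hashCode() == b.hashCode();
    }

    // True when the id parses as a java.net.URI that has a scheme.
    static boolean isAbsoluteUri(String id) {
        try {
            return new URI(id).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // URL-based check. The URL constructor rejects schemes for which
    // the JDK has no protocol handler, so the check never gets as far
    // as getProtocol() for such ids.
    static boolean isParsableUrl(String id) {
        try {
            return new URL(id).getProtocol() != null;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String id = "urn:isbn:0-330-34032-8";
        // "Aa" and "BB" are a well-known String.hashCode() collision.
        System.out.println("Aa/BB collide: " + hashCodesCollide("Aa", "BB"));
        System.out.println("URI absolute:  " + isAbsoluteUri(id));
        System.out.println("URL parsable:  " + isParsableUrl(id));
    }
}
```

Under those assumptions, the URI-based check accepts the urn: id while the URL-based check rejects it, which is the mismatch the edit to the description points out.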



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
