[ https://issues.apache.org/jira/browse/ANY23-339?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420849#comment-16420849 ]
Hudson commented on ANY23-339:
------------------------------

SUCCESS: Integrated in Jenkins build Any23-trunk #1549 (See [https://builds.apache.org/job/Any23-trunk/1549/])
ANY23-339 fixes itemscope hashcode collision problem, allows absolute (hans: rev a1b72b720a2cdb2802fd8e82856ee67702d002cd)
* (edit) core/src/test/java/org/apache/any23/extractor/microdata/MicrodataExtractorTest.java
* (edit) core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java


> Microdata extractor can sometime merge two different itemscopes into one
> ------------------------------------------------------------------------
>
>                 Key: ANY23-339
>                 URL: https://issues.apache.org/jira/browse/ANY23-339
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> The microdata extractor calculates the *subject* of a triple as the
> *hashCode()* of the itemscope.
> Java's hashCode() method (returning a 32-bit integer) is not guaranteed to be
> collision-free. (Especially so in this case, since the ItemScope.hashCode()
> method is not written very well).
> This means that two microdata items can accidentally be merged into one.
> Here's the line that needs to be changed:
> [https://github.com/apache/any23/blob/316b4ec0d6285a204789792084caf012c000b196/core/src/main/java/org/apache/any23/extractor/microdata/MicrodataExtractor.java#L439]
> I recommend changing
> {code}
> subject = RDFUtils.getBNode(Integer.toString(itemScope.hashCode()));
> {code}
> to
> {code}
> subject = RDFUtils.bnode();
> {code}
> We could also use {{itemScope.getItemId()}} if it's not null, even if it's
> not a URL. An example of one such id possible is:
> {code}
> urn:isbn:0-330-34032-8
> {code}
> Edit: according to the [microdata
> spec|https://www.w3.org/TR/microdata-rdf/#dfn-absolute-url],
> {{urn:isbn:0-330-34032-8}} *is* an absolute URL. Since their definition of
> URL seems to correspond more closely to our definition of URI, we should be
> checking for absolute urls with {{URI.isAbsolute()}} rather than with
> {{URL.getProtocol() != null}}
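
To illustrate the two points in the description, here is a minimal, self-contained Java sketch. It is not the committed fix and does not use any Any23 classes. The class name SubjectSketch and its helper methods are hypothetical, plain strings stand in for ItemScope objects, and a random UUID stands in for whatever label RDFUtils.bnode() generates (its internals are not shown in this message). It shows that two distinct values can share a hashCode() and therefore a subject label, and that the example id is absolute according to java.net.URI while java.net.URL rejects it.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.UUID;

// Hypothetical illustration only; none of these names exist in Any23.
final class SubjectSketch {

    // Old approach from the issue: the label is derived from hashCode(),
    // so two different objects with equal hash codes get the same label.
    static String hashLabel(Object itemScope) {
        return Integer.toString(itemScope.hashCode());
    }

    // Assumed stand-in for RDFUtils.bnode(): a label that does not
    // depend on the content or hash code of the item.
    static String freshLabel() {
        return UUID.randomUUID().toString();
    }

    // Check suggested in the issue: an id is usable as a subject
    // if java.net.URI considers it absolute (i.e. it has a scheme).
    static boolean isAbsoluteUri(String id) {
        try {
            return new URI(id).isAbsolute();
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // java.net.URL only accepts schemes it has a protocol handler for,
    // so ids such as urn:isbn:... fail to parse as a URL at all.
    static boolean parsesAsUrl(String id) {
        try {
            new URL(id);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "Aa" and "BB" are a well-known String.hashCode() collision.
        System.out.println(hashLabel("Aa").equals(hashLabel("BB")));
        System.out.println(freshLabel().equals(freshLabel()));
        String id = "urn:isbn:0-330-34032-8";
        System.out.println(isAbsoluteUri(id));
        System.out.println(parsesAsUrl(id));
    }
}
```

The string pair is only a convenient way to force a collision; the same effect applies to any class whose hashCode() maps distinct instances to the same integer.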