[ https://issues.apache.org/jira/browse/ANY23-340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16420825#comment-16420825 ]
Hans Brende edited comment on ANY23-340 at 3/30/18 6:45 PM:
------------------------------------------------------------

[~lewismc] Do you know if RDFa version 1.0 triples are always a subset of RDFa 1.1 triples? If so, I could probably remove all doctypes without consequence.

was (Author: hansbrende):
[~lewismc] Do you know if RDFa version 1.0 triples are always a subset of RDFa 1.1 triples?

> Any23 extraction does not pass Nutch plugin test
> ------------------------------------------------
>
>                 Key: ANY23-340
>                 URL: https://issues.apache.org/jira/browse/ANY23-340
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> When the [SAX parsing filter|https://github.com/apache/nutch/blob/2934d4384901d4eda0aeecfa281bfbb2d9b9b0c1/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L111-L116] is removed from the Nutch Any23 plugin, the test case fails.
> Cf. this pull request: https://github.com/apache/nutch/pull/306
> There are two test files: (1) [microdata_basic.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/microdata_basic.html], and (2) [BBC_News_Scotland.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/BBC_News_Scotland.html].
> ----
> For (1), the test case expects 39 triples to be extracted. With the SAX pre-filter, 39 triples are extracted. Without the SAX pre-filter, only 38 triples are extracted.
> The bad news is, BOTH OF THESE NUMBERS ARE WRONG. *40* triples should be extracted.
> *Without* the SAX pre-filter, the html-microdata extractor loses 2 triples to ANY23-339, bringing the total to 38.
> *With* the SAX pre-filter, it sees the *meta* element in the following code:
> {code:html}
> <span itemscope><meta itemprop="name" content="The Castle"></span>
> {code}
> and tries to wrap it in a *head* element:
> {code:html}
> <span itemscope="itemscope"></span>
> </body><head><meta itemprop="name" content="The Castle"></meta></head><body>
> {code}
> which the Jsoup pre-filter then throws out, as it should:
> {code:html}
> <span itemscope="itemscope"></span>
> <meta itemprop="name" content="The Castle" />
> {code}
> leaving us with an item *not wrapped in an itemscope* (-2 triples, though those 2 would be lost anyway due to ANY23-339) and an EMPTY itemscope (+1 triple), bringing the total to 39.
> ----
> For (2), the extraction fails by omitting a total of 11 triples, *all of which* have the predicate IRI "http://www.w3.org/1999/xhtml/vocab#role".
> Of those 11 triples, 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#navigation", 1 has the object IRI "http://www.w3.org/1999/xhtml/vocab#search", 1 has the object IRI "http://www.w3.org/1999/xhtml/vocab#contentinfo", and 8 have the object IRI "http://www.w3.org/1999/xhtml/vocab#presentation".
> All of these triples are being overlooked by the html-rdfa11 extractor.
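> (For reference, these triples come from the RDFa role attribute. Under the W3C Role Attribute processing rules, markup along the following lines, shown purely as an illustration rather than as an excerpt from the BBC page, should produce one such triple, assuming a base IRI of http://example.org/page:)
> {code:html}
> <div id="nav" role="navigation">...</div>
> {code}
> {code}
> <http://example.org/page#nav> <http://www.w3.org/1999/xhtml/vocab#role> <http://www.w3.org/1999/xhtml/vocab#navigation> .
> {code}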
> The reason these role triples are being overlooked is, apparently, that the document's doctype declaration *specifies XHTML+RDFa version 1.0*:
> {code:html}
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN"
>     "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
> {code}
> When I either change the document type to XHTML+RDFa version *1.1*:
> {code:html}
> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN"
>     "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
> {code}
> or remove the doctype altogether, all 11 triples are extracted as expected.
> So, this could easily be fixed just by removing doctypes from all documents before extraction. Comments or insight, anyone?
> Question: does anyone know whether or not the RDFa version 1.0 triples extracted from a page *are guaranteed to be a subset* of the RDFa version 1.1 triples extracted?
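> As a rough illustration of that fix, a pre-processing step along these lines (a minimal sketch only; the class and method names are hypothetical, not the actual plugin code) would strip the doctype from the raw HTML before it reaches the extractors:
> {code:java}
> import java.util.regex.Pattern;
>
> public class DoctypeStripper {
>
>     // Matches a DOCTYPE declaration (with no internal subset), even when the
>     // public and system identifiers span multiple lines.
>     private static final Pattern DOCTYPE =
>             Pattern.compile("<!DOCTYPE[^>]*>", Pattern.CASE_INSENSITIVE);
>
>     public static String stripDoctype(String html) {
>         return DOCTYPE.matcher(html).replaceFirst("");
>     }
>
>     public static void main(String[] args) {
>         String html = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\"\n"
>                 + "  \"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">\n"
>                 + "<html>...</html>";
>         // Prints the markup with the doctype declaration removed.
>         System.out.println(stripDoctype(html));
>     }
> }
> {code}
> Whether that is actually safe is exactly the subset question above: if RDFa 1.0 triples are not always a subset of RDFa 1.1 triples, dropping the doctype could change the output for some 1.0 documents.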