Hans Brende created ANY23-340:
---------------------------------

             Summary: Any23 extraction does not pass Nutch plugin test
                 Key: ANY23-340
                 URL: https://issues.apache.org/jira/browse/ANY23-340
             Project: Apache Any23
          Issue Type: Bug
          Components: extractors
    Affects Versions: 2.2
            Reporter: Hans Brende
             Fix For: 2.3

When the [SAX parsing filter|https://github.com/apache/nutch/blob/2934d4384901d4eda0aeecfa281bfbb2d9b9b0c1/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L111-L116] is removed from the Nutch Any23 plugin, the test case fails. Cf. this pull request: https://github.com/apache/nutch/pull/306

There are two test files: (1) [microdata_basic.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/microdata_basic.html], and (2) [BBC_News_Scotland.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/BBC_News_Scotland.html].

----

For (1), the test case expects 39 triples to be extracted. With the SAX pre-filter, 39 triples are extracted. Without the SAX pre-filter, only 38 triples are extracted.

The bad news is, BOTH OF THESE NUMBERS ARE WRONG. *40* triples should be extracted.

*Without* the SAX pre-filter, the html-microdata extractor loses 2 triples to ANY23-339, bringing the total to 38 (40 - 2).

*With* the SAX pre-filter, it sees the *meta* element in the following code:
{code}
<span itemscope><meta itemprop="name" content="The Castle"></span>
{code}
And tries to wrap it in a *head* element:
{code}
<span itemscope="itemscope"></span>
</body><head><meta itemprop="name" content="The Castle"></meta></head><body>
{code}
The Jsoup pre-filter then throws out that wrapper, as it should:
{code}
<span itemscope="itemscope"></span>
<meta itemprop="name" content="The Castle Content" />
{code}
This leaves us with an itemprop *not wrapped in an itemscope* (-2 triples) and an EMPTY item scope (+1 triple), bringing the total to 39 (40 - 2 + 1).

----

For (2), the extraction fails to extract a total of 11 triples, *all of which* have a predicate IRI equal to "http://www.w3.org/1999/xhtml/vocab#role".

Of those 11 triples:
* 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#navigation",
* 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#search",
* 1 triple has the object IRI "http://www.w3.org/1999/xhtml/vocab#contentinfo", and
* 8 triples have the object IRI "http://www.w3.org/1999/xhtml/vocab#presentation".

All of these triples are being overlooked by the html-rdfa11 extractor. The reason they are being overlooked appears to be the document type declaration of the document, which is:
{code}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
{code}
The problem seems to lie with the PUBLIC id alone. Changing the document type to any of the following:

(a) the SYSTEM id only:
{code}
<!DOCTYPE html SYSTEM "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
{code}
or (b) the plain HTML doctype:
{code}
<!DOCTYPE html>
{code}
or (c) no doctype at all:
{code}
{code}
results in all 11 triples being extracted as expected.

So, this would be easily fixed just by removing doctypes from all documents.

Comments or insights, anyone?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
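
For anyone who wants to reproduce the doctype experiment on BBC_News_Scotland.html, the first two rewrites described above could be applied to the input string before it is handed to the extractor. This is only a minimal sketch of that preprocessing, assuming the doctype appears as a plain `<!DOCTYPE ... PUBLIC "..." "...">` declaration; the class and method names are hypothetical and are not part of Any23 or Nutch, and it is a stopgap for testing rather than a fix for the extractor itself.

```java
import java.util.regex.Pattern;

/** Hypothetical helper (NOT part of Any23 or Nutch): rewrites a PUBLIC doctype before extraction. */
class DoctypeWorkaround {

    // Matches <!DOCTYPE name PUBLIC "public-id" "system-id"> (case-insensitive).
    // Group 1 = root element name, group 2 = system identifier.
    private static final Pattern PUBLIC_DOCTYPE = Pattern.compile(
            "<!DOCTYPE\\s+(\\S+)\\s+PUBLIC\\s+\"[^\"]*\"\\s+\"([^\"]*)\"\\s*>",
            Pattern.CASE_INSENSITIVE);

    /** Keeps the system identifier and drops only the PUBLIC id. */
    static String dropPublicId(String html) {
        // If there is no PUBLIC doctype, replaceFirst returns the input unchanged.
        return PUBLIC_DOCTYPE.matcher(html).replaceFirst("<!DOCTYPE $1 SYSTEM \"$2\">");
    }

    /** Replaces the PUBLIC doctype with the bare form, e.g. <!DOCTYPE html>. */
    static String toBareDoctype(String html) {
        return PUBLIC_DOCTYPE.matcher(html).replaceFirst("<!DOCTYPE $1>");
    }

    public static void main(String[] args) {
        String doc = "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML+RDFa 1.0//EN\" "
                + "\"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd\">\n<html></html>";
        System.out.println(dropPublicId(doc));
        System.out.println(toBareDoctype(doc));
    }
}
```

Only the first PUBLIC doctype in the input is rewritten, and documents without one pass through untouched, so the sketch is safe to run over both sample files.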