[ 
https://issues.apache.org/jira/browse/ANY23-340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hans Brende updated ANY23-340:
------------------------------
    Description: 
When the [SAX parsing 
filter|https://github.com/apache/nutch/blob/2934d4384901d4eda0aeecfa281bfbb2d9b9b0c1/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L111-L116]
 is removed from the Nutch Any23 plugin, the test case fails.

Cf. this pull request: https://github.com/apache/nutch/pull/306

There are two test files: (1) 
[microdata_basic.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/microdata_basic.html],
 and (2) 
[BBC_News_Scotland.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/BBC_News_Scotland.html].
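
For reference, the triple counts discussed below can be reproduced outside of Nutch by running Any23 directly on the sample files. A minimal sketch (the file path and base IRI are placeholders, and counting non-blank N-Triples lines is only a rough proxy for how the Nutch test counts statements):
{code:java}
import org.apache.any23.Any23;
import org.apache.any23.source.DocumentSource;
import org.apache.any23.source.FileDocumentSource;
import org.apache.any23.writer.NTriplesWriter;
import org.apache.any23.writer.TripleHandler;

import java.io.ByteArrayOutputStream;
import java.io.File;

public class CountTriples {
    public static void main(String[] args) throws Exception {
        Any23 runner = new Any23(); // default extractor set
        DocumentSource source = new FileDocumentSource(
                new File("sample/microdata_basic.html"),       // placeholder path
                "http://example.org/microdata_basic.html");    // placeholder base IRI
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        TripleHandler handler = new NTriplesWriter(out);
        try {
            runner.extract(source, handler);
        } finally {
            handler.close();
        }
        // each non-blank line of N-Triples output is one extracted statement
        int count = 0;
        for (String line : out.toString("UTF-8").split("\r?\n")) {
            if (!line.trim().isEmpty()) {
                count++;
            }
        }
        System.out.println("extracted triples: " + count);
    }
}
{code}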

----
For (1), the test case expects 39 triples to be extracted. With the SAX 
pre-filter, 39 triples are extracted. Without the SAX pre-filter, only 38 
triples are extracted.

The bad news is, BOTH OF THESE NUMBERS ARE WRONG. *40* triples should be 
extracted.

*Without* the SAX pre-filter, the html-microdata extractor loses 2 triples to 
ANY23-339, bringing the total to 38.

*With* the SAX pre-filter, the filter sees the *meta* element in the following code:
{code:html}
<span itemscope><meta itemprop="name" content="The Castle"></span>
{code}

And tries to wrap it in a *head* element:
{code:html}
<span itemscope="itemscope"></span>
</body><head><meta itemprop="name" content="The Castle"></meta></head><body>
{code}

Which the Jsoup pre-filter then throws out, as it should:
{code:html}
<span itemscope="itemscope"></span>
<meta itemprop="name" content="The Castle" />
{code}

leaving us with an *itemprop* that is *not wrapped in an itemscope* (-2 triples, 
though those would be lost anyway due to ANY23-339) and an EMPTY itemscope 
(+1 triple), bringing the total to 39.
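
To double-check how Jsoup treats that misplaced *head* element in isolation, a quick sketch like the following can be run (this uses plain Jsoup.parse with a hypothetical wrapper document, which may differ in detail from the pre-filtering Any23 actually applies):
{code:java}
import org.jsoup.Jsoup;

public class JsoupHeadCheck {
    public static void main(String[] args) {
        // roughly the markup produced by the SAX pre-filter, wrapped in html/body for parsing
        String html = "<html><body>"
                + "<span itemscope=\"itemscope\"></span>"
                + "</body><head><meta itemprop=\"name\" content=\"The Castle\"/></head><body>"
                + "</body></html>";
        // prints the re-serialized document; the meta ends up as a sibling of the span,
        // outside the itemscope
        System.out.println(Jsoup.parse(html).outerHtml());
    }
}
{code}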

----


For (2), the extraction fails by missing a total of 11 triples, *all of 
which* have the predicate IRI "http://www.w3.org/1999/xhtml/vocab#role".

Of those 11 triples, 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#navigation", 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#search", 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#contentinfo", and 8 triples have the object 
IRI "http://www.w3.org/1999/xhtml/vocab#presentation".

All of these triples are being overlooked by the html-rdfa11 extractor.
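
As a quick sanity check, those triples can be counted directly in the extractor output by filtering on that predicate; a sketch using RDF4J (which Any23 2.x already depends on), assuming the N-Triples output comes from an extraction run like the one sketched above:
{code:java}
import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Model;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.rio.RDFFormat;
import org.eclipse.rdf4j.rio.Rio;

import java.io.StringReader;

public class RoleTripleCount {
    public static void main(String[] args) throws Exception {
        // placeholder: N-Triples produced by an Any23 extraction run (see sketch above)
        String nTriples = "";
        Model model = Rio.parse(new StringReader(nTriples), "", RDFFormat.NTRIPLES);
        IRI role = SimpleValueFactory.getInstance()
                .createIRI("http://www.w3.org/1999/xhtml/vocab#role");
        // with the RDFa 1.0 doctype this prints 0; without it, 11
        System.out.println("role triples: " + model.filter(null, role, null).size());
    }
}
{code}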

They are apparently being overlooked because the document type declaration 
*specifies XHTML+RDFa version 1.0*:
{code:html}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" 
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd";>
{code}
When I either change the document type to XHTML+RDFa version *1.1*:
{code:html}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN" 
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd";>
{code}
or remove the doctype altogether, all 11 triples are extracted as expected.

So, this would be easily fixed just by removing doctypes from all documents.
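
If, instead of editing the sample files, one wanted to strip the doctype programmatically before handing content to Any23, a minimal Jsoup sketch might look like this (a workaround only, not a fix for the underlying extractor behavior):
{code:java}
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.DocumentType;
import org.jsoup.nodes.Node;

import java.util.ArrayList;
import java.util.List;

public class StripDoctype {
    /** Returns the document re-serialized without any doctype declaration. */
    public static String stripDoctype(String html) {
        Document doc = Jsoup.parse(html);
        // collect first so the child list isn't mutated while iterating over it
        List<Node> doctypes = new ArrayList<>();
        for (Node node : doc.childNodes()) {
            if (node instanceof DocumentType) {
                doctypes.add(node);
            }
        }
        for (Node doctype : doctypes) {
            doctype.remove();
        }
        return doc.outerHtml();
    }
}
{code}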

Comments or insight anyone? 

Question: does anyone know whether the RDFa version 1.0 triples extracted from 
a page *are guaranteed to be a subset* of the RDFa version 1.1 triples 
extracted?

  was:
When removing the [SAX parsing 
filter|https://github.com/apache/nutch/blob/2934d4384901d4eda0aeecfa281bfbb2d9b9b0c1/src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java#L111-L116]
 from the Nutch Any23 plugin, the test case fails.

Cf. this pull request: https://github.com/apache/nutch/pull/306

There are two test files: (1) 
[microdata_basic.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/microdata_basic.html],
 and (2) 
[BBC_News_Scotland.html|https://github.com/apache/nutch/blob/master/src/plugin/any23/sample/BBC_News_Scotland.html].

----
For (1), the test case expects 39 triples to be extracted. With the SAX 
pre-filter, 39 triples are extracted. Without the SAX pre-filter, only 38 
triples are extracted.

The bad news is, BOTH OF THESE NUMBERS ARE WRONG. *40* triples should be 
extracted.

*Without* the SAX pre-filter, the html-microdata extractor loses 2 triples to 
ANY23-339, bringing the total to 38.

*With* the SAX pre-filter, it sees the *meta* element in the following code:
{code:html}
<span itemscope><meta itemprop="name" content="The Castle"></span>
{code}

And tries to wrap it in a *head* element:
{code:html}
<span itemscope="itemscope"></span>
</body><head><meta itemprop="name" content="The Castle"></meta></head><body>
{code}

Which the Jsoup pre-filter then throws out, as it should:
{code:html}
<span itemscope="itemscope"></span>
<meta itemprop="name" content="The Castle" />
{code}

leaving us with an item *not wrapped in an itemscope* (-2 triples) (but would 
be -2 anyway due to ANY23-339) and an EMPTY item scope (+1 triples), bringing 
the total to 39. 

----


The extraction fails (2) by failing to extract a total of 11 triples, *all of 
which* have a predicate IRI equal to "http://www.w3.org/1999/xhtml/vocab#role".

Of those 11 triples, 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#navigation", 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#search", 1 triple has the object IRI 
"http://www.w3.org/1999/xhtml/vocab#contentinfo", and 8 triples have the object 
IRI "http://www.w3.org/1999/xhtml/vocab#presentation".

All of these triples are being overlooked by the html-rdfa11 extractor.

The reason they are being overlooked is, apparently, because of the document 
type definition of the document, which is:
{code:html}
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" 
"http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd";>
{code}
The problem seems to lie with the PUBLIC id alone.
 Changing the document type to:
 (1)
{code:html}
<!DOCTYPE html SYSTEM "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
{code}
or (2)
{code:html}
<!DOCTYPE html>
{code}
or (3)
{code:html}
          
{code}
results in all 11 triples being extracted as expected.

So, this would be easily fixed just by removing doctypes from all documents.

Comments or insight anyone?


> Any23 extraction does not pass Nutch plugin test
> ------------------------------------------------
>
>                 Key: ANY23-340
>                 URL: https://issues.apache.org/jira/browse/ANY23-340
>             Project: Apache Any23
>          Issue Type: Bug
>          Components: extractors
>    Affects Versions: 2.2
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
