[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-04-02 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15223021#comment-15223021
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user asfgit closed the pull request at:

https://github.com/apache/any23/pull/17


> FIX Attribute name "itemscope" associated with an element type "html" must be 
> followed by the ' = ' character.
> --
>
> Key: ANY23-247
> URL: https://issues.apache.org/jira/browse/ANY23-247
> Project: Apache Any23
> Issue Type: Improvement
> Affects Versions: 1.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.2
>
>
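> In the following markup
> {code}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
> "http://www.w3.org/TR/html4/loose.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" 
> xmlns:og="http://opengraphprotocol.org/schema/" 
> xmlns:fb="http://www.facebook.com/2008/fbml" version="HTML+RDFa 1.0" 
> xml:lang="en" itemscope itemtype="http://schema.org/Product">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <meta http-equiv="X-UA-Compatible" content="IE=edge" />
> <meta name="generator" content="ToolTwist" />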
> ...
> {code}
> Due to the absence of any subsequent value for *itemscope*, we get the 
> following error in our web server logs
> {code}
> [Fatal Error] :2:185: Attribute name "itemscope" associated with an element 
> type "html" must be followed by the ' = ' character.
> {code}
> Although the markup semantics are incorrect, Any23 should simply check 
> whether the itemscope value is null and, if so, add *=""*. There is a 
> precedent for us doing something like this before; I just can't find the 
> ticket right now!
> The code we need to add belongs in either 
> core/src/main/java/org/apache/any23/extractor/microdata/ItemScope.java or 
> core/src/main/java/org/apache/any23/extractor/microdata/MicrodataParser.java
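
The failure described in the issue, and the proposed *=""* workaround, can be reproduced with the JDK's own strict XML parser. This is only a sketch under assumptions: the class name `ItemscopeDemo`, its helper methods and the regex-based patch are hypothetical and are not taken from Any23. A regex is a blunt tool here (it would also touch a bare "itemscope" token in text content), so a DOM-level fix is likely the more robust route.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class ItemscopeDemo {

    // Hypothetical pattern: a bare "itemscope" token that is not followed by '='.
    private static final Pattern BARE_ITEMSCOPE =
            Pattern.compile("(\\sitemscope)(?=[\\s/>])(?!\\s*=)");

    /** Returns true if a strict XML parser accepts the markup. */
    static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing "[Fatal Error]".
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(
                    markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            return false; // not well-formed XML
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Appends ="" to every bare itemscope attribute. */
    static String addEmptyValue(String markup) {
        return BARE_ITEMSCOPE.matcher(markup).replaceAll("$1=\"\"");
    }

    public static void main(String[] args) {
        String raw =
                "<html itemscope itemtype=\"http://schema.org/Product\"><head/></html>";
        System.out.println(parsesAsXml(raw));                // prints false
        System.out.println(parsesAsXml(addEmptyValue(raw))); // prints true
    }
}
```

An attribute that already has a value, such as `itemscope="x"` or `itemscope = "x"`, is left untouched by the pattern.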



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15215394#comment-15215394
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-202702530
  
@ansell any further comments here? I will try to get to work on the larger 
issue this week. 




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212568#comment-15212568
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-201573785
  
ACK @ansell, the master branch is unstable with the following test failures:


https://builds.apache.org/view/A-D/view/Any23/job/Any23-trunk/1466/#showFailuresLink

If you can reproduce this locally (that is, your test build runs up until it 
fails within core with 3 failing tests), then that is the 'expected' behaviour 
right now. The Microdata test is directly related to the issue we are 
discussing here. 

This issue is the most pressing for Any23 right now; IMHO it is a complete 
blocker to us releasing Any23 1.2.




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212539#comment-15212539
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user ansell commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-201564662
  
I tested this pull request and a few tests fail for me. I know 
that the Any23 master hasn't had a perfect test record (mostly due to 
unreliable remote queries), but I haven't been watching closely enough recently 
to know which tests are expected to fail.




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212500#comment-15212500
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-201550510
  
I agree. Stepping through this in the debugger made me think the same.
I think it would be different if Any23 were to be a PURE implementation... But
that is clearly not the case. Any23 fits in best when it can be used to extract
semantics from any old crap input that it is fed. Parsers and extractors
*should not* fail when there is a piece of crap input HTML. Currently,
that's exactly what happens and it is extremely limiting.

I would like to propose that this PR is committed to master as is; we then
open a brand new issue which acts on exactly your comments: refactoring out
ContentExtractor, reusing the input stream which has been fixed, etc.

Any thoughts, Peter? Thanks for the quick response.

On Friday, March 25, 2016, Peter Ansell  wrote:

> The system does seem a little too complex for our purposes and isn't
> usable because of that.
>
> Removing generics would be the first step IMO as there are too many
> rawtypes definitions which indicate generics are being used badly.
>
> ContentExtractor may be able to be completely removed instead of being
> refitted into the process after that and the parser should always be set to
> parse as far as practical for our purposes.
>
> It is a little strange that there isn't a buffered, markable, InputStream
> provided for all of the steps to reuse as necessary rather than pushing a
> raw InputStream or other source into different extractors.
>


-- 
*Lewis*





[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212495#comment-15212495
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user ansell commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-201545776
  
The system does seem a little too complex for our purposes and isn't usable 
because of that.

Removing generics would be the first step IMO as there are too many 
rawtypes definitions which indicate generics are being used badly.

ContentExtractor may be able to be completely removed instead of being 
refitted into the process after that and the parser should always be set to 
parse as far as practical for our purposes.

It is a little strange that there isn't a buffered, markable, InputStream 
provided for all of the steps to reuse as necessary rather than pushing a raw 
InputStream or other source into different extractors.
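
The idea of one buffered, markable stream shared by several steps can be sketched with the JDK alone. This is an illustration under assumptions: the class name `ReusableInput`, the `readAll` and `readTwice` helpers and the 1 MiB read limit are hypothetical, and nothing here reflects how Any23 actually wires its extractors.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReusableInput {

    // Assumed upper bound on how much one step may read before reset() is invalid.
    static final int READ_LIMIT = 1 << 20;

    /** Drains the stream; stands in for one extraction step. */
    static String readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Runs two consecutive "steps" over the same underlying source. */
    static String[] readTwice(InputStream raw) {
        BufferedInputStream shared = new BufferedInputStream(raw);
        try {
            shared.mark(READ_LIMIT);          // remember the start of the document
            String first = readAll(shared);   // first step consumes the stream
            shared.reset();                   // rewind to the mark
            String second = readAll(shared);  // second step sees the same bytes
            return new String[] {first, second};
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream raw = new ByteArrayInputStream(
                "<div itemscope></div>".getBytes(StandardCharsets.UTF_8));
        String[] reads = readTwice(raw);
        System.out.println(reads[0].equals(reads[1])); // prints true
    }
}
```

The trade-off is memory: `reset()` only works while the bytes read since `mark()` stay within the read limit, so very large documents would need a different strategy.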




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2016-03-25 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15212476#comment-15212476
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-201537723
  
Hi @ansell, OK, I've added the correct rule and fix, as well as a test to 
verify that empty itemscope values are identified and fixed. 
Whilst debugging this, however, the core issue persists. The reason is that 
`RDFa11Extractor extends BaseRDFExtractor`, which inherits the 
[parser function inputstream parameter](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java#L105). 
This input stream is not the 'fixed' stream but the raw document. 
The only way around this that I can think of is for us to either 
 * refactor the [RDFa1.1Extractor](https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java) such that it extends [TagSoupDomExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L60) as opposed to (eventually) the [ContentExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44), or
 * undertake a mass refactoring which essentially removes the [ContentExtractor](https://github.com/apache/any23/blob/63ba2fc82966cc056a2e475af849154d0dfdcf93/api/src/main/java/org/apache/any23/extractor/Extractor.java#L44) altogether... this would provide us with a much more flexible and adaptable extraction framework IMHO.

What do you think?




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-04-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486295#comment-14486295
 ] 

Lewis John McGibbney commented on ANY23-247:


Any ideas about this, [~p_ansell]? Did you manage to debug where and when rules 
and/or fixes are invoked and applied?




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-04-08 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486435#comment-14486435
 ] 

Lewis John McGibbney commented on ANY23-247:


Ack




-- 
*Lewis*




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-04-08 Thread Peter Ansell (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14486412#comment-14486412
 ] 

Peter Ansell commented on ANY23-247:


I think the only place they are defined right now is in 
DefaultValidator.loadDefaultRules, and the only place they are applied is in 
DefaultValidator.validate. You may need to create an instance of Rule to match 
documents that have 'itemscope' and then use the Fix implementation that you 
have written already to patch them with 'itemscope=itemscope'. You pair the 
Rule with the Fix in DefaultValidator.loadDefaultRules

Ideally we would have a FixFactory interface that is implemented for each 
combination of a Rule with an optional Fix. The FixFactory can then be 
registered as a service using META-INF/services, to avoid having them hardcoded 
into DefaultValidator.loadDefaultRules.
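
A rough sketch of the Rule/Fix pairing and the service-based registration described above. Everything named here is an assumption made for illustration: `SketchRule`, `SketchFix`, `SketchFixFactory` and `ItemscopeFactory` are hypothetical, simplified stand-ins and do not reproduce the signatures of Any23's validator classes. Only `java.util.ServiceLoader` and the DOM API are real JDK calls.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.ServiceLoader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class RuleFixSketch {

    /** Hypothetical stand-in for a validation rule. */
    public interface SketchRule { boolean matches(Document doc); }

    /** Hypothetical stand-in for a fix applied when its rule matches. */
    public interface SketchFix { void apply(Document doc); }

    /** Hypothetical factory pairing one rule with an optional fix. */
    public interface SketchFixFactory {
        SketchRule rule();
        Optional<SketchFix> fix();
    }

    /** Collects elements whose itemscope attribute is present but empty. */
    static List<Element> emptyItemscopes(Document doc) {
        List<Element> hits = new ArrayList<>();
        NodeList all = doc.getElementsByTagName("*");
        for (int i = 0; i < all.getLength(); i++) {
            Element e = (Element) all.item(i);
            if (e.hasAttribute("itemscope") && e.getAttribute("itemscope").isEmpty()) {
                hits.add(e);
            }
        }
        return hits;
    }

    /** Pairs the "empty itemscope" rule with an itemscope="itemscope" patch. */
    public static class ItemscopeFactory implements SketchFixFactory {
        @Override public SketchRule rule() {
            return doc -> !emptyItemscopes(doc).isEmpty();
        }
        @Override public Optional<SketchFix> fix() {
            SketchFix patch = doc -> emptyItemscopes(doc)
                    .forEach(e -> e.setAttribute("itemscope", "itemscope"));
            return Optional.of(patch);
        }
    }

    /** Applies every matching rule's fix; returns how many fixes ran. */
    static int validate(Document doc, Iterable<? extends SketchFixFactory> factories) {
        int applied = 0;
        for (SketchFixFactory f : factories) {
            if (f.rule().matches(doc) && f.fix().isPresent()) {
                f.fix().get().apply(doc);
                applied++;
            }
        }
        return applied;
    }

    /** Builds a one-element DOM with an empty itemscope attribute. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            html.setAttribute("itemscope", "");
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<SketchFixFactory> factories = new ArrayList<>();
        factories.add(new ItemscopeFactory()); // hardcoded, like loadDefaultRules
        // Providers listed in a META-INF/services file would be picked up here.
        ServiceLoader.load(SketchFixFactory.class).forEach(factories::add);

        Document doc = sampleDocument();
        System.out.println(validate(doc, factories));
        System.out.println(doc.getDocumentElement().getAttribute("itemscope"));
    }
}
```

With a provider file in place, new Rule/Fix pairs could be added by dropping a jar on the classpath, without editing the validator itself.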



[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388395#comment-14388395
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-88056267
  
By the way @ansell, one observation is that whenever we attempt to infer 
the document language, we never succeed. It always returns null; on 
every single occasion I get back null.




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388384#comment-14388384
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-88051079
  
@ansell done, the branch is now 2 ahead of master




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-31 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14388397#comment-14388397
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on the pull request:

https://github.com/apache/any23/pull/17#issuecomment-88056599
  
When I debug this, a good place to set a breakpoint is at line 

https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/SingleDocumentExtraction.java#L253
The parse still fails on the RDFa 1.1 parser with the following error:
```
[Fatal Error] :23:15: Attribute name "itemscope" associated with an element type "div" must be followed by the ' = ' character.
[2015-03-31 04:46:46,618]DEBUG544766[main] - org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:488) - html-rdfa11: Error while parsing RDF document.
```


> FIX Attribute name "itemscope" associated with an element type "html" must be 
> followed by the ' = ' character.
> --
>
> Key: ANY23-247
> URL: https://issues.apache.org/jira/browse/ANY23-247
> Project: Apache Any23
> Issue Type: Improvement
> Affects Versions: 1.1
> Reporter: Lewis John McGibbney
> Assignee: Lewis John McGibbney
> Fix For: 1.3
>
>
> In the following markup
> {code}
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" 
> "http://www.w3.org/TR/html4/loose.dtd">
> <html xmlns="http://www.w3.org/1999/xhtml" 
> xmlns:og="http://opengraphprotocol.org/schema/" 
> xmlns:fb="http://www.facebook.com/2008/fbml" version="HTML+RDFa 1.0" 
> xml:lang="en" itemscope itemtype="http://schema.org/Product">
> <head>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
> <meta http-equiv="X-UA-Compatible" content="IE=edge" />
> <meta name="generator" content="ToolTwist" />
> ...
> {code}
> Due to the absence of any subsequent value for *itemscope*, we get the 
> following error in our web server logs
> {code}
> [Fatal Error] :2:185: Attribute name "itemscope" associated with an element 
> type "html" must be followed by the ' = ' character.
> {code}
> Although the markup semantics are incorrect, Any23 should simply perform a 
> check for the itemscope value being null, if this is the case then add *=""*, 
> there is a precedent for us doing something like this before, I just cant 
> find the ticket right now!
> The code we need to add is present within either 
> core/src/main/java/org/apache/any23/extractor/microdata/ItemScope.java
> core/src/main/java/org/apache/any23/extractor/microdata/MicrodataParser.java
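
The "add `=""`" idea from the description can be sketched as a text-level pre-processing step. This is only an illustration of the technique, not the change made in the pull request (which introduces a validator class), and `ItemscopeNormalizer` is a hypothetical name. The regular expression is deliberately naive: it can also rewrite the word `itemscope` where it appears inside a quoted attribute value or a comment, so a real implementation would want a proper tokenizer.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Any23: gives bare itemscope attributes an empty value.
final class ItemscopeNormalizer {

    // Group 1: a start tag up to and including a whitespace-preceded "itemscope".
    // The lookaheads require the name to end there and not to be followed by '='.
    private static final Pattern BARE_ITEMSCOPE = Pattern.compile(
            "(<[^<>]*?\\sitemscope)(?=\\s|/|>)(?!\\s*=)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the markup with every bare itemscope rewritten as itemscope="". */
    static String normalize(String markup) {
        return BARE_ITEMSCOPE.matcher(markup).replaceAll("$1=\"\"");
    }

    public static void main(String[] args) {
        System.out.println(normalize("<html xml:lang=\"en\" itemscope itemtype=\"http://schema.org/Product\">"));
    }
}
```

Attributes that already carry a value, such as `itemscope=""`, are left untouched, so running the step twice gives the same result as running it once.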



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386954#comment-14386954
 ] 

ASF GitHub Bot commented on ANY23-247:
--

GitHub user lewismc opened a pull request:

https://github.com/apache/any23/pull/17

ANY23-247 FIX Attribute name "itemscope" associated with an element type "html" 
must be followed by the ' = ' character.

Hi Folks,
This PR fixes the issue locally. I am getting clean builds again 
after introducing the new MissingItemscopeAttributeValueRule class.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/lewismc/any23 ANY23-247

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/any23/pull/17.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17


commit 5ac2307a0245f06f07cbdbe300bc8608f73b1ba1
Author: Lewis John McGibbney lewis.j.mcgibb...@jpl.nasa.gov
Date:   2015-03-30T16:43:25Z

ANY23-247 FIX Attribute name "itemscope" associated with an element type "html" 
must be followed by the ' = ' character.






[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387600#comment-14387600
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/17#discussion_r27442017
  
--- Diff: 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.any23.validator.rule;
+
+import org.apache.any23.validator.DOMDocument;
+import org.apache.any23.validator.Fix;
+import org.apache.any23.validator.Rule;
+import org.apache.any23.validator.RuleContext;
+
+/**
+ * This fixes missing attribute values for the 'itemscope' attribute, 
+ * which was be associated with div nodes.
+ * Typically when such a snippet of XHTML is fed through the 
+ * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
+ * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
+ * it will result in the following behavior. 
+ * <pre>
+ * {@code
+ * [Fatal Error] :23:15: Attribute name "itemscope" associated with an 
element type "div" must be followed by the ' = ' character.
+ * }
+ * </pre>
+ * This Fix is an effort to mitigate against that happening. 
+ *
+ */
+public class MissingItemscopeAttributeValueRule implements Fix {
--- End diff --

I looked for where it is registered during a single document extraction. 
It was my understanding that validations and fixes are registered and 
activated as part of the extraction parameters? If a vanilla 
SingleDocumentExtraction is invoked, as in Any23Test, then the Fixes and 
Validations are activated by default.




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387707#comment-14387707
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/17#discussion_r27444925
  
--- Diff: 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.any23.validator.rule;
+
+import org.apache.any23.validator.DOMDocument;
+import org.apache.any23.validator.Fix;
+import org.apache.any23.validator.Rule;
+import org.apache.any23.validator.RuleContext;
+
+/**
+ * This fixes missing attribute values for the 'itemscope' attribute, 
+ * which was be associated with div nodes.
+ * Typically when such a snippet of XHTML is fed through the 
+ * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
+ * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
+ * it will result in the following behavior. 
+ * <pre>
+ * {@code
+ * [Fatal Error] :23:15: Attribute name "itemscope" associated with an 
element type "div" must be followed by the ' = ' character.
+ * }
+ * </pre>
+ * This Fix is an effort to mitigate against that happening. 
+ *
+ */
+public class MissingItemscopeAttributeValueRule implements Fix {
--- End diff --

Everything I've uploaded to the patch is what I have coded. There is no
other black magic on my end to get this invoked.

On Monday, March 30, 2015, Lewis John Mcgibbney lewis.mcgibb...@gmail.com
wrote:

 Ack

 On Monday, March 30, 2015, Peter Ansell notificati...@github.com wrote:

 In
 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 https://github.com/apache/any23/pull/17#discussion_r27442717:

  +/**
  + * This fixes missing attribute values for the 'itemscope' attribute,
  + * which was be associated with div nodes.
  + * Typically when such a snippet of XHTML is fed through the
  + * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
  + * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
  + * it will result in the following behavior.
  + * <pre>
  + * {@code
  + * [Fatal Error] :23:15: Attribute name "itemscope" associated with 
an element type "div" must be followed by the ' = ' character.
  + * }
  + * </pre>
  + * This Fix is an effort to mitigate against that happening.
  + *
  + */
  +public class MissingItemscopeAttributeValueRule implements Fix {

 It may be done using a classpath scan. I will look into it further.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/any23/pull/17/files#r27442717.



 --
 *Lewis*



-- 
*Lewis*




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387704#comment-14387704
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user lewismc commented on a diff in the pull request:

https://github.com/apache/any23/pull/17#discussion_r27444885
  
--- Diff: 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.any23.validator.rule;
+
+import org.apache.any23.validator.DOMDocument;
+import org.apache.any23.validator.Fix;
+import org.apache.any23.validator.Rule;
+import org.apache.any23.validator.RuleContext;
+
+/**
+ * This fixes missing attribute values for the 'itemscope' attribute, 
+ * which was be associated with div nodes.
+ * Typically when such a snippet of XHTML is fed through the 
+ * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
+ * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
+ * it will result in the following behavior. 
+ * <pre>
+ * {@code
+ * [Fatal Error] :23:15: Attribute name "itemscope" associated with an 
element type "div" must be followed by the ' = ' character.
+ * }
+ * </pre>
+ * This Fix is an effort to mitigate against that happening. 
+ *
+ */
+public class MissingItemscopeAttributeValueRule implements Fix {
--- End diff --

Ack

On Monday, March 30, 2015, Peter Ansell notificati...@github.com wrote:

 In
 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 https://github.com/apache/any23/pull/17#discussion_r27442717:

  +/**
  + * This fixes missing attribute values for the 'itemscope' attribute,
  + * which was be associated with div nodes.
  + * Typically when such a snippet of XHTML is fed through the
  + * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
  + * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
  + * it will result in the following behavior.
  + * <pre>
  + * {@code
  + * [Fatal Error] :23:15: Attribute name "itemscope" associated with an 
element type "div" must be followed by the ' = ' character.
  + * }
  + * </pre>
  + * This Fix is an effort to mitigate against that happening.
  + *
  + */
  +public class MissingItemscopeAttributeValueRule implements Fix {

 It may be done using a classpath scan. I will look into it further.

 —
 Reply to this email directly or view it on GitHub
 https://github.com/apache/any23/pull/17/files#r27442717.



-- 
*Lewis*




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-30 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14387757#comment-14387757
 ] 

ASF GitHub Bot commented on ANY23-247:
--

Github user ansell commented on a diff in the pull request:

https://github.com/apache/any23/pull/17#discussion_r27446443
  
--- Diff: 
core/src/main/java/org/apache/any23/validator/rule/MissingItemscopeAttributeValueRule.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *  http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.any23.validator.rule;
+
+import org.apache.any23.validator.DOMDocument;
+import org.apache.any23.validator.Fix;
+import org.apache.any23.validator.Rule;
+import org.apache.any23.validator.RuleContext;
+
+/**
+ * This fixes missing attribute values for the 'itemscope' attribute, 
+ * which was be associated with div nodes.
+ * Typically when such a snippet of XHTML is fed through the 
+ * {@link org.apache.any23.extractor.rdfa.RDFa11Extractor}, and
+ * subsequently to Sesame's {@link 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser},
+ * it will result in the following behavior. 
+ * <pre>
+ * {@code
+ * [Fatal Error] :23:15: Attribute name "itemscope" associated with an 
element type "div" must be followed by the ' = ' character.
+ * }
+ * </pre>
+ * This Fix is an effort to mitigate against that happening. 
+ *
+ */
+public class MissingItemscopeAttributeValueRule implements Fix {
--- End diff --

There is a hardcoded set in DefaultValidator.loadDefaultRules, but I can't 
find any place that is doing classpath scanning there.

I also do not understand the relationship between Rule and Fix. In the 
DefaultValidator, there are either Rule, or Rule+Fix, not just a Fix like you 
have here.

I will look into it further when I get a chance.
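
To make the Rule versus Fix question concrete, here is a small self-contained model of a validator in which fixes are only ever reached through the rule they are registered with. Everything in it is hypothetical: the `Rule` and `Fix` interfaces, the `addRule` methods, and the use of an attribute map as the "document" are simplified stand-ins invented for this sketch, not the signatures in `org.apache.any23.validator`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model, not the Any23 DefaultValidator.
final class RuleFixSketch {

    /** Stand-in for a rule: detects a problem. A null map value models a bare attribute. */
    interface Rule { boolean appliesTo(Map<String, String> attributes); }

    /** Stand-in for a fix: repairs what its rule detected. */
    interface Fix { void execute(Map<String, String> attributes); }

    // Each registered rule maps to the fixes run when it fires; the list may be empty.
    private final Map<Rule, List<Fix>> rulesToFixes = new LinkedHashMap<>();

    void addRule(Rule rule) {
        rulesToFixes.computeIfAbsent(rule, r -> new ArrayList<>());
    }

    void addRule(Rule rule, Fix fix) {
        rulesToFixes.computeIfAbsent(rule, r -> new ArrayList<>()).add(fix);
    }

    /** Runs every rule, applies the fixes of those that fire, and returns how many fired. */
    int validate(Map<String, String> attributes, boolean applyFixes) {
        int fired = 0;
        for (Map.Entry<Rule, List<Fix>> entry : rulesToFixes.entrySet()) {
            if (entry.getKey().appliesTo(attributes)) {
                fired++;
                if (applyFixes) {
                    for (Fix fix : entry.getValue()) {
                        fix.execute(attributes);
                    }
                }
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        RuleFixSketch validator = new RuleFixSketch();
        validator.addRule(
                a -> a.containsKey("itemscope") && a.get("itemscope") == null, // rule: bare itemscope
                a -> a.put("itemscope", ""));                                  // fix: add empty value
        Map<String, String> attributes = new HashMap<>();
        attributes.put("itemscope", null);
        System.out.println(validator.validate(attributes, true)); // the rule fires once
        System.out.println(validator.validate(attributes, true)); // nothing left to fix
    }
}
```

In a design like this a class that is only a Fix has no entry point of its own, since the loop iterates over rules. If the real validator works along similar lines, that would explain why a standalone Fix needs a companion Rule and an explicit registration.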




[jira] [Commented] (ANY23-247) FIX Attribute name "itemscope" associated with an element type "html" must be followed by the ' = ' character.

2015-03-26 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/ANY23-247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381978#comment-14381978
 ] 

Lewis John McGibbney commented on ANY23-247:


An example of a failing test for this issue
{code}
org.apache.any23.Any23Test.testMicrodataSupport
Failing for the past 6 builds (Since Unstable#1309 )
Took 0.43 sec.
Error Message

Error while parsing RDF document.

Stacktrace

org.apache.any23.extractor.ExtractionException: Error while parsing RDF 
document.
at 
com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1236)
at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
at 
org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
at 
org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
at 
org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
at 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:109)
at 
org.semarglproject.sesame.rdf.rdfa.SesameRDFaParser.parse(SesameRDFaParser.java:95)
at 
org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:105)
at 
org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:41)
at 
org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:462)
at 
org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:254)
at org.apache.any23.Any23.extract(Any23.java:298)
at org.apache.any23.Any23.extract(Any23.java:433)
at org.apache.any23.Any23.extract(Any23.java:347)
at org.apache.any23.Any23Test.detectAndExtract(Any23Test.java:559)
at 
org.apache.any23.Any23Test.assertExtractorActivation(Any23Test.java:590)
at org.apache.any23.Any23Test.testMicrodataSupport(Any23Test.java:484)

Standard Output

[2015-03-26 02:01:37,665] INFO  4947[main] - 
org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:221)
 - Processing http://host.com/path
  

Standard Error

[Fatal Error] :23:15: Attribute name "itemscope" associated with an element 
type "div" must be followed by the ' = ' character.

{code}
