[GitHub] [any23] HansBrende commented on issue #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
HansBrende commented on issue #141: ANY23-443 improve speed & stability of RDFa 
extractors
URL: https://github.com/apache/any23/pull/141#issuecomment-533768308
 
 
   @lewismc done. How'd your testing go?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693171
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
 ##
 @@ -0,0 +1,159 @@
+package org.apache.any23.extractor.rdfa;
+
+import org.jsoup.nodes.CDataNode;
+import org.jsoup.nodes.Comment;
+import org.jsoup.nodes.Element;
+import org.jsoup.nodes.Node;
+import org.jsoup.nodes.TextNode;
+import org.jsoup.select.NodeVisitor;
+import org.semarglproject.sink.XmlSink;
+import org.xml.sax.SAXException;
+import org.xml.sax.helpers.AttributesImpl;
+import org.xml.sax.helpers.NamespaceSupport;
+
+import java.util.ArrayList;
+
+class JsoupScanner implements NodeVisitor {
+
+private final NamespaceSupport ns = new NamespaceSupport();
+private final AttributesImpl attrs = new AttributesImpl();
+private final String[] nameParts = new String[3];
+
+private final XmlSink handler;
+
+JsoupScanner(XmlSink handler) {
+this.handler = handler;
+}
+
+private static String orEmpty(String str) {
+return str == null ? "" : str;
+}
+
+//private static String orNull(String str) {
 
 Review comment:
   Just remove???




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693774
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
 ##
 @@ -45,6 +47,7 @@ public ExtractorDescription getDescription() {
 }
 
 @Override
+@Deprecated
 
 Review comment:
   Same here.




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693850
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
 ##
 @@ -31,12 +32,13 @@
  */
 public class RDFaExtractor extends BaseRDFaExtractor {
 
+@Deprecated
 
 Review comment:
   Same here




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693115
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
 ##
 @@ -0,0 +1,159 @@
+package org.apache.any23.extractor.rdfa;
 
 Review comment:
   License header please




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693308
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
 ##
 @@ -0,0 +1,159 @@
+package org.apache.any23.extractor.rdfa;
+
+import org.jsoup.nodes.CDataNode;
+import org.jsoup.nodes.Comment;
+import org.jsoup.nodes.Element;
+import org.jsoup.nodes.Node;
+import org.jsoup.nodes.TextNode;
+import org.jsoup.select.NodeVisitor;
+import org.semarglproject.sink.XmlSink;
+import org.xml.sax.SAXException;
+import org.xml.sax.helpers.AttributesImpl;
+import org.xml.sax.helpers.NamespaceSupport;
+
+import java.util.ArrayList;
+
+class JsoupScanner implements NodeVisitor {
+
+private final NamespaceSupport ns = new NamespaceSupport();
+private final AttributesImpl attrs = new AttributesImpl();
+private final String[] nameParts = new String[3];
+
+private final XmlSink handler;
+
+JsoupScanner(XmlSink handler) {
+this.handler = handler;
+}
+
+private static String orEmpty(String str) {
+return str == null ? "" : str;
+}
+
+//private static String orNull(String str) {
+//return "".equals(str) ? null : str;
+//}
+
+private void startElement(Element e) throws SAXException {
+ns.pushContext();
+
+attrs.clear();
+final ArrayList<String> remainingAttrs = new ArrayList<>();
+for (org.jsoup.nodes.Attribute attr : e.attributes()) {
+String name = attr.getKey();
+String value = attr.getValue();
+if (name.startsWith("xmlns")) {
+if (name.length() == 5) {
+ns.declarePrefix("", value);
+handler.startPrefixMapping("", value);
+continue;
+} else if (name.charAt(5) == ':') {
+String localName = name.substring(6);
+ns.declarePrefix(localName, value);
+handler.startPrefixMapping(localName, value);
+continue;
+}
+}
+
+remainingAttrs.add(name);
+remainingAttrs.add(value);
+}
+
+for (int i = 0, len = remainingAttrs.size(); i < len; i += 2) {
+String name = remainingAttrs.get(i);
+String value = remainingAttrs.get(i + 1);
+String[] parts = ns.processName(name, nameParts, true);
+if (parts != null) {
+attrs.addAttribute(orEmpty(parts[0]), orEmpty(parts[1]), parts[2], "CDATA", value);
+}
+}
+
+String qName = e.tagName();
+
+String[] parts = ns.processName(qName, nameParts, false);
+if (parts == null) {
+handler.startElement("", "", qName, attrs);
+} else {
+handler.startElement(orEmpty(parts[0]), orEmpty(parts[1]), parts[2], attrs);
+}
+
+}
+
+private void endElement(Element e) throws SAXException {
+
+String qName = e.tagName();
+String[] parts = ns.processName(qName, nameParts, false);
+if (parts == null) {
+handler.endElement("", "", qName);
+} else {
+handler.endElement(orEmpty(parts[0]), orEmpty(parts[1]), parts[2]);
+}
+
+for (org.jsoup.nodes.Attribute attr : e.attributes()) {
+String name = attr.getKey();
+if (name.startsWith("xmlns")) {
+if (name.length() == 5) {
+handler.endPrefixMapping("");
+} else if (name.charAt(5) == ':') {
+String localName = name.substring(6);
+handler.endPrefixMapping(localName);
+}
+}
+}
+
+ns.popContext();
+}
+
+private void handleText(String str) throws SAXException {
+handler.characters(str.toCharArray(), 0, str.length());
+}
+
+private void handleComment(String str) throws SAXException {
+handler.comment(str.toCharArray(), 0, str.length());
+}
+
 
 Review comment:
   Remove additional whitespace
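
For readers following the prefix handling in the hunk above, here is a minimal sketch of how `org.xml.sax.helpers.NamespaceSupport.processName` (the JDK class the scanner relies on) resolves a qualified name. The class name `ProcessNameSketch`, the `resolve` helper, and the `dc` prefix with the Dublin Core URI are illustrative assumptions, not part of the PR.

```java
import org.xml.sax.helpers.NamespaceSupport;

public class ProcessNameSketch {

    // Declares one (hypothetical) prefix in a fresh context and resolves qName.
    // Returns {namespaceURI, localName, rawName}, or null if the prefix in
    // qName has not been declared.
    static String[] resolve(String prefix, String uri, String qName, boolean isAttribute) {
        NamespaceSupport ns = new NamespaceSupport();
        ns.pushContext();              // open a scope, as startElement does
        ns.declarePrefix(prefix, uri); // what an xmlns:prefix attribute triggers
        String[] parts = ns.processName(qName, new String[3], isAttribute);
        ns.popContext();               // close the scope, as endElement does
        return parts;
    }

    public static void main(String[] args) {
        String dc = "http://purl.org/dc/elements/1.1/";

        String[] known = resolve("dc", dc, "dc:title", false);
        System.out.println(known[0] + " | " + known[1] + " | " + known[2]);

        // An undeclared prefix cannot be resolved, so processName returns null.
        String[] unknown = resolve("dc", dc, "foo:bar", false);
        System.out.println(unknown == null);
    }
}
```

The null result for an undeclared prefix is the case the diff handles by falling back to `handler.startElement("", "", qName, attrs)`.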




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693417
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
 ##
 @@ -0,0 +1,159 @@
+package org.apache.any23.extractor.rdfa;
+
+import org.jsoup.nodes.CDataNode;
+import org.jsoup.nodes.Comment;
+import org.jsoup.nodes.Element;
+import org.jsoup.nodes.Node;
+import org.jsoup.nodes.TextNode;
+import org.jsoup.select.NodeVisitor;
+import org.semarglproject.sink.XmlSink;
+import org.xml.sax.SAXException;
+import org.xml.sax.helpers.AttributesImpl;
+import org.xml.sax.helpers.NamespaceSupport;
+
+import java.util.ArrayList;
+
+class JsoupScanner implements NodeVisitor {
+
+private final NamespaceSupport ns = new NamespaceSupport();
+private final AttributesImpl attrs = new AttributesImpl();
+private final String[] nameParts = new String[3];
+
+private final XmlSink handler;
+
+JsoupScanner(XmlSink handler) {
+this.handler = handler;
+}
+
+private static String orEmpty(String str) {
+return str == null ? "" : str;
+}
+
+//private static String orNull(String str) {
+//return "".equals(str) ? null : str;
+//}
+
+private void startElement(Element e) throws SAXException {
+ns.pushContext();
+
+attrs.clear();
+final ArrayList<String> remainingAttrs = new ArrayList<>();
+for (org.jsoup.nodes.Attribute attr : e.attributes()) {
+String name = attr.getKey();
+String value = attr.getValue();
+if (name.startsWith("xmlns")) {
+if (name.length() == 5) {
+ns.declarePrefix("", value);
+handler.startPrefixMapping("", value);
+continue;
+} else if (name.charAt(5) == ':') {
+String localName = name.substring(6);
+ns.declarePrefix(localName, value);
+handler.startPrefixMapping(localName, value);
+continue;
+}
+}
+
+remainingAttrs.add(name);
+remainingAttrs.add(value);
+}
+
+for (int i = 0, len = remainingAttrs.size(); i < len; i += 2) {
+String name = remainingAttrs.get(i);
+String value = remainingAttrs.get(i + 1);
+String[] parts = ns.processName(name, nameParts, true);
+if (parts != null) {
+attrs.addAttribute(orEmpty(parts[0]), orEmpty(parts[1]), parts[2], "CDATA", value);
+}
+}
+
+String qName = e.tagName();
+
+String[] parts = ns.processName(qName, nameParts, false);
+if (parts == null) {
+handler.startElement("", "", qName, attrs);
+} else {
+handler.startElement(orEmpty(parts[0]), orEmpty(parts[1]), parts[2], attrs);
+}
+
+}
+
+private void endElement(Element e) throws SAXException {
+
+String qName = e.tagName();
+String[] parts = ns.processName(qName, nameParts, false);
+if (parts == null) {
+handler.endElement("", "", qName);
+} else {
+handler.endElement(orEmpty(parts[0]), orEmpty(parts[1]), parts[2]);
+}
+
+for (org.jsoup.nodes.Attribute attr : e.attributes()) {
+String name = attr.getKey();
+if (name.startsWith("xmlns")) {
+if (name.length() == 5) {
+handler.endPrefixMapping("");
+} else if (name.charAt(5) == ':') {
+String localName = name.substring(6);
+handler.endPrefixMapping(localName);
+}
+}
+}
+
+ns.popContext();
+}
+
+private void handleText(String str) throws SAXException {
+handler.characters(str.toCharArray(), 0, str.length());
+}
+
+private void handleComment(String str) throws SAXException {
+handler.comment(str.toCharArray(), 0, str.length());
+}
+
+
+@Override
+public void head(Node node, int depth) {
+try {
+if (node instanceof Element) {
+startElement((Element) node);
+} else if (node instanceof CDataNode) {
+handler.startCDATA();
+handleText(((CDataNode) node).text());
+} else if (node instanceof TextNode) {
+handleText(((TextNode) node).text());
+// TODO support document types
+//} else if (node instanceof DocumentType) {
+//DocumentType dt = (DocumentType)node;
+//handler.startDTD(dt.attr("name"), orNull(dt.attr("publicId")), orNull(dt.attr("systemId")));
+} else if (node instanceof Comment) {
+handleComment(((Comment) node).getData());
+}
+} catch (SAXException e) {
+   

[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693877
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
 ##
 @@ -45,6 +47,7 @@ public ExtractorDescription getDescription() {
 }
 
 @Override
+@Deprecated
 
 Review comment:
   Same here




[GitHub] [any23] lewismc commented on a change in pull request #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
lewismc commented on a change in pull request #141: ANY23-443 improve speed & 
stability of RDFa extractors
URL: https://github.com/apache/any23/pull/141#discussion_r326693916
 
 

 ##
 File path: core/src/main/java/org/apache/any23/extractor/rdfa/SemarglSink.java
 ##
 @@ -0,0 +1,79 @@
+package org.apache.any23.extractor.rdfa;
 
 Review comment:
   License header




[jira] [Commented] (ANY23-428) RDFa parse issue if vocab not defined with training slash

2019-09-20 Thread David Cockbill (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934387#comment-16934387
 ] 

David Cockbill commented on ANY23-428:
--

Raised: RDFa parse issue if vocab not defined with training slash 
[#53|https://github.com/semarglproject/semargl/issues/53]

> RDFa parse issue if vocab not defined with training slash
> -
>
> Key: ANY23-428
> URL: https://issues.apache.org/jira/browse/ANY23-428
> Project: Apache Any23
>  Issue Type: Bug
>  Components: extractors
>Affects Versions: 2.3
>Reporter: David Cockbill
>Priority: Minor
>
> If an RDFa vocab URL is missing a trailing forward slash, then the properties 
> are not expanded correctly.
> For example:
>  
> {code:java}
> <... vocab="https://schema.org" typeof="BreadcrumbList">
> {code}
> rather than:
>  
> {code:java}
> <... vocab="https://schema.org/" typeof="BreadcrumbList">
> {code}
> produces properties that look (in nTriples) as follows:
>  
>  
> {code:java}
>   
>  .
> _:n0  
>  .
> _:n1  
>  .
> {code}
>  
>  
> I'm sure the intention should be to join the properties and vocab with a 
> forward slash.
>  
>  
>  
>  
>  
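
To illustrate the behaviour reported above: as far as I understand RDFa 1.1 term expansion, a term is resolved by concatenating the `vocab` value and the term, so a missing trailing slash yields a fused IRI. The following is only a minimal sketch. `VocabJoinSketch`, `expand`, and `normalizeVocab` are hypothetical names, and the normalization shown is one possible workaround, not what Any23 or Semargl actually does.

```java
public class VocabJoinSketch {

    // Plain concatenation of vocabulary IRI and term (assumed expansion rule).
    static String expand(String vocab, String term) {
        return vocab + term;
    }

    // Hypothetical workaround: append "/" unless the vocabulary IRI is empty
    // or already ends with a separator ("/" or "#").
    static String normalizeVocab(String vocab) {
        if (vocab.isEmpty() || vocab.endsWith("/") || vocab.endsWith("#")) {
            return vocab;
        }
        return vocab + "/";
    }

    public static void main(String[] args) {
        String term = "itemListElement";

        // Without the trailing slash the host name and the term run together.
        System.out.println(expand("https://schema.org", term));

        // With normalization the expected property IRI is produced.
        System.out.println(expand(normalizeVocab("https://schema.org"), term));
    }
}
```

Whether such normalization is appropriate is a separate question, since a vocabulary IRI without a trailing separator may be intentional in other contexts.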



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[GitHub] [any23] HansBrende commented on issue #141: ANY23-443 improve speed & stability of RDFa extractors

2019-09-20 Thread GitBox
HansBrende commented on issue #141: ANY23-443 improve speed & stability of RDFa 
extractors
URL: https://github.com/apache/any23/pull/141#issuecomment-533593053
 
 
   @lewismc any comments on this PR?




[jira] [Commented] (ANY23-428) RDFa parse issue if vocab not defined with training slash

2019-09-20 Thread Lewis John McGibbney (Jira)


[ 
https://issues.apache.org/jira/browse/ANY23-428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16934503#comment-16934503
 ] 

Lewis John McGibbney commented on ANY23-428:


Thanks [~davidcockbill]



