[jira] [Created] (ANY23-351) NullPointerException in HCardExtractor
Hans Brende created ANY23-351:

Summary: NullPointerException in HCardExtractor
Key: ANY23-351
URL: https://issues.apache.org/jira/browse/ANY23-351
Project: Apache Any23
Issue Type: Bug
Components: microformats
Affects Versions: 2.3
Reporter: Hans Brende

When extracting from the url: https://cambridgewi.com/make-cambridge-home/char/V/
I get the following NullPointerException, which kills the entire extraction process:

{code}
java.lang.NullPointerException
    at org.apache.any23.extractor.html.HTMLDocument.readUrlField(HTMLDocument.java:119)
    at org.apache.any23.extractor.html.HTMLDocument.getPluralUrlField(HTMLDocument.java:288)
    at org.apache.any23.extractor.html.HCardExtractor.addLogo(HCardExtractor.java:267)
    at org.apache.any23.extractor.html.HCardExtractor.extractEntity(HCardExtractor.java:130)
    at org.apache.any23.extractor.html.EntityBasedMicroformatExtractor.extract(EntityBasedMicroformatExtractor.java:66)
    at org.apache.any23.extractor.html.MicroformatExtractor.run(MicroformatExtractor.java:102)
    at org.apache.any23.extractor.html.MicroformatExtractor.run(MicroformatExtractor.java:44)
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:480)
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:259)
    at org.apache.any23.Any23.extract(Any23.java:302)
    at org.apache.any23.Any23.extract(Any23.java:437)
{code}

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
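The report does not identify which value is null. As a purely illustrative sketch, assuming the null comes from an element that lacks the attribute being read (an assumption the stack trace does not confirm), a null-safe read over the JDK's org.w3c.dom API might look like the following. The class and method names are hypothetical and are not Any23's.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper for illustration only; this is not Any23 code.
final class NullSafeUrlFields {

    /** Collects one attribute from every element with the given tag, skipping elements that lack it. */
    static List<String> readUrlFields(Document doc, String tagName, String attribute) {
        List<String> urls = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            // getNamedItem returns null when the attribute is absent;
            // dereferencing it unchecked would throw a NullPointerException.
            Node attr = nodes.item(i).getAttributes().getNamedItem(attribute);
            if (attr != null) {
                urls.add(attr.getNodeValue());
            }
        }
        return urls;
    }

    /** Parses a small well-formed XML string; checked exceptions are wrapped for brevity. */
    static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The first img has no src attribute and is skipped instead of aborting the loop.
        Document doc = parse("<div><img class=\"logo\"/><img class=\"logo\" src=\"a.png\"/></div>");
        System.out.println(readUrlFields(doc, "img", "src"));
    }
}
```

Skipping the element keeps the remaining fields extractable. Whether a missing value should be skipped silently or reported is a design decision the report leaves open.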
[jira] [Updated] (ANY23-349) MicrodataExtractor errors for links that are telephone numbers
[ https://issues.apache.org/jira/browse/ANY23-349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-349:

Summary: MicrodataExtractor errors for links that are telephone numbers (was: MicrodataExtractor )

> MicrodataExtractor errors for links that are telephone numbers
>
> Key: ANY23-349
> URL: https://issues.apache.org/jira/browse/ANY23-349
> Project: Apache Any23
> Issue Type: Bug
> Components: microdata
> Affects Versions: 2.3
> Reporter: Hans Brende
> Priority: Major
>
> I get the following error when extracting from
> http://clubzone.com/ontario-los-angeles/places/
> This error kills the whole extraction process.
> {code}
> Exception in thread "main" org.apache.any23.extractor.ExtractionException: Error while processing on subject '_:node1cb6a1b5jx5' the itemProp: '{ "xpath" : "/HTML[1]/BODY[1]/DIV[3]/DIV[3]/DIV[1]/DIV[1]/SECTION[1]/ARTICLE[2]/DIV[2]/DIV[1]/P[2]/A[1]", "name" : "telephone", "value" : { "content" : "tel:(909) 484-2020", "type" : "Link" } }'
>     at org.apache.any23.extractor.microdata.MicrodataExtractor.processType(MicrodataExtractor.java:442)
>     at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:116)
>     at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
> {code}
[jira] [Created] (ANY23-350) RDFParseException: "icon" must be followed by ' = ' character
Hans Brende created ANY23-350:

Summary: RDFParseException: "icon" must be followed by ' = ' character
Key: ANY23-350
URL: https://issues.apache.org/jira/browse/ANY23-350
Project: Apache Any23
Issue Type: Bug
Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende

I get the following error log when extracting from: https://gunshowtrader.com/gunshows/arkansas/
Haven't had time to debug this.

{code}
ERROR org.apache.any23.extractor.rdf.BaseRDFExtractor - Error while parsing RDF document.
org.eclipse.rdf4j.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 4536; Attribute name "icon" associated with an element type "link" must be followed by the ' = ' character.
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:95)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:158)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:57)
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:471)
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:259)
    at org.apache.any23.Any23.extract(Any23.java:302)
    at org.apache.any23.Any23.extract(Any23.java:437)
    at com.utownapp.crawl.tripledb.Triples.lambda$extractTriples$0(Triples.java:146)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 4536; Attribute name "icon" associated with an element type "link" must be followed by the ' = ' character.
    at org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1141)
    at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
    at org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
    at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
    at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:109)
    ... 12 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 4536; Attribute name "icon" associated with an element type "link" must be followed by the ' = ' character.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
    ... 16 more
{code}
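The trace shows the RDFa parser handing the page to an XML SAX parser. XML requires every attribute to carry a value, whereas HTML allows bare attributes, so a strict XML parse of real-world HTML can fail this way. The sketch below reproduces that class of error with the JDK's built-in SAX parser, not the Xerces build in the trace. The markup is a made-up stand-in, since the report does not show what sits at column 4536.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: strict XML parsing rejects an attribute that has no value.
final class ValuelessAttributeDemo {

    /** Returns null if the markup parses as XML, otherwise the parser's error message. */
    static String xmlError(String markup) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Bare "icon" attribute: acceptable to an HTML parser, a fatal error in XML.
        System.out.println(xmlError("<link icon href=\"/favicon.ico\"/>"));
        // Same attribute with an explicit (empty) value: well-formed XML.
        System.out.println(xmlError("<link icon=\"\" href=\"/favicon.ico\"/>"));
    }
}
```

In this case the exception is logged at ERROR level by BaseRDFExtractor rather than thrown out of Any23.extract, and the report does not say whether the remaining extractors still produced output.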
[jira] [Created] (ANY23-349) MicrodataExtractor
Hans Brende created ANY23-349:

Summary: MicrodataExtractor
Key: ANY23-349
URL: https://issues.apache.org/jira/browse/ANY23-349
Project: Apache Any23
Issue Type: Bug
Components: microdata
Affects Versions: 2.3
Reporter: Hans Brende

I get the following error when extracting from http://clubzone.com/ontario-los-angeles/places/
This error kills the whole extraction process.

{code}
Exception in thread "main" org.apache.any23.extractor.ExtractionException: Error while processing on subject '_:node1cb6a1b5jx5' the itemProp: '{ "xpath" : "/HTML[1]/BODY[1]/DIV[3]/DIV[3]/DIV[1]/DIV[1]/SECTION[1]/ARTICLE[2]/DIV[2]/DIV[1]/P[2]/A[1]", "name" : "telephone", "value" : { "content" : "tel:(909) 484-2020", "type" : "Link" } }'
    at org.apache.any23.extractor.microdata.MicrodataExtractor.processType(MicrodataExtractor.java:442)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:116)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
{code}
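The report does not say why the tel: link fails, and the following sketch does not claim to reproduce Any23's internals. It illustrates two narrower points: java.net.URI rejects the reported value because it contains a space, and a loop over link values can record such a value and continue instead of aborting. The class name and the Result holder are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: tolerate individual bad link values instead of failing the whole run.
final class LenientLinkProcessing {

    /** Holds the links that parsed and the raw values that did not. */
    static final class Result {
        final List<URI> parsed = new ArrayList<>();
        final List<String> skipped = new ArrayList<>();
    }

    static Result parseLinks(List<String> rawLinks) {
        Result result = new Result();
        for (String raw : rawLinks) {
            try {
                result.parsed.add(new URI(raw));
            } catch (URISyntaxException e) {
                // Record the offending value and keep going with the rest.
                result.skipped.add(raw);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Result r = parseLinks(Arrays.asList(
                "http://clubzone.com/ontario-los-angeles/places/",
                "tel:(909) 484-2020"));
        System.out.println("parsed=" + r.parsed + " skipped=" + r.skipped);
    }
}
```

Collecting skipped values rather than discarding them keeps the problem visible to the caller, which matters when a single malformed property would otherwise kill the extraction of an entire page.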
[jira] [Updated] (ANY23-348) IllegalArgumentException in MicrodataExtractor
[ https://issues.apache.org/jira/browse/ANY23-348?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-348:

Description:
I get the following IllegalArgumentException when extracting from http://movies.eventful.com/theaters-showtimes/canyon-meadows-/T0-001-05891-8
I also get it when extracting from: http://eventful.com/performers
This IllegalArgumentException kills the whole extraction process.
Haven't had time to debug this.
{code}
Exception in thread "main" java.lang.IllegalArgumentException: Invalid type '', must be a valid URL.
    at org.apache.any23.extractor.microdata.ItemScope.<init>(ItemScope.java:81)
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:509)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:196)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:213)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
{code}

was:
I get the following IllegalArgumentException when extracting from http://movies.eventful.com/theaters-showtimes/canyon-meadows-/T0-001-05891-8
This IllegalArgumentException kills the whole extraction process.
Haven't had time to debug this.
{code}
Exception in thread "main" java.lang.IllegalArgumentException: Invalid type '', must be a valid URL.
    at org.apache.any23.extractor.microdata.ItemScope.<init>(ItemScope.java:81)
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:509)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:196)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:213)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
{code}

> IllegalArgumentException in MicrodataExtractor
>
> Key: ANY23-348
> URL: https://issues.apache.org/jira/browse/ANY23-348
> Project: Apache Any23
> Issue Type: Bug
> Components: microdata
> Affects Versions: 2.3
> Reporter: Hans Brende
> Priority: Major
>
> I get the following IllegalArgumentException when extracting from
> http://movies.eventful.com/theaters-showtimes/canyon-meadows-/T0-001-05891-8
> I also get it when extracting from: http://eventful.com/performers
> This IllegalArgumentException kills the whole extraction process.
> Haven't had time to debug this.
> {code}
> Exception in thread "main" java.lang.IllegalArgumentException: Invalid type '', must be a valid URL.
>     at org.apache.any23.extractor.microdata.ItemScope.<init>(ItemScope.java:81)
>     at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:509)
>     at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:196)
>     at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:213)
>     at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89)
>     at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
> {code}
[jira] [Created] (ANY23-348) IllegalArgumentException in MicrodataExtractor
Hans Brende created ANY23-348:

Summary: IllegalArgumentException in MicrodataExtractor
Key: ANY23-348
URL: https://issues.apache.org/jira/browse/ANY23-348
Project: Apache Any23
Issue Type: Bug
Components: microdata
Affects Versions: 2.3
Reporter: Hans Brende

I get the following IllegalArgumentException when extracting from http://movies.eventful.com/theaters-showtimes/canyon-meadows-/T0-001-05891-8
This IllegalArgumentException kills the whole extraction process.
Haven't had time to debug this.

{code}
Exception in thread "main" java.lang.IllegalArgumentException: Invalid type '', must be a valid URL.
    at org.apache.any23.extractor.microdata.ItemScope.<init>(ItemScope.java:81)
    at org.apache.any23.extractor.microdata.MicrodataParser.getItemScope(MicrodataParser.java:509)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:196)
    at org.apache.any23.extractor.microdata.MicrodataParser.getMicrodata(MicrodataParser.java:213)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:89)
    at org.apache.any23.extractor.microdata.MicrodataExtractor.run(MicrodataExtractor.java:60)
{code}
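The exception message shows an empty type string reaching a constructor that insists on a valid URL. One lenient alternative, sketched here with hypothetical names and not presented as the fix the project adopted, is to treat a blank or unparsable type as "no type" so the caller can decide what to do with an untyped item.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Optional;

// Hypothetical sketch: validate an item type without throwing on blank or invalid input.
final class ItemTypeParsing {

    /** Returns the type as a URL, or empty if the value is blank, relative, or malformed. */
    static Optional<URL> parseItemType(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            URI uri = new URI(raw.trim());
            // toURL() requires an absolute URI, so check that first.
            return uri.isAbsolute() ? Optional.of(uri.toURL()) : Optional.empty();
        } catch (URISyntaxException | MalformedURLException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parseItemType("").isPresent());
        System.out.println(parseItemType("http://schema.org/Event").isPresent());
    }
}
```

Returning Optional moves the decision to the call site: an extractor could skip the type triple for an untyped item while still emitting its properties.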
[jira] [Created] (ANY23-347) RDFParseException: the prefix "pw" is not bound
Hans Brende created ANY23-347:

Summary: RDFParseException: the prefix "pw" is not bound
Key: ANY23-347
URL: https://issues.apache.org/jira/browse/ANY23-347
Project: Apache Any23
Issue Type: Bug
Components: extractors
Affects Versions: 2.3
Reporter: Hans Brende

I get the following error log for the site: https://69.agendaculturel.fr/concert/
Haven't had time to debug this.

{code}
ERROR org.apache.any23.extractor.rdf.BaseRDFExtractor - Error while parsing RDF document.
org.eclipse.rdf4j.rio.RDFParseException: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:111)
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:95)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:158)
    at org.apache.any23.extractor.rdf.BaseRDFExtractor.run(BaseRDFExtractor.java:57)
    at org.apache.any23.extractor.SingleDocumentExtraction.runExtractor(SingleDocumentExtraction.java:471)
    at org.apache.any23.extractor.SingleDocumentExtraction.run(SingleDocumentExtraction.java:259)
    at org.apache.any23.Any23.extract(Any23.java:302)
    at org.apache.any23.Any23.extract(Any23.java:437)
    at com.utownapp.crawl.tripledb.Triples.lambda$extractTriples$0(Triples.java:146)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.semarglproject.rdf.ParseException: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
    at org.semarglproject.rdf.rdfa.RdfaParser.processException(RdfaParser.java:1141)
    at org.semarglproject.source.XmlSource.process(XmlSource.java:50)
    at org.semarglproject.source.StreamProcessor.processInternal(StreamProcessor.java:87)
    at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:167)
    at org.semarglproject.source.BaseStreamProcessor.process(BaseStreamProcessor.java:154)
    at org.semarglproject.rdf4j.rdf.rdfa.RDF4JRDFaParser.parse(RDF4JRDFaParser.java:109)
    ... 12 more
Caused by: org.xml.sax.SAXParseException; lineNumber: 165; columnNumber: 101; The prefix "pw" for attribute "pw:twitter-via" associated with an element type "div" is not bound.
    at org.apache.xerces.parsers.AbstractSAXParser.parse(Unknown Source)
    at org.semarglproject.source.XmlSource.process(XmlSource.java:48)
    ... 16 more
{code}
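This error is what a namespace-aware XML parser reports when an attribute name uses a prefix that no enclosing element declares with xmlns. The sketch below shows the effect with the JDK's built-in SAX parser rather than the Xerces build in the trace. The one-element documents are invented for illustration, and the namespace URI in the second call is a placeholder.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustration only: an undeclared attribute prefix is fatal when namespace processing is on.
final class UnboundPrefixDemo {

    /** Returns null if the markup parses, otherwise the parser's error message. */
    static String xmlError(String markup, boolean namespaceAware) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(namespaceAware);
            factory.newSAXParser()
                    .parse(new InputSource(new StringReader(markup)), new DefaultHandler());
            return null;
        } catch (SAXParseException e) {
            return e.getMessage();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Prefix "pw" is never declared: rejected by a namespace-aware parser.
        System.out.println(xmlError("<div pw:twitter-via=\"x\"/>", true));
        // Declaring the prefix (placeholder URI) makes the same attribute acceptable.
        System.out.println(xmlError("<div xmlns:pw=\"http://example.org/pw\" pw:twitter-via=\"x\"/>", true));
        // With namespace processing off, the colon is just part of the attribute name.
        System.out.println(xmlError("<div pw:twitter-via=\"x\"/>", false));
    }
}
```

HTML pages commonly carry such prefixed attributes without any declaration, so an XML-based parse of arbitrary web pages can be expected to hit this.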
[jira] [Updated] (ANY23-346) rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
[ https://issues.apache.org/jira/browse/ANY23-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende updated ANY23-346:

Component/s: extractors

> rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
>
> Key: ANY23-346
> URL: https://issues.apache.org/jira/browse/ANY23-346
> Project: Apache Any23
> Issue Type: Bug
> Components: extractors
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Critical
> Fix For: 2.3
>
> The new rdf4j v. 2.3.x's ParsedIRI class does not parse some valid urls
> correctly. See https://github.com/eclipse/rdf4j/issues/1017
> This affects the workings of their entire project. We'll have to switch back
> to version 2.2.4 until this regression is fixed.
[jira] [Resolved] (ANY23-346) rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
[ https://issues.apache.org/jira/browse/ANY23-346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hans Brende resolved ANY23-346.

Resolution: Fixed
Assignee: Hans Brende
Fix Version/s: 2.3

> rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
>
> Key: ANY23-346
> URL: https://issues.apache.org/jira/browse/ANY23-346
> Project: Apache Any23
> Issue Type: Bug
> Components: extractors
> Affects Versions: 2.3
> Reporter: Hans Brende
> Assignee: Hans Brende
> Priority: Critical
> Fix For: 2.3
>
> The new rdf4j v. 2.3.x's ParsedIRI class does not parse some valid urls
> correctly. See https://github.com/eclipse/rdf4j/issues/1017
> This affects the workings of their entire project. We'll have to switch back
> to version 2.2.4 until this regression is fixed.
[jira] [Commented] (ANY23-346) rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
[ https://issues.apache.org/jira/browse/ANY23-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438872#comment-16438872 ]

ASF GitHub Bot commented on ANY23-346:

Github user asfgit closed the pull request at:

    https://github.com/apache/any23/pull/80

> rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
>
> Key: ANY23-346
> URL: https://issues.apache.org/jira/browse/ANY23-346
> Project: Apache Any23
> Issue Type: Bug
> Reporter: Hans Brende
> Priority: Critical
>
> The new rdf4j v. 2.3.x's ParsedIRI class does not parse some valid urls
> correctly. See https://github.com/eclipse/rdf4j/issues/1017
> This affects the workings of their entire project. We'll have to switch back
> to version 2.2.4 until this regression is fixed.
[GitHub] any23 pull request #80: ANY23-346 reverted to rdf4j 2.2.4 due to regression ...
Github user asfgit closed the pull request at:

    https://github.com/apache/any23/pull/80
[GitHub] any23 pull request #80: ANY23-346 reverted to rdf4j 2.2.4 due to regression ...
GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/80

    ANY23-346 reverted to rdf4j 2.2.4 due to regression in 2.3

    See https://github.com/eclipse/rdf4j/issues/1017

    mvn clean test -> all tests pass

    As soon as this regression is fixed in rdf4j, we should revert back, because the ParsedIRI class looks pretty cool and apparently fixes some issues with URI resolving present in the java.net.URI class.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-346

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/80.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #80

commit ca17e7700b82ac69498268ce17c703aa6371ef3b
Author: Hans
Date: 2018-04-15T23:13:40Z

    ANY23-346 reverted to rdf4j 2.2.4 due to regression in 2.3
[jira] [Commented] (ANY23-346) rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
[ https://issues.apache.org/jira/browse/ANY23-346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16438870#comment-16438870 ]

ASF GitHub Bot commented on ANY23-346:

GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/80

    ANY23-346 reverted to rdf4j 2.2.4 due to regression in 2.3

    See https://github.com/eclipse/rdf4j/issues/1017

    mvn clean test -> all tests pass

    As soon as this regression is fixed in rdf4j, we should revert back, because the ParsedIRI class looks pretty cool and apparently fixes some issues with URI resolving present in the java.net.URI class.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-346

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/80.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #80

commit ca17e7700b82ac69498268ce17c703aa6371ef3b
Author: Hans
Date: 2018-04-15T23:13:40Z

    ANY23-346 reverted to rdf4j 2.2.4 due to regression in 2.3

> rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
>
> Key: ANY23-346
> URL: https://issues.apache.org/jira/browse/ANY23-346
> Project: Apache Any23
> Issue Type: Bug
> Reporter: Hans Brende
> Priority: Critical
>
> The new rdf4j v. 2.3.x's ParsedIRI class does not parse some valid urls
> correctly. See https://github.com/eclipse/rdf4j/issues/1017
> This affects the workings of their entire project. We'll have to switch back
> to version 2.2.4 until this regression is fixed.
[jira] [Created] (ANY23-346) rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
Hans Brende created ANY23-346:

Summary: rdf4j version 2.3.x contains a regression: we need to switch back to version 2.2.4
Key: ANY23-346
URL: https://issues.apache.org/jira/browse/ANY23-346
Project: Apache Any23
Issue Type: Bug
Reporter: Hans Brende

The new rdf4j v. 2.3.x's ParsedIRI class does not parse some valid urls correctly. See https://github.com/eclipse/rdf4j/issues/1017
This affects the workings of their entire project. We'll have to switch back to version 2.2.4 until this regression is fixed.