[ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16935437#comment-16935437
 ] 

Hudson commented on ANY23-443:
------------------------------

SUCCESS: Integrated in Jenkins build Any23-trunk #1667 (See 
[https://builds.apache.org/job/Any23-trunk/1667/])
ANY23-443 improve speed & stability of RDFa extractors (hans: rev 50cfb2fd7f3112e27c44ab5850117bacda22a679)
* (edit) core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/BaseRDFaExtractor.java
* (add) core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
* (add) core/src/main/java/org/apache/any23/extractor/rdfa/SemarglSink.java
ANY23-443 cleanup (hans: rev d9f1fa4036133158b1a91976d9d05d152c02feaa)
* (edit) core/src/test/java/org/apache/any23/extractor/rdfa/RDFa11ExtractorTest.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFa11Extractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/BaseRDFaExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/SemarglSink.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/RDFaExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdf/BaseRDFExtractor.java
* (edit) core/src/main/java/org/apache/any23/extractor/rdfa/JsoupScanner.java


> Improve efficiency of RDFa Extractor
> ------------------------------------
>
>                 Key: ANY23-443
>                 URL: https://issues.apache.org/jira/browse/ANY23-443
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Hans Brende
>            Priority: Major
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> Our RDFa Extractor is terribly inefficient:
> 1st, we parse the HTML "tag soup" input stream into a DOM using Jsoup.
> 2nd, we serialize the DOM back into an input stream containing strictly 
> valid XML, to avoid errors in the underlying semargl parser.
> 3rd, the underlying semargl parser re-parses this input stream as XML and 
> hands off XML streaming events to its underlying XmlSink.
> 4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
> hands them back to Any23.
> I propose cutting out all these intermediate steps by simply walking the 
> original jsoup DOM and handing our own XML events directly to semargl's 
> XmlSink, which we will configure to give RDF events directly back to Any23. 
> This will also allow us to get rid of most (or possibly all) of the various 
> HTML-to-XML "fixups" we had to implement to prevent extraction failures.
> ----
> *TL;DR:*
>  
> {{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 
> *becomes*
> {{Jsoup → RdfaParser → Any23}} 
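
The idea in the description (walk an already-parsed DOM and push XML events
straight into a sink, with no re-serialization and no second parse) can be
sketched roughly as below. This is only an illustration under stated
assumptions, not the code from the commits above: it uses the JDK's
org.w3c.dom tree as a stand-in for the jsoup DOM and a plain SAX
ContentHandler as a stand-in for semargl's XmlSink. The class name
DomToSaxWalker and the demo() helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: the JDK DOM stands in for the jsoup DOM, and a
// recording ContentHandler stands in for semargl's XmlSink.
class DomToSaxWalker {

    // Walks the tree depth-first and hands SAX events directly to the sink.
    static void walk(Node node, ContentHandler sink) throws SAXException {
        switch (node.getNodeType()) {
            case Node.ELEMENT_NODE: {
                // Copy the element's attributes into a SAX Attributes object.
                AttributesImpl attrs = new AttributesImpl();
                NamedNodeMap map = node.getAttributes();
                for (int i = 0; i < map.getLength(); i++) {
                    Node a = map.item(i);
                    attrs.addAttribute("", a.getNodeName(), a.getNodeName(),
                            "CDATA", a.getNodeValue());
                }
                String name = node.getNodeName();
                sink.startElement("", name, name, attrs);
                for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
                    walk(c, sink);
                }
                sink.endElement("", name, name);
                break;
            }
            case Node.TEXT_NODE: {
                char[] text = node.getNodeValue().toCharArray();
                sink.characters(text, 0, text.length);
                break;
            }
            default:
                break; // comments, processing instructions, etc. are skipped here
        }
    }

    // Builds a tiny RDFa-like tree in memory and records the events it produces.
    static List<String> demo() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element div = doc.createElement("div");
            div.setAttribute("vocab", "http://schema.org/");
            Element span = doc.createElement("span");
            span.setAttribute("property", "name");
            span.appendChild(doc.createTextNode("Any23"));
            div.appendChild(span);
            doc.appendChild(div);

            List<String> events = new ArrayList<>();
            ContentHandler recorder = new DefaultHandler() {
                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    events.add("start:" + qName);
                }
                @Override
                public void endElement(String uri, String local, String qName) {
                    events.add("end:" + qName);
                }
                @Override
                public void characters(char[] ch, int start, int length) {
                    events.add("text:" + new String(ch, start, length));
                }
            };
            recorder.startDocument();
            walk(doc.getDocumentElement(), recorder);
            recorder.endDocument();
            return events;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

In this sketch the events reach the handler in document order without any
intermediate InputStream, which is the shortcut the TL;DR above describes.
A real implementation would additionally have to deal with namespaces,
base URIs and malformed attribute names, none of which is shown here.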



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
