[jira] [Updated] (ANY23-443) Improve efficiency of RDFa Extractor

Hans Brende (Jira) Sat, 14 Sep 2019 20:04:39 -0700


     [ 
https://issues.apache.org/jira/browse/ANY23-443?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Hans Brende updated ANY23-443:
------------------------------
    Description: 
Our RDFa Extractor is terribly inefficient: it parses every document twice 
and serializes it once in between.

1st, we parse the HTML "tag soup" input stream into a DOM using Jsoup.
2nd, we serialize the DOM back into an input stream containing strictly valid 
XML, to avoid errors in the underlying semargl parser.
3rd, the underlying semargl parser re-parses this input stream as XML and 
hands off XML streaming events to its underlying XmlSink.
4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
hands them back to Any23.
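
For illustration, here is a minimal JDK-only sketch of the parse, serialize, 
re-parse round trip in steps 1 to 3. It is not Any23 code: the JDK's 
DocumentBuilder stands in for Jsoup, a plain SAX handler stands in for 
semargl's XmlSink, and the names (RoundTripSketch, roundTrip, Recorder) are 
made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RoundTripSketch {

    /** Records element events; a stand-in for the sink at the end of the chain. */
    static final class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            events.add("start:" + qName);
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Parses, serializes, and re-parses the markup, returning the final SAX events. */
    public static List<String> roundTrip(String markup) {
        try {
            // Step 1 (stand-in for Jsoup): parse the markup into a DOM.
            Document dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));

            // Step 2: serialize the whole DOM back into a byte stream.
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            TransformerFactory.newInstance().newTransformer()
                    .transform(new DOMSource(dom), new StreamResult(buffer));

            // Step 3: parse the same content a second time to obtain SAX events.
            Recorder recorder = new Recorder();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(buffer.toByteArray()), recorder);
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("round trip failed", e);
        }
    }
}
```

The document is tokenized twice and fully buffered once before the sink sees 
a single event, and those are the passes this issue is about.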

I propose cutting out all these intermediate steps by simply walking the 
original jsoup DOM and handing our own XML events directly to semargl's 
XmlSink, which we will configure to give RDF events directly back to Any23. 
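
A rough sketch of what such a direct walk could look like, under some loud 
assumptions: the JDK's org.w3c.dom tree stands in for the Jsoup DOM, and a 
plain SAX ContentHandler stands in for semargl's XmlSink (which, as far as I 
can tell, is itself a SAX ContentHandler). Namespaces are ignored for 
brevity, which a real RDFa walk could not get away with. All names 
(DomWalkSketch, emit, walk, Recorder) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class DomWalkSketch {

    /** Depth-first walk that turns DOM nodes straight into SAX events. */
    static void emit(Node node, ContentHandler sink) throws SAXException {
        if (node.getNodeType() == Node.DOCUMENT_NODE) {
            sink.startDocument();
            emitChildren(node, sink);
            sink.endDocument();
        } else if (node.getNodeType() == Node.ELEMENT_NODE) {
            String name = node.getNodeName();
            AttributesImpl atts = new AttributesImpl();
            NamedNodeMap map = node.getAttributes();
            for (int i = 0; i < map.getLength(); i++) {
                Node a = map.item(i);
                // Simplification: empty namespace URI, qualified name reused as local name.
                atts.addAttribute("", a.getNodeName(), a.getNodeName(), "CDATA", a.getNodeValue());
            }
            sink.startElement("", name, name, atts);
            emitChildren(node, sink);
            sink.endElement("", name, name);
        } else if (node.getNodeType() == Node.TEXT_NODE) {
            char[] text = node.getNodeValue().toCharArray();
            sink.characters(text, 0, text.length);
        }
        // Comments, processing instructions, etc. are skipped in this sketch.
    }

    private static void emitChildren(Node parent, ContentHandler sink) throws SAXException {
        for (Node c = parent.getFirstChild(); c != null; c = c.getNextSibling()) {
            emit(c, sink);
        }
    }

    /** Records events as strings; a stand-in for the real sink. */
    static final class Recorder extends DefaultHandler {
        final List<String> events = new ArrayList<>();

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            StringBuilder sb = new StringBuilder("start:").append(qName);
            for (int i = 0; i < atts.getLength(); i++) {
                sb.append(' ').append(atts.getQName(i)).append('=').append(atts.getValue(i));
            }
            events.add(sb.toString());
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            events.add("text:" + new String(ch, start, length));
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            events.add("end:" + qName);
        }
    }

    /** Builds a DOM once (stand-in for Jsoup) and walks it; no serialization, no second parse. */
    public static List<String> walk(String markup) {
        try {
            Node dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            Recorder recorder = new Recorder();
            emit(dom, recorder);
            return recorder.events;
        } catch (Exception e) {
            throw new IllegalStateException("DOM walk failed", e);
        }
    }
}
```

The point of the sketch is only that the sink receives its events from an 
in-memory tree, so nothing has to be serialized to "strictly valid XML" and 
parsed again along the way.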

This will also allow us to get rid of most (or possibly all) of the various 
HTML-to-XML "fixups" we had to implement to prevent extraction failures.

----

*TL;DR:*
 
{{Jsoup → InputStream → RDF4J → XMLReader → RdfaParser → RDF4J → Any23}} 

*becomes*

{{Jsoup → RdfaParser → Any23}} 


  was:
Our RDFa Extractor is terribly inefficient. 

1st, we parse the html "tag soup" input stream into a DOM using Jsoup
2nd, we transform the DOM back into an input stream, containing strictly valid 
XML to avoid errors in the underlying semargl parser
3rd, the underlying semargl parser resurrects this input stream as XML and 
hands off XML streaming events to its underlying XmlSink. 
4th, semargl's XmlSink hands its own RDF events back to RDF4J, which in turn 
hands them back to Any23. 

I propose cutting out all these intermediate steps by simply walking the 
original jsoup DOM and handing our own XML events directly to semargl's 
XmlSink, which we will configure to give RDF events directly back to Any23. 

This will also allow us to get rid of most (or possibly all) of the various 
HTML-to-XML "fixups" we had to implement to prevent extraction failures.


> Improve efficiency of RDFa Extractor
> ------------------------------------
>
>                 Key: ANY23-443
>                 URL: https://issues.apache.org/jira/browse/ANY23-443
>             Project: Apache Any23
>          Issue Type: Improvement
>            Reporter: Hans Brende
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
