[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467623#comment-16467623 ]

Oliver Lietz commented on SLING-6783:

[~jebailey], tests are fine on my local machine and on [Jenkins|https://builds.apache.org/view/S-Z/view/Sling/job/sling-org-apache-sling-commons-html-1.8/30/console]:

{noformat}
[INFO] --- maven-failsafe-plugin:2.20.1:integration-test (default) @ org.apache.sling.commons.html ---
[INFO]
[INFO] ---
[INFO]  T E S T S
[INFO] ---
[INFO] Running org.apache.sling.commons.html.it.TagsoupHtmlParserIT
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.485 s - in org.apache.sling.commons.html.it.TagsoupHtmlParserIT
[INFO]
[INFO] Results:
[INFO]
[INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0
[INFO]
[JENKINS] Recording test results
{noformat}

Can you set the {{timeout}} parameter on the {{Filter}} annotation and see if it fixes your issue?

{noformat}
@Inject
@Filter(value = "(&(dom=tagsoup)(sax=tagsoup))")
private HtmlParser htmlParser;
{noformat}

> Updates for Commons HTML
>
> Key: SLING-6783
> URL: https://issues.apache.org/jira/browse/SLING-6783
> Project: Sling
> Issue Type: Improvement
> Components: Commons
> Reporter: Jason E Bailey
> Assignee: Oliver Lietz
> Priority: Minor
> Fix For: Commons HTML 1.0.2
>
> Attachments: sling.patch
>
> Following updates:
> Updated tagsoup lib to 1.2.1 which has the following modifications
> * DOCTYPE is now recognized even in lower case.
> * We make sure to buffer the reader, eliminating a long-standing bug that would crash on certain inputs, such as & followed by CR+LF.
> * The HTML scanner's table is precompiled at run time for efficiency, causing a 4x speedup on large input documents.
> * ]] within a CDATA section no longer causes input to be discarded.
> * Remove bogus newline after printing children of the root element.
> * Allow the noscript element anywhere, the same as the script element.
> * Updated to the 2011 edition of the W3C character entity list.
> Additionally:
> Updated license with new home page for tagsoup
> Updated annotations to OSGi annotations
> Added the ability to specify additional features/properties for the parser
> Documented available settings
> Javadoc fixed
> Prepared for different parsers by renaming HtmlParserImpl and adding component properties
> Configuration improved

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467505#comment-16467505 ]

Jason E Bailey commented on SLING-6783:

[~olli] I'm running into problems with the paxexam test.

[INFO] Results:
[INFO]
[ERROR] Errors:
[ERROR] TagsoupHtmlParserIT.testFeaturesConfiguration » IllegalState services vanished...
[ERROR] TagsoupHtmlParserIT.testHtmlParser » IllegalState services vanished too fast.
[INFO]
[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0

Any idea? I'm not able to deploy.
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466408#comment-16466408 ]

Oliver Lietz commented on SLING-6783:

[~jebailey], no – go ahead! And let's discuss modernization of Commons HTML and Rewriter at dev@.
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466388#comment-16466388 ]

Jason E Bailey commented on SLING-6783:

[~olli] do you want to do the release on this?
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16465236#comment-16465236 ]

Jason E Bailey commented on SLING-6783:

[~olli] When I first ran into issues with this and HTML5 support, I looked around and found some forks of TagSoup that support HTML5. That's an option. Last I checked, jsoup was working on a SAX interface, but I don't know the status of that. Changing the API and creating a bridge for it into the Rewriter would be useful.

It might also be time to take a look at the Rewriter and potentially rewrite it. Honestly, that would be my preferred option: look at doing an event-based rewriting flow.
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16463833#comment-16463833 ]

Oliver Lietz commented on SLING-6783:

[~jebailey], [~klcodanr], I guess we have to change the API of Commons HTML (used in Rewriter) and get rid of the SAX API in order to use a different parser for HTML5. I tried to plug in [AttoParser|https://www.attoparser.org] and [jsoup|https://jsoup.org], but neither fits properly. WDYT?
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16462390#comment-16462390 ]

Jason E Bailey commented on SLING-6783:

We should either support them or at least document what is and isn't supported from a features perspective. At this point I would just say documentation; I'm much more interested in finding a way to make this HTML5 compliant than in features that no one has asked for yet.
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422057#comment-16422057 ]

Oliver Lietz commented on SLING-6783:

Do we want to support additional parser properties beside {{lexical-handler}}?
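
For readers unfamiliar with the {{lexical-handler}} property mentioned in this comment: it is the standard SAX2 property used to register a handler for comments, CDATA boundaries, and DOCTYPE events. The following is only a minimal sketch of how such a property is set on a SAX {{XMLReader}}. It uses the JDK's built-in XML parser rather than TagSoup, and the class and method names ({{LexicalHandlerSketch}}, {{collectComments}}) are hypothetical; it does not show how Commons HTML wires the property internally.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

// Hypothetical illustration class, not part of Commons HTML.
public class LexicalHandlerSketch {

    // Standard SAX2 property id for registering a LexicalHandler.
    static final String LEXICAL_HANDLER = "http://xml.org/sax/properties/lexical-handler";

    // Parses the given markup with the JDK parser and returns the text of all comments.
    static List<String> collectComments(String xml) throws Exception {
        List<String> comments = new ArrayList<>();

        // DefaultHandler2 implements both ContentHandler and LexicalHandler.
        DefaultHandler2 handler = new DefaultHandler2() {
            @Override
            public void comment(char[] ch, int start, int length) {
                comments.add(new String(ch, start, length).trim());
            }
        };

        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        reader.setContentHandler(handler);
        // Without this property the parser never reports comment events.
        reader.setProperty(LEXICAL_HANDLER, handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return comments;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(collectComments("<html><!-- hello --><body/></html>"));
    }
}
```

Other SAX properties and features are set the same way through {{setProperty}} and {{setFeature}}, which is why the question of which ones to support or document comes up here.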
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16422055#comment-16422055 ]

Oliver Lietz commented on SLING-6783:

[~jebailey], fixed the features/properties mess also.
[jira] [Commented] (SLING-6783) Updates for Commons HTML
[ https://issues.apache.org/jira/browse/SLING-6783?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16421804#comment-16421804 ]

Oliver Lietz commented on SLING-6783:

[~rombert], [~jebailey] NPE fixed. Please ensure tests are in place when changing existing functionality or adding new.