[ 
https://issues.apache.org/jira/browse/ANY23-324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336564#comment-16336564
 ] 

ASF GitHub Bot commented on ANY23-324:
--------------------------------------

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/58
  
    @lewismc And finally, the same test as the last one, except that I changed
    `config.parse(input, url, encoding)` to `config.parse(input, url, null)` to let
    the parser guess the encoding. I also ran the jsoup parser first instead of
    second this time, although the order shouldn't really matter, since I'm omitting
    the first 1000 iterations of each test from the results:
    
    JSOUP RESULT:
    <pre>
    total time jsoup: 439636 ms
    </pre>
    
    NEKOHTML RESULT:
    <pre>
    total time neko:  530170 ms
    </pre>
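
    The methodology described above (run each parser many times, but leave the
    first 1000 iterations out of the total) can be sketched roughly as follows.
    This is not the harness used for the numbers above: the class name
    `WarmupBenchmark`, the method `timeMillis`, and the placeholder workload are
    all hypothetical, and a real run would pass a call to the jsoup- or
    nekohtml-based parser in place of the placeholder.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

public class WarmupBenchmark {

    /**
     * Runs the task (warmup + measured) times and returns the total wall-clock
     * time, in milliseconds, of the measured iterations only.
     */
    public static <T> long timeMillis(Consumer<T> task, T input, int warmup, int measured) {
        // Warm-up iterations: executed but never timed, so that JIT compilation
        // and class loading are not charged to whichever parser runs first.
        for (int i = 0; i < warmup; i++) {
            task.accept(input);
        }
        long totalNanos = 0L;
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            task.accept(input);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000L;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body><p>hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        // Placeholder workload (an assumption): stands in for a real parser call.
        Consumer<byte[]> placeholder =
                bytes -> new String(bytes, StandardCharsets.UTF_8).toLowerCase();
        long ms = timeMillis(placeholder, html, 1000, 5000);
        System.out.println("total time placeholder: " + ms + " ms");
    }
}
```

    Because the warm-up runs are discarded for each parser separately, the
    order in which the two parsers are tested should have little effect on the
    totals. For the figures reported above (439636 ms versus 530170 ms), the
    jsoup total is roughly 17% lower than the nekohtml total.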


> Replace net.sourceforge.nekohtml with jsoup 
> --------------------------------------------
>
>                 Key: ANY23-324
>                 URL: https://issues.apache.org/jira/browse/ANY23-324
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: core
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 2.2
>
>
> A long-standing issue relates to the performance of the existing default 
> [TagSoupParser.java|https://github.com/apache/any23/blob/master/core/src/main/java/org/apache/any23/extractor/html/TagSoupParser.java].
>  There are also a number of issues that now relate to limitations in the way 
> nekohtml parses HTML5, for example 
> [ANY23-317|https://issues.apache.org/jira/browse/ANY23-317], 
> [ANY23-273|https://issues.apache.org/jira/browse/ANY23-273], 
> [ANY23-267|https://issues.apache.org/jira/browse/ANY23-267], and 
> several others.
> I propose to mark the TagSoupParser.java implementation as @Deprecated for the next 
> release (possibly making it configurable via 
> default-configuration.properties). I also propose to replace it with 
> https://jsoup.org/. AFAIK, Apache Tika also did this several years ago.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
