[ https://issues.apache.org/jira/browse/TIKA-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ken Krugler resolved TIKA-2539.
-------------------------------
Resolution: Duplicate
> TagSoup HTML parser is project EOL
> ----------------------------------
>
> Key: TIKA-2539
> URL: https://issues.apache.org/jira/browse/TIKA-2539
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.16, 1.17
> Environment: All
> Reporter: Richard Jones
>
> The TagSoup HTML parser project is end-of-life (EOL); its last update was the
> 1.2.1 release (the version Tika references) back in Aug 2011.
> I cannot find any TagSoup forks that are still active, but there are many
> alternative HTML parsers out there (perhaps better ones, if you believe the
> reviews and Wikipedia comparisons).
> Perhaps the most active is jsoup, which Tika already pulls in as a transitive
> dependency of edu.ucar:grib; it has over 1,000 usages and updates as recent
> as a few months ago:
> https://mvnrepository.com/artifact/org.jsoup/jsoup
> https://jsoup.org/
> Requesting consideration of moving away from the long-EOL'd TagSoup to an
> active, modern HTML parser such as jsoup, which is already a transitive Tika
> dependency.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)