[ https://issues.apache.org/jira/browse/TIKA-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952838#comment-17952838 ]
Tim Allison commented on TIKA-4419:
-----------------------------------

I grepped through our cc-html corpus for self-closing tags. There's plenty of noise. Regexes aren't perfect, nor are the files. I've attached the top 500 self-closing tags. I did a crosswalk with https://www.w3schools.com/tags and came up with a list of about 30 tags. I'll run with that on branch TIKA-4419 and see what we find.

> Deal with self-closeable tags handling change in jsoup 1.20.1
> -------------------------------------------------------------
>
>                 Key: TIKA-4419
>                 URL: https://issues.apache.org/jira/browse/TIKA-4419
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tags-top500.txt
>
>
> On TIKA-4411, [~tilman] found a significant change in behavior for how jsoup
> 1.21.1 is handling self-closing tags. We need to figure out how to deal with
> this in a reasonable way.
>
> Ref:
> https://issues.apache.org/jira/browse/TIKA-4411?focusedCommentId=17952615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17952615

--
This message was sent by Atlassian Jira
(v8.20.10#820010)