[ 
https://issues.apache.org/jira/browse/TIKA-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952838#comment-17952838
 ] 

Tim Allison commented on TIKA-4419:
-----------------------------------

I grepped through our cc-html corpus for self-closing tags. There's plenty of 
noise. Regexes aren't perfect, nor are the files. I've attached the top 500 
self-closing tags. 

I did a crosswalk with https://www.w3schools.com/tags and came up with a 
list of about 30 tags. I'll run with that on branch TIKA-4419 and see what we 
find.
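
For readers who want to reproduce something similar, here is a rough sketch 
of the count-and-crosswalk step. It is not the script that was actually run: 
the regex, the function names and the KNOWN_HTML_TAGS set are all assumptions 
made for illustration, and the set is a small placeholder rather than the 
list of about 30 tags mentioned above.

```python
import re
from collections import Counter

# Matches tags written in self-closing form, e.g. <br/> or <img src="x" />.
# A regex only approximates HTML parsing, so the counts will contain noise.
SELF_CLOSING = re.compile(r"<([A-Za-z][A-Za-z0-9:-]*)(?:\s[^<>]*?)?/>")

# Hypothetical placeholder subset of standard HTML element names.
KNOWN_HTML_TAGS = {"br", "hr", "img", "input", "meta", "link", "div", "span", "p", "a"}


def count_self_closing(html_docs):
    """Count lower-cased names of tags written in self-closing form."""
    counts = Counter()
    for html in html_docs:
        counts.update(m.group(1).lower() for m in SELF_CLOSING.finditer(html))
    return counts


def crosswalk(counts, known_tags, top_n=500):
    """Keep the top_n most frequent names that also appear in known_tags."""
    return [(tag, n) for tag, n in counts.most_common(top_n) if tag in known_tags]


if __name__ == "__main__":
    docs = ['<p>a<br/>b<br />c</p><img src="x" />', '<div class="a"/><foo/><BR/>']
    for tag, n in crosswalk(count_self_closing(docs), KNOWN_HTML_TAGS):
        print(tag, n)
```

Names that fall outside the known set (custom or malformed tags) are dropped 
by the crosswalk, which is one way to cut down the noise in the raw counts.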

> Deal with self-closeable tags handling change in jsoup 1.20.1
> -------------------------------------------------------------
>
>                 Key: TIKA-4419
>                 URL: https://issues.apache.org/jira/browse/TIKA-4419
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tags-top500.txt
>
>
> On TIKA-4411, [~tilman] found a significant change in behavior for how jsoup 
> 1.20.1 is handling self-closing tags.  We need to figure out how to deal with 
> this in a reasonable way.
>  
> Ref: 
> https://issues.apache.org/jira/browse/TIKA-4411?focusedCommentId=17952615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17952615



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
