[ 
https://issues.apache.org/jira/browse/TIKA-4419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17952838#comment-17952838
 ] 

Tim Allison edited comment on TIKA-4419 at 5/20/25 12:21 PM:
-------------------------------------------------------------

I grepped through our cc-html corpus for self-closing tags. There's plenty of 
noise. Regexes aren't perfect, nor are the files. I've attached the top 500 
self-closing tags. 
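
For readers who want to reproduce a tally like this, the following is a minimal sketch in Java, not the command actually run against the corpus. The class name `SelfClosingTagTally`, the regex, and the helper methods are assumptions made for illustration. As noted above, a regex is only an approximation here: attribute values that contain `>` or `<` will confuse it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: counts tags written in self-closing form, e.g. <br/> or <img src="x" />.
class SelfClosingTagTally {

    // Loose match for "<name ... />". Not a real HTML parser; expect noise on messy input.
    private static final Pattern SELF_CLOSING =
            Pattern.compile("<([A-Za-z][A-Za-z0-9:-]*)(?:\\s[^<>]*)?/>");

    // Returns lower-cased tag name -> number of self-closing occurrences in the given text.
    static Map<String, Integer> tally(String html) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = SELF_CLOSING.matcher(html);
        while (m.find()) {
            counts.merge(m.group(1).toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        return counts;
    }

    // Returns the n most frequent tags, highest count first, ties broken by tag name.
    static List<Map.Entry<String, Integer>> top(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    public static void main(String[] args) {
        String sample = "<p>a<br/>b<BR />c<img src=\"x.png\"/><custom-tag/></p>";
        for (Map.Entry<String, Integer> e : top(tally(sample), 500)) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

Running `tally` over each file and merging the per-file maps would give a corpus-wide count, from which `top(counts, 500)` yields a list comparable in shape to the attachment.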

I did a crosswalk with [https://www.w3schools.com/tags] and came up with a 
list of about 30 tags. I'll run with that on branch 
[branch_3x-TIKA-4419-v2|https://github.com/apache/tika/tree/branch_3x-TIKA-4419-v2]
 and see what we find.
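
The crosswalk step can be pictured as an intersection between the tags observed in self-closing form and a reference list of known HTML element names. The sketch below assumes a hypothetical class `TagCrosswalk` and uses a tiny illustrative reference set; it is not the list of about 30 tags mentioned above, and it does not reflect how the branch implements the change.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: keeps only observed tags that also appear in a reference list.
class TagCrosswalk {

    // Returns the observed tags (lower-cased, in their original order, without duplicates)
    // that are present in the reference set of known element names.
    static List<String> crosswalk(Collection<String> observed, Set<String> known) {
        List<String> kept = new ArrayList<>();
        Set<String> seen = new HashSet<>();
        for (String tag : observed) {
            String name = tag.toLowerCase(Locale.ROOT);
            if (known.contains(name) && seen.add(name)) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Illustrative subset of real HTML element names; a full reference list would be longer.
        Set<String> known = new HashSet<>(Arrays.asList("br", "img", "div", "span", "p", "a"));
        // Illustrative observed tags, including noise that a regex pass might pick up.
        List<String> observed = Arrays.asList("br", "IMG", "o:p", "div", "foo", "br");
        System.out.println(crosswalk(observed, known));
    }
}
```

Anything that falls out of the intersection (such as namespaced or made-up tag names) is the kind of noise a frequency list alone cannot separate from real elements.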

If anyone has any recs for handling this more elegantly, please let me know.


was (Author: talli...@mitre.org):
I grepped through our cc-html corpus for self-closing tags. There's plenty of 
noise. Regexes aren't perfect, nor are the files. I've attached the top 500 
self-closing tags. 

I did a crosswalk with [https://www.w3schools.com/tags] and came up with a 
list of about 30 tags. I'll run with that on branch TIKA-4419 and see what we 
find.

> Deal with self-closeable tags handling change in jsoup 1.20.1
> -------------------------------------------------------------
>
>                 Key: TIKA-4419
>                 URL: https://issues.apache.org/jira/browse/TIKA-4419
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>         Attachments: tags-top500.txt
>
>
> On TIKA-4411, [~tilman] found a significant change in behavior for how jsoup 
> 1.20.1 is handling self-closing tags.  We need to figure out how to deal with 
> this in a reasonable way.
>  
> Ref: 
> https://issues.apache.org/jira/browse/TIKA-4411?focusedCommentId=17952615&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17952615



