Tim Allison created TIKA-4731:
---------------------------------

             Summary: Ongoing improvements to the junk detector
                 Key: TIKA-4731
                 URL: https://issues.apache.org/jira/browse/TIKA-4731
             Project: Tika
          Issue Type: Task
            Reporter: Tim Allison


With https://github.com/apache/tika/pull/2818, I think we have a decent shape
for the junk detector.

There are several areas for improvement, but I think it is ready to go.

This ticket tracks follow-on work, including:
 * Smaller model
 * Handling pathological code block changes
 * Handling candidates with different character counts
 * Other items to be discovered in our commoncrawl/govdocs1 corpus?

We have some coverage for the middle two items, but we need to address them
more directly.

This work is not a blocker on the 4.0.0-beta-1 release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
