Tim Allison created TIKA-4731:
---------------------------------
Summary: Ongoing improvements to the junk detector
Key: TIKA-4731
URL: https://issues.apache.org/jira/browse/TIKA-4731
Project: Tika
Issue Type: Task
Reporter: Tim Allison
With https://github.com/apache/tika/pull/2818, I think we have a decent shape
for the junk detector.
There are several areas for improvement, but I think it is ready to go.
This ticket tracks follow-on work, including:
* Smaller model
* Handling pathological code block changes
* Handling candidates with different character counts
* Other items to be discovered in our commoncrawl/govdocs1 corpus?
We have some coverage for the middle two items, but we need to address them
more directly.
This work is not a blocker on the 4.0.0-beta-1 release.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)