[ 
https://issues.apache.org/jira/browse/TIKA-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083829#comment-18083829
 ] 

ASF GitHub Bot commented on TIKA-4731:
--------------------------------------

tballison commented on code in PR #2839:
URL: https://github.com/apache/tika/pull/2839#discussion_r3310812007


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-integration-tests/src/test/java/org/apache/tika/parser/pkg/PackageParserTest.java:
##########
@@ -33,6 +34,9 @@ public void handleNonUnicodeEntryName() throws Exception {
     }
 
     @Test
+    @Disabled("TIKA-4731: tiny SJIS filenames are not reliably detected after 
removal "
+            + "of the cyclic-repeat hack in ZipParser. Re-enable when 
zip-entry-name "
+            + "detection is fixed (separate from chain rework).")
     public void handleEntryNameWithCharsetShiftJIS() throws Exception {

Review Comment:
   Could improve comment, but this will be fixed when mojibuster+junk detector 
become the default.





> Ongoing improvements to the junk detector
> -----------------------------------------
>
>                 Key: TIKA-4731
>                 URL: https://issues.apache.org/jira/browse/TIKA-4731
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Minor
>
> With [https://github.com/apache/tika/pull/2818,] I think we have a decent 
> shape for the junk detector. 
> There are several areas for improvement, but I think it is ready to go.
> This ticket tracks follow on work, including:
>  * Smaller model
>  * Handling pathological code block changes
>  * Handling candidates with different character counts
>  * Other items to be discovered in our commoncrawl/govdocs1 corpus?
> We have some coverage for the middle two item, but need to address those more 
> directly.
> This work is not a blocker on the 4.0.0-beta-1 release.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to