[
https://issues.apache.org/jira/browse/TIKA-4731?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083828#comment-18083828
]
ASF GitHub Bot commented on TIKA-4731:
--------------------------------------
tballison commented on code in PR #2839:
URL: https://github.com/apache/tika/pull/2839#discussion_r3310808816
##########
tika-encoding-detectors/tika-encoding-detector-mojibuster/src/main/java/org/apache/tika/ml/chardetect/AdaptiveProbe.java:
##########
@@ -0,0 +1,80 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.tika.ml.chardetect;
+
+import java.io.IOException;
+import java.util.Arrays;
+
+import org.apache.commons.io.IOUtils;
+
+import org.apache.tika.io.TikaInputStream;
+
+/**
+ * Reads an encoding-detection probe sized by <em>content</em>, not raw bytes.
+ *
+ * <p>A fixed raw probe (e.g. 16 KB) starves the detectors when a page
+ * leads with a large {@code <head>}/inline script: after tag-stripping there's
+ * little body text, and the bytes that distinguish charsets sit past the
+ * window. This grows the read until ~{@code contentTarget} bytes of
+ * tag-stripped content are present, capped at {@code rawCap} raw bytes.
+ *
+ * <p>Text-rich pages stop early (~one chunk); markup-heavy pages read deeper,
+ * bounded by the cap. Multi-byte encodings (UTF-16/32) register no ASCII tags
+ * so they stop at {@code contentTarget} raw bytes.
+ */
+public final class AdaptiveProbe {
+
+ /** Default body-content target. */
+ public static final int DEFAULT_CONTENT_TARGET = 16384;
+ /** Default hard ceiling on raw bytes read. */
+ public static final int DEFAULT_RAW_CAP = 524288;
+
+ private AdaptiveProbe() {
+ }
+
+ /**
+ * Reads from {@code tis} (mark/reset preserved) until tag-stripped content
+ * reaches {@code contentTarget}, the raw read reaches {@code rawCap}, or
+ * EOF — whichever first. Returns the raw bytes read.
+ */
+ public static byte[] read(TikaInputStream tis, int contentTarget, int
rawCap)
+ throws IOException {
+ tis.mark(rawCap);
+ try {
+ byte[] buf = new byte[rawCap];
+ byte[] stripDst = new byte[rawCap];
+ int total = 0;
+ while (total < rawCap) {
+ int want = Math.min(rawCap - total, contentTarget);
+ int n = IOUtils.read(tis, buf, total, want);
+ total += n;
+ HtmlByteStripper.Result r =
+ HtmlByteStripper.stripTags(buf, 0, total, stripDst, 0);
+ int content = r.tagCount > 0 ? r.length : total;
+ if (content >= contentTarget || n < want) {
+ break; // enough body text, or EOF
+ }
+ }
Review Comment:
IOUtils.read returns [0, length] never -1.
> Ongoing improvements to the junk detector
> -----------------------------------------
>
> Key: TIKA-4731
> URL: https://issues.apache.org/jira/browse/TIKA-4731
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
>
> With [https://github.com/apache/tika/pull/2818,] I think we have a decent
> shape for the junk detector.
> There are several areas for improvement, but I think it is ready to go.
> This ticket tracks follow on work, including:
> * Smaller model
> * Handling pathological code block changes
> * Handling candidates with different character counts
> * Other items to be discovered in our commoncrawl/govdocs1 corpus?
> We have some coverage for the middle two item, but need to address those more
> directly.
> This work is not a blocker on the 4.0.0-beta-1 release.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)