krickert commented on code in PR #1111:
URL: https://github.com/apache/opennlp/pull/1111#discussion_r3523005066

##########
opennlp-core/opennlp-runtime/src/main/java/opennlp/tools/util/normalizer/Term.java:
##########
@@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package opennlp.tools.util.normalizer;
+
+import java.util.EnumMap;
+import java.util.List;
+
+import opennlp.tools.util.Span;
+
+/**
+ * One token as a stack of normalization layers. The {@link #original()} form is the canonical
+ * source of truth; the other layers are derived, increasingly aggressive {@link Dimension}s tuned
+ * for matching and search. The dimensions configured on the producing {@link TermAnalyzer} are
+ * computed eagerly and cached; any other dimension is computed on first request, applied on top of
+ * the {@link #normalized() configured form}, and then cached.
+ *
+ * <p>Because the original is always retained, aggressive folding is safe: a match on a derived layer
+ * can always be reported in original coordinates through {@link #span()}. Querying a configured
+ * layer, or {@link #peel() peeling} the last-applied one, is O(1); adding an unconfigured dimension
+ * costs one transform on first touch and is O(1) thereafter.</p>
+ *
+ * <p>Instances are created by {@link TermAnalyzer} and are not thread-safe (the lazy cache is

Review Comment:
Quick clarification: the `Term` itself is now thread-safe, since the lazy layer cache is a `ConcurrentHashMap` with `putIfAbsent`. The only piece that isn't is the Snowball stemmer, which is stateful by construction and can't be shared across threads. I'll keep this PR scoped to the layered model and enhance the stemmer separately. It isn't a problem in practice: it's one analyzer per thread, and `NormalizationProfile.matchingAnalyzer()` returns a new stemmer on each call. I'll do that enhancement in a separate ticket once the full 1850 epic is done.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
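
For readers following the thread-safety point in the review comment, here is a minimal sketch of the two ideas it mentions: a lazy layer cache backed by a `ConcurrentHashMap` with `putIfAbsent`, and one stateful helper per thread. This is not the code from the PR. The class `LayeredTermSketch`, its `layer` and `perThread` methods, and the string-keyed dimensions are all hypothetical names chosen for illustration, and the sketch applies each transform to the original form rather than to a configured form.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for a layered term; not the Term class from the PR.
final class LayeredTermSketch {
  private final String original;
  // Lazily filled cache of derived layers, keyed by a dimension name.
  private final ConcurrentMap<String, String> layers = new ConcurrentHashMap<>();

  LayeredTermSketch(String original) {
    this.original = original;
  }

  String original() {
    return original;
  }

  // Returns the cached layer, computing it on first request.
  // The transform must return a non-null value (ConcurrentHashMap rejects nulls).
  String layer(String dimension, UnaryOperator<String> transform) {
    String cached = layers.get(dimension);
    if (cached != null) {
      return cached;
    }
    String computed = transform.apply(original);
    // Under contention the transform may run more than once, but putIfAbsent
    // guarantees that every caller ends up with the same cached value.
    String raced = layers.putIfAbsent(dimension, computed);
    return raced != null ? raced : computed;
  }

  // One instance of a stateful helper (for example a stemmer) per thread,
  // created by the given factory on that thread's first access.
  static <T> ThreadLocal<T> perThread(Supplier<T> factory) {
    return ThreadLocal.withInitial(factory);
  }
}
```

With this shape, a second call to `layer` for the same dimension returns the cached value without running the transform again, and a stateful component obtained through `perThread` is never shared between threads.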
