nfsantos commented on code in PR #923:
URL: https://github.com/apache/jackrabbit-oak/pull/923#discussion_r1194912202


##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -160,21 +175,67 @@ private static <FD> LinkedHashMap<String, FD> 
loadFilters(NodeState state,
         Tree tree = TreeFactory.createReadOnlyTree(state);
         for (Tree t : tree.getChildren()) {
             NodeState child = state.getChildNode(t.getName());
-            Class<? extends AbstractAnalysisFactory> tff = 
lookup.apply(t.getName());
+
             String name;
+            List<String> content = null;
+            List<Function<Map<String, Object>, Map<String, Object>>> 
transformers;

Review Comment:
   Same suggestion, use a functional interface instead of Function<>



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -81,16 +88,24 @@ public class ElasticCustomAnalyzer {
     );
 
     /*
-     * Mappings for lucene options not available anymore to supported elastic 
counterparts
+     * In some cases lucene cannot be used for parsing. This map contains 
pluggable transformers to execute on specific
+     * filter keys.
      */
-    private static final Map<Class<? extends AbstractAnalysisFactory>, 
Map<String, String>> CONFIGURATION_MAPPING;
+    private static final Map<String, Function<String, Stream<String>>> 
CONTENT_TRANSFORMERS;
 
     static {
-        CONFIGURATION_MAPPING = new LinkedHashMap<>();
-        CONFIGURATION_MAPPING.put(AbstractWordsFileFilterFactory.class, 
Map.of("words", "stopwords"));
-        CONFIGURATION_MAPPING.put(MappingCharFilterFactory.class, 
Map.of("mapping", "mappings"));
+        CONTENT_TRANSFORMERS = new LinkedHashMap<>();
+        CONTENT_TRANSFORMERS.put("mapping", line -> {
+            if (line.length() == 0 || line.startsWith("#")) {
+                return Stream.empty();
+            } else {
+                return Stream.of(line.replaceAll("\"", ""));
+            }
+        });

Review Comment:
   Use an immutable map instead.



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -81,16 +88,24 @@ public class ElasticCustomAnalyzer {
     );
 
     /*
-     * Mappings for lucene options not available anymore to supported elastic 
counterparts
+     * In some cases lucene cannot be used for parsing. This map contains 
pluggable transformers to execute on specific
+     * filter keys.
      */
-    private static final Map<Class<? extends AbstractAnalysisFactory>, 
Map<String, String>> CONFIGURATION_MAPPING;
+    private static final Map<String, Function<String, Stream<String>>> 
CONTENT_TRANSFORMERS;

Review Comment:
   Suggestion to improve readability: define a SAM interface with descriptive 
names for the interface and the method, and use it instead of the definition 
`Function<>`  



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -160,21 +175,67 @@ private static <FD> LinkedHashMap<String, FD> 
loadFilters(NodeState state,
         Tree tree = TreeFactory.createReadOnlyTree(state);
         for (Tree t : tree.getChildren()) {
             NodeState child = state.getChildNode(t.getName());
-            Class<? extends AbstractAnalysisFactory> tff = 
lookup.apply(t.getName());
+
             String name;
+            List<String> content = null;
+            List<Function<Map<String, Object>, Map<String, Object>>> 
transformers;
             try {
+                Class<? extends AbstractAnalysisFactory> tff = 
lookup.apply(t.getName());
+
+                Map<String, String> changes =
+                        LUCENE_VERSIONS.entrySet().stream()
+                                .filter(k -> k.getKey().isAssignableFrom(tff))
+                                .map(Map.Entry::getValue)
+                                .findFirst().orElseGet(Collections::emptyMap);
+                Map<String, String> luceneArgs = 
StreamSupport.stream(child.getProperties().spliterator(), false)
+                        .filter(ps -> {
+                            String value = changes.get(ps.getName());
+                            return value == null || value.length() > 0;
+                        })
+                        .collect(Collectors.toMap(PropertyState::getName, ps 
-> ps.getValue(Type.STRING)));
+
+                AbstractAnalysisFactory luceneFactory = 
tff.getConstructor(Map.class).newInstance(luceneArgs);
+                if (luceneFactory instanceof AbstractWordsFileFilterFactory) {
+                    AbstractWordsFileFilterFactory wordsFF = 
((AbstractWordsFileFilterFactory) luceneFactory);
+                    // this will parse/load the content handling different 
formats, comments, etc
+                    wordsFF.inform(new NodeStateResourceLoader(child));
+                    content = wordsFF.getWords().stream().map(w ->
+                            new String(new String(((char[]) 
w)).getBytes(StandardCharsets.UTF_8))
+                    ).collect(Collectors.toList());
+                }
+
                 name = normalize((String) tff.getField("NAME").get(null));
+                transformers = LUCENE_ELASTIC_TRANSFORMERS.entrySet().stream()
+                        .filter(k -> k.getKey().isAssignableFrom(tff))
+                        .map(Map.Entry::getValue)
+                        .collect(Collectors.toList());
             } catch (Exception e) {
-                LOG.warn("unable to get the filter name using reflection. Try 
using the normalized node name", e);
+                LOG.warn("unable introspect lucene internal factories to 
perform transformations. " +
+                        "Current configuration will be used: {}", 
e.getMessage());

Review Comment:
   Maybe log more information about the exception? The message doesn't even 
include the exception type. Could be useful to also have the stack trace.



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzerMappings.java:
##########
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.jackrabbit.oak.plugins.index.elastic.index;
+
+import org.apache.lucene.analysis.charfilter.MappingCharFilterFactory;
+import org.apache.lucene.analysis.cjk.CJKBigramFilterFactory;
+import org.apache.lucene.analysis.commongrams.CommonGramsFilterFactory;
+import org.apache.lucene.analysis.en.AbstractWordsFileFilterFactory;
+import org.apache.lucene.analysis.minhash.MinHashFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.KeepWordFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.KeywordMarkerFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.LengthFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilterFactory;
+import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
+import org.apache.lucene.analysis.ngram.NGramFilterFactory;
+import org.apache.lucene.analysis.pattern.PatternCaptureGroupFilterFactory;
+import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilterFactory;
+import org.apache.lucene.analysis.shingle.ShingleFilterFactory;
+import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
+import org.apache.lucene.analysis.util.AbstractAnalysisFactory;
+import org.apache.lucene.analysis.util.ElisionFilterFactory;
+
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+public class ElasticCustomAnalyzerMappings {
+
+    /*
+     * Compatibility mappings between the different lucene versions
+     */
+    protected static final Map<Class<? extends AbstractAnalysisFactory>, 
Map<String, String>> LUCENE_VERSIONS;
+
+    static {
+        LUCENE_VERSIONS = new LinkedHashMap<>();
+        LUCENE_VERSIONS.put(CommonGramsFilterFactory.class, 
Map.of("query_mode", ""));
+        LUCENE_VERSIONS.put(AbstractWordsFileFilterFactory.class, 
Map.of("enablePositionIncrements", ""));
+        LUCENE_VERSIONS.put(SynonymFilterFactory.class, Map.of("lenient", ""));
+        LUCENE_VERSIONS.put(EdgeNGramFilterFactory.class, 
Map.of("preserveOriginal", ""));
+        LUCENE_VERSIONS.put(LengthFilterFactory.class, 
Map.of("enablePositionIncrements", ""));
+        LUCENE_VERSIONS.put(NGramFilterFactory.class, Map.of("keepShortTerm", 
"", "preserveOriginal", ""));
+    }

Review Comment:
   Convert to immutable list.



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzerMappings.java:
##########
@@ -0,0 +1,201 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.jackrabbit.oak.plugins.index.elastic.index;
+
+import org.apache.lucene.analysis.charfilter.MappingCharFilterFactory;
+import org.apache.lucene.analysis.cjk.CJKBigramFilterFactory;
+import org.apache.lucene.analysis.commongrams.CommonGramsFilterFactory;
+import org.apache.lucene.analysis.en.AbstractWordsFileFilterFactory;
+import org.apache.lucene.analysis.minhash.MinHashFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.KeepWordFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.KeywordMarkerFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.LengthFilterFactory;
+import org.apache.lucene.analysis.miscellaneous.LimitTokenCountFilterFactory;
+import org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory;
+import org.apache.lucene.analysis.ngram.NGramFilterFactory;
+import org.apache.lucene.analysis.pattern.PatternCaptureGroupFilterFactory;
+import org.apache.lucene.analysis.payloads.DelimitedPayloadTokenFilterFactory;
+import org.apache.lucene.analysis.shingle.ShingleFilterFactory;
+import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
+import org.apache.lucene.analysis.util.AbstractAnalysisFactory;
+import org.apache.lucene.analysis.util.ElisionFilterFactory;
+
+import java.util.ArrayList;
+import java.util.LinkedHashMap;
+import java.util.List;
+import java.util.Map;
+import java.util.function.BiFunction;
+import java.util.function.Function;
+
+public class ElasticCustomAnalyzerMappings {
+
+    /*
+     * Compatibility mappings between the different lucene versions
+     */
+    protected static final Map<Class<? extends AbstractAnalysisFactory>, 
Map<String, String>> LUCENE_VERSIONS;
+
+    static {
+        LUCENE_VERSIONS = new LinkedHashMap<>();
+        LUCENE_VERSIONS.put(CommonGramsFilterFactory.class, 
Map.of("query_mode", ""));
+        LUCENE_VERSIONS.put(AbstractWordsFileFilterFactory.class, 
Map.of("enablePositionIncrements", ""));
+        LUCENE_VERSIONS.put(SynonymFilterFactory.class, Map.of("lenient", ""));
+        LUCENE_VERSIONS.put(EdgeNGramFilterFactory.class, 
Map.of("preserveOriginal", ""));
+        LUCENE_VERSIONS.put(LengthFilterFactory.class, 
Map.of("enablePositionIncrements", ""));
+        LUCENE_VERSIONS.put(NGramFilterFactory.class, Map.of("keepShortTerm", 
"", "preserveOriginal", ""));
+    }
+
+    /*
+     * Transform function for lucene parameters to elastic
+     */
+    protected static final Map<Class<? extends AbstractAnalysisFactory>, 
Function<Map<String, Object>, Map<String, Object>>> LUCENE_ELASTIC_TRANSFORMERS;
+
+    static {
+        BiFunction<Map<String, Object>, Map<String, String>, Map<String, 
Object>> reKey = (luceneParams, keys) -> {

Review Comment:
   Functional interface? It's hard to understand what this function is supposed 
to do.



##########
oak-search-elastic/src/main/java/org/apache/jackrabbit/oak/plugins/index/elastic/index/ElasticCustomAnalyzer.java:
##########
@@ -218,30 +281,62 @@ private static String normalize(String value) {
     }
 
     private static Map<String, Object> convertNodeState(NodeState state) {
-        return convertNodeState(state, Collections.emptyMap());
+        return convertNodeState(state, List.of(), List.of());
     }
 
-    private static Map<String, Object> convertNodeState(NodeState state, 
Map<String, String> mapping) {
-        return 
StreamSupport.stream(Spliterators.spliteratorUnknownSize(state.getProperties().iterator(),
 Spliterator.ORDERED), false)
+    private static Map<String, Object> convertNodeState(NodeState state, 
List<Function<Map<String, Object>, Map<String, Object>>> transformers,
+                                                        List<String> 
preloadedContent) {
+        Map<String, Object> luceneParams = 
StreamSupport.stream(Spliterators.spliteratorUnknownSize(state.getProperties().iterator(),
 Spliterator.ORDERED), false)

Review Comment:
   This is a personal preference, but I find it simper and more readable to 
just use the pre-Java 8 iteration style of `while(iter.hasNext())...` than this 
somewhat complex way of converting an iterator to a stream.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to