[
https://issues.apache.org/jira/browse/OPENNLP-1442?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17704732#comment-17704732
]
ASF GitHub Bot commented on OPENNLP-1442:
-----------------------------------------
rzo1 commented on code in PR #523:
URL: https://github.com/apache/opennlp/pull/523#discussion_r1147876849
##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package opennlp.dl;
+
+import java.io.BufferedReader;
+import java.io.File;
+import java.io.FileReader;
+import java.io.IOException;
+import java.util.HashMap;
+import java.util.Map;
+
+import ai.onnxruntime.OrtEnvironment;
+import ai.onnxruntime.OrtSession;
+
+import opennlp.tools.tokenize.Tokenizer;
+
+/**
+ * Base class for OpenNLP deep-learning classes using ONNX Runtime.
+ */
+public abstract class AbstractDL {
+
+ public static final String INPUT_IDS = "input_ids";
+ public static final String ATTENTION_MASK = "attention_mask";
+ public static final String TOKEN_TYPE_IDS = "token_type_ids";
+
+ protected OrtEnvironment env;
+ protected OrtSession session;
+ protected Tokenizer tokenizer;
+ protected Map<String, Integer> vocab;
+
+ /**
+ * Loads a vocabulary file from disk.
+ * @param vocabFile The vocabulary file.
+ * @return A map of vocabulary words to integer IDs.
+ * @throws IOException Thrown if the vocabulary file cannot be opened and read.
+ */
+ public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+ final Map<String, Integer> v = new HashMap<>();
+
+ BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));
Review Comment:
Try-with-resources or Files.readAllLines(...)? We should also define the
encoding (UTF-8?) instead of relying on the platform locale.
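A minimal sketch of what this suggestion could look like, not the code that was merged. The class name `VocabLoaderSketch` is hypothetical, and the sketch assumes the vocabulary file holds one token per line and that a token's ID is its zero-based line index; the quoted diff does not show how IDs are assigned.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for AbstractDL#loadVocab, for illustration only.
final class VocabLoaderSketch {

  // Assumption: one token per line, ID = zero-based line index.
  static Map<String, Integer> loadVocab(File vocabFile) throws IOException {
    final Map<String, Integer> vocab = new HashMap<>();

    // try-with-resources closes the reader even if readLine() throws.
    // The charset is explicit, so the result does not depend on the
    // platform default encoding.
    try (BufferedReader br =
             Files.newBufferedReader(vocabFile.toPath(), StandardCharsets.UTF_8)) {
      String line;
      int id = 0;
      while ((line = br.readLine()) != null) {
        vocab.put(line, id++);
      }
    }
    return vocab;
  }
}
```

Files.readAllLines(path, StandardCharsets.UTF_8) would work as well, but it holds the whole file in memory as a list before the map is built.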
##########
opennlp-dl/src/main/java/opennlp/dl/AbstractDL.java:
##########
@@ -0,0 +1,73 @@
+ public Map<String, Integer> loadVocab(File vocabFile) throws IOException {
+
+ final Map<String, Integer> v = new HashMap<>();
+
+ BufferedReader br = new BufferedReader(new FileReader(vocabFile.getPath()));
+ String line = br.readLine();
Review Comment:
I guess it is intended to skip the first line?
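Whether the first line is skipped depends on the loop that follows, which the quoted diff does not show. As a hedged illustration only (the class and method names are hypothetical and none of this is taken from the PR), the two common patterns behave differently:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; not part of OpenNLP.
final class FirstLineSketch {

  // Priming read: the line read before the loop is processed in the
  // first iteration, so nothing is skipped.
  static List<String> primingRead(String text) throws IOException {
    final List<String> out = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new StringReader(text))) {
      String line = br.readLine();
      while (line != null) {
        out.add(line);
        line = br.readLine();
      }
    }
    return out;
  }

  // Read, then read again in the loop condition: the first line is
  // overwritten before it is used, so it is effectively skipped.
  static List<String> skipFirst(String text) throws IOException {
    final List<String> out = new ArrayList<>();
    try (BufferedReader br = new BufferedReader(new StringReader(text))) {
      String line = br.readLine(); // consumed and discarded
      while ((line = br.readLine()) != null) {
        out.add(line);
      }
    }
    return out;
  }
}
```

If the first pattern is what the PR uses, every line ends up in the vocabulary; if the second, a short comment in the code stating that the skip is intentional would answer the question.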
> Use ONNX Runtime to support sentence-transformers
> -------------------------------------------------
>
> Key: OPENNLP-1442
> URL: https://issues.apache.org/jira/browse/OPENNLP-1442
> Project: OpenNLP
> Issue Type: Task
> Components: Deep Learning
> Reporter: Jeff Zemerick
> Assignee: Jeff Zemerick
> Priority: Major
>
> Use ONNX Runtime to support sentence-transformers. OpenNLP should be able to
> generate embeddings using an ONNX model.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)