Copilot commented on code in PR #1903:
URL: https://github.com/apache/nifi-minifi-cpp/pull/1903#discussion_r2031123269


##########
PROCESSORS.md:
##########
@@ -1727,7 +1728,42 @@ In the list below, the names of required properties appear in bold. Any other pr
 | lastModifiedTime | success      | The timestamp of when the file's content changed in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
 | creationTime     | success      | The timestamp of when the file was created in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
 | lastAccessTime   | success      | The timestamp of when the file was accessed in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
-| size             | success      | The size of the file in bytes. |
+| size             | success      | The size of the file in bytes. |
+
+## RunLlamaCppInference
+
+### Description
+
+LlamaCpp processor to use llama.cpp library for running langage model inference. The final prompt used for the inference created using the System Prompt and Prompt proprerty values and the content of the flowfile referred to as input data or flow file content.
+
+### Properties
+
+In the list below, the names of required properties appear in bold. Any other properties (not in bold) are considered optional. The table also indicates any default values, and whether a property supports the NiFi Expression Language.
+
+| Name                             | Default Value | Allowable Values | Description |
+|----------------------------------|---------------|------------------|-------------|
+| **Model Path**                   |               |                  | The filesystem path of the model file in gguf format. |
+| Temperature                      | 0.8           |                  | The temperature to use for sampling. |
+| Top K                            | 40            |                  | Limit the next token selection to the K most probable tokens. Set <= 0 value to use vocab size. |
+| Top P                            | 0.9           |                  | Limit the next token selection to a subset of tokens with a cumulative probability above a threshold P. 1.0 = disabled. |
+| Min P                            |               |                  | Sets a minimum base probability threshold for token selection. 0.0 = disabled. |
+| **Min Keep**                     | 0             |                  | If greater than 0, force samplers to return N possible tokens at minimum. |
+| **Text Context Size**            | 4096          |                  | Size of the text context, use 0 to use size set in model. |
+| **Logical Maximum Batch Size**   | 2048          |                  | Logical maximum batch size that can be submitted to the llama.cpp decode function. |
+| **Physical Maximum Batch Size**  | 512           |                  | Physical maximum batch size. |
+| **Max Number Of Sequences**      | 1             |                  | Maximum number of sequences (i.e. distinct states for recurrent models). |
+| **Threads For Generation**       | 4             |                  | Number of threads to use for generation. |
+| **Threads For Batch Processing** | 4             |                  | Number of threads to use for batch processing. |
+| **Prompt**                       |               |                  | The user prompt for the inference.<br/>**Supports Expression Language: true** |
+| **System Prompt**                | You are a helpful assisstant. You are given a question with some possible input data otherwise called flow file content. You are expected to generate a response based on the quiestion and the input data. |                  | The system prompt for the inference. |
+

Review Comment:
   [nitpick] There are spelling mistakes in the system prompt text ('assisstant' should be 'assistant' and 'quiestion' should be 'question').
   ```suggestion
   | **System Prompt**                | You are a helpful assistant. You are given a question with some possible input data otherwise called flow file content. You are expected to generate a response based on the question and the input data. |                  | The system prompt for the inference. |
   ```



##########
PROCESSORS.md:
##########
@@ -1727,7 +1728,42 @@ In the list below, the names of required properties appear in bold. Any other pr
 | lastModifiedTime | success      | The timestamp of when the file's content changed in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
 | creationTime     | success      | The timestamp of when the file was created in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
 | lastAccessTime   | success      | The timestamp of when the file was accessed in the filesystem as 'yyyy-MM-dd'T'HH:mm:ss'. |
-| size             | success      | The size of the file in bytes. |
+| size             | success      | The size of the file in bytes. |
+
+## RunLlamaCppInference
+
+### Description
+
+LlamaCpp processor to use llama.cpp library for running langage model inference. The final prompt used for the inference created using the System Prompt and Prompt proprerty values and the content of the flowfile referred to as input data or flow file content.
+

Review Comment:
   The word 'langage' appears to be a typo. Consider replacing it with 'language'.
   ```suggestion
   LlamaCpp processor to use llama.cpp library for running language model inference. The final prompt used for the inference created using the System Prompt and Prompt proprerty values and the content of the flowfile referred to as input data or flow file content.
   ```



##########
extensions/llamacpp/processors/DefaultLlamaContext.cpp:
##########
@@ -0,0 +1,144 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+#include "DefaultLlamaContext.h"
+#include "Exception.h"
+#include "fmt/format.h"
+
+namespace org::apache::nifi::minifi::extensions::llamacpp::processors {
+
+namespace {
+std::vector<llama_token> tokenizeInput(const llama_vocab* vocab, const std::string& input) {
+  int32_t number_of_tokens = gsl::narrow<int32_t>(input.length()) + 2;
+  std::vector<llama_token> tokenized_input(number_of_tokens);
+  number_of_tokens = llama_tokenize(vocab, input.data(), gsl::narrow<int32_t>(input.length()), tokenized_input.data(), gsl::narrow<int32_t>(tokenized_input.size()), true, true);
+  if (number_of_tokens < 0) {
+    tokenized_input.resize(-number_of_tokens);
+    [[maybe_unused]] int32_t check = llama_tokenize(vocab, input.data(), gsl::narrow<int32_t>(input.length()), tokenized_input.data(), gsl::narrow<int32_t>(tokenized_input.size()), true, true);
+    gsl_Assert(check == -number_of_tokens);
+  } else {
+    tokenized_input.resize(number_of_tokens);
+  }
+  return tokenized_input;
+}
+}  // namespace
+}  // namespace
+
+
+DefaultLlamaContext::DefaultLlamaContext(const std::filesystem::path& model_path, const LlamaSamplerParams& llama_sampler_params, const LlamaContextParams& llama_ctx_params) {
+  llama_backend_init();
+
+  llama_model_ = llama_model_load_from_file(model_path.string().c_str(), llama_model_default_params());  // NOLINT(cppcoreguidelines-prefer-member-initializer)
+  if (!llama_model_) {
+    throw Exception(ExceptionType::PROCESS_SCHEDULE_EXCEPTION, fmt::format("Failed to load model from '{}'", model_path.string()));
+  }
+
+  llama_context_params ctx_params = llama_context_default_params();
+  ctx_params.n_ctx = llama_ctx_params.n_ctx;
+  ctx_params.n_batch = llama_ctx_params.n_batch;
+  ctx_params.n_ubatch = llama_ctx_params.n_ubatch;
+  ctx_params.n_seq_max = llama_ctx_params.n_seq_max;
+  ctx_params.n_threads = llama_ctx_params.n_threads;
+  ctx_params.n_threads_batch = llama_ctx_params.n_threads_batch;
+  ctx_params.flash_attn = false;
+  llama_ctx_ = llama_init_from_model(llama_model_, ctx_params);
+
+  auto sparams = llama_sampler_chain_default_params();
+  llama_sampler_ = llama_sampler_chain_init(sparams);
+
+  if (llama_sampler_params.min_p) {
+    llama_sampler_chain_add(llama_sampler_, llama_sampler_init_min_p(*llama_sampler_params.min_p, llama_sampler_params.min_keep));
+  }
+  if (llama_sampler_params.top_k) {
+    llama_sampler_chain_add(llama_sampler_, llama_sampler_init_top_k(*llama_sampler_params.top_k));
+  }
+  if (llama_sampler_params.top_p) {
+    llama_sampler_chain_add(llama_sampler_, llama_sampler_init_top_p(*llama_sampler_params.top_p, llama_sampler_params.min_keep));
+  }
+  if (llama_sampler_params.temperature) {
+    llama_sampler_chain_add(llama_sampler_, llama_sampler_init_temp(*llama_sampler_params.temperature));
+  }
+  llama_sampler_chain_add(llama_sampler_, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));
+}
+
+DefaultLlamaContext::~DefaultLlamaContext() {
+  llama_sampler_free(llama_sampler_);
+  llama_sampler_ = nullptr;
+  llama_free(llama_ctx_);
+  llama_ctx_ = nullptr;
+  llama_model_free(llama_model_);
+  llama_model_ = nullptr;
+  llama_backend_free();
+}
+
+std::string DefaultLlamaContext::applyTemplate(const std::vector<LlamaChatMessage>& messages) {
+  std::vector<llama_chat_message> llama_messages;
+  llama_messages.reserve(messages.size());
+  for (auto& msg : messages) {
+    llama_messages.push_back(llama_chat_message{.role = msg.role.c_str(), .content = msg.content.c_str()});
+  }
+  std::string text;
+  const char * chat_template = llama_model_chat_template(llama_model_, nullptr);
+  int32_t res_size = llama_chat_apply_template(chat_template, llama_messages.data(), llama_messages.size(), true, text.data(), gsl::narrow<int32_t>(text.size()));

Review Comment:
   The call to llama_chat_apply_template passes an empty string buffer (text.data() with text.size() == 0). Allocate or resize the buffer before calling the function to safely write the output.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
