Vikas-l opened a new issue, #6265:
URL: https://github.com/apache/hop/issues/6265

   ### Apache Hop version?
   
   hop 2.16.0
   
   ### Java version?
   
   java version "17.0.13" 2024-10-15 LTS
   
   ### Operating system
   
   Windows
   
   ### What happened?
   
   We are observing a noticeable performance difference in the **Filter Rows** 
transform when comparing Apache Hop and Pentaho Kettle under the same 
conditions.
   
   For a pipeline processing approximately **100,000 rows**, the Filter Rows 
transform in:
   - **Apache Hop 2.16.0** takes around **6 seconds**
   - **Pentaho Kettle** takes around **2 seconds**
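
   To put the two timings on a common scale, the figures above can be 
converted to throughput. This is only a back-of-the-envelope sketch: the row 
count and durations are the approximate values quoted in this report, and the 
variable names are illustrative.

```python
ROWS = 100_000          # approximate row count from the report
HOP_SECONDS = 6.0       # Filter Rows in Apache Hop 2.16.0 (approximate)
KETTLE_SECONDS = 2.0    # Filter Rows in Pentaho Kettle (approximate)


def rows_per_second(rows: int, seconds: float) -> float:
    """Average throughput over the whole run."""
    return rows / seconds


hop_rate = rows_per_second(ROWS, HOP_SECONDS)
kettle_rate = rows_per_second(ROWS, KETTLE_SECONDS)
slowdown = HOP_SECONDS / KETTLE_SECONDS

print(f"Hop:    {hop_rate:,.0f} rows/s")      # Hop:    16,667 rows/s
print(f"Kettle: {kettle_rate:,.0f} rows/s")   # Kettle: 50,000 rows/s
print(f"Hop is {slowdown:.1f}x slower")       # Hop is 3.0x slower
```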
   
   Both pipelines use:
   - The same input data
   - The same filter condition
   - The same execution environment and JVM
   
   This difference is consistently reproducible across multiple runs.
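
   For anyone who wants to check the reproducibility independently, a 
minimal timing harness could look like the sketch below. It is not the method 
used for the numbers in this report (those come from the pipeline metrics in 
the GUI); `time_runs`, `summarize`, and `fake_pipeline` are hypothetical 
names, and the workload is a stand-in that would need to be replaced with a 
call that actually launches the pipeline.

```python
import statistics
import time
from typing import Callable, Dict, List


def time_runs(run_once: Callable[[], None], repeats: int = 5) -> List[float]:
    """Call run_once `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations: List[float]) -> Dict[str, float]:
    """Report min, median, and max so a single outlier does not dominate."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }


if __name__ == "__main__":
    # Stand-in workload (assumption); replace with a call that runs the pipeline.
    def fake_pipeline() -> None:
        sum(i * i for i in range(100_000))

    print(summarize(time_runs(fake_pipeline, repeats=5)))
```

   Note that wall-clock timing of a whole run includes JVM startup and the 
other transforms, so it is only useful for comparing runs against each other, 
not against the per-transform durations shown in the GUI.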
   
   ### Use case
   The pipeline consists of:
   1. A Text File Input step generating ~100,000 rows
   2. A Constant step (no complex expressions)
   3. A single Filter Rows transform with a simple condition (e.g. numeric or 
string comparison)
   4. A dummy/output step
   
   ### Run configuration (local engine)
   {
     "engineRunConfiguration": {
       "Local": {
         "feedback_size": "50000",
         "sample_size": "100",
         "sample_type_in_gui": "Last",
         "wait_time": "10",
         "rowset_size": "50000",
         "safe_mode": false,
         "show_feedback": false,
         "topo_sort": false,
         "gather_metrics": false,
         "transactional": false
       }
     },
     "defaultSelection": true,
     "configurationVariables": [],
     "name": "local",
     "description": "",
     "dataProfile": "",
     "executionInfoLocationName": ""
   }
   
   ### Additional observation
   If we replace the Filter Rows transform in Apache Hop with a Java Filter 
transform, the execution time drops to roughly 2 seconds, almost identical to 
Pentaho Kettle. This suggests the slowdown is specific to the Filter Rows 
transform in Hop.
   
   ### Impact
   This performance gap becomes significant when processing larger datasets and 
when migrating existing Pentaho Kettle transformations to Apache Hop.
   
   ### Screenshots / Evidence
   The screenshots below show the execution time of the Filter Rows transform:
   - Apache Hop pipeline metrics showing ~6 seconds for Filter Rows
   <img width="1907" height="697" alt="Image" src="https://github.com/user-attachments/assets/0ebeb02d-34b1-4de0-954b-45fe71eff9ac" />
    
   - Pentaho Kettle metrics showing ~2 seconds for the same step
   <img width="1906" height="1015" alt="Image" src="https://github.com/user-attachments/assets/eb9c3be5-e797-4c0a-925f-13f71bdd06a7" />
   
   ### Attachments
   Apache Hop pipeline file using the Filter Rows step: [Google Drive 
Link](https://drive.google.com/drive/folders/1KUz1ufSe_TiVgJlmcBppA5PKzl5tl4zB?usp=drive_link)
   
   Apache Hop pipeline file using the Java Filter step: [Google Drive 
Link](https://drive.google.com/drive/folders/1SEr9pHMLrR7X9occqgYOTE7d0lqmaBuW?usp=drive_link)
   
   ### Additional Notes
   - Using the Java Filter transform in Hop gives performance similar to 
Kettle.
   - We suggest investigating potential inefficiencies in the Filter Rows 
transform implementation in Hop.
   
   ### Issue Priority
   
   Priority: 1
   
   ### Issue Component
   
   Component: Transforms


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
