Vikas-l opened a new issue, #6265:
URL: https://github.com/apache/hop/issues/6265
### Apache Hop version?
hop 2.16.0
### Java version?
java version "17.0.13" 2024-10-15 LTS
### Operating system
Windows
### What happened?
We are observing a noticeable performance difference in the **Filter Rows**
transform when comparing Apache Hop and Pentaho Kettle under the same
conditions.
For a pipeline processing approximately **100,000 rows**, the Filter Rows
transform in:
- **Apache Hop 2.16.0** takes around **6 seconds**
- **Pentaho Kettle** takes around **2 seconds**
Both pipelines use:
- The same input data
- The same filter condition
- The same execution environment and JVM
This difference is consistently reproducible across multiple runs.
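To put the gap in throughput terms, here is a minimal sketch that converts the approximate timings above into rows per second and a slowdown factor. The class and method names are made up for illustration, and the inputs are the rounded figures from this report rather than precise measurements.

```java
public class FilterRowsThroughput {
    /** Rows processed per second for a given row count and elapsed time. */
    static double rowsPerSecond(long rows, double seconds) {
        return rows / seconds;
    }

    /** How many times slower the first timing is compared to the second. */
    static double slowdown(double slowSeconds, double fastSeconds) {
        return slowSeconds / fastSeconds;
    }

    public static void main(String[] args) {
        long rows = 100_000;         // approximate row count from this report
        double hopSeconds = 6.0;     // approximate Filter Rows time in Hop 2.16.0
        double kettleSeconds = 2.0;  // approximate Filter Rows time in Pentaho Kettle

        System.out.printf("Hop:      %.0f rows/s%n", rowsPerSecond(rows, hopSeconds));
        System.out.printf("Kettle:   %.0f rows/s%n", rowsPerSecond(rows, kettleSeconds));
        System.out.printf("Slowdown: %.1fx%n", slowdown(hopSeconds, kettleSeconds));
    }
}
```

With these rounded numbers, Hop handles roughly 16,700 rows per second against roughly 50,000 for Kettle, which is about a 3x difference.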
### Use case
The pipeline consists of:
1. A Text File Input step reading ~100,000 rows
2. A Constant step (no complex expressions)
3. A single Filter Rows transform with a simple condition (e.g. numeric or
string comparison)
4. A dummy/output step
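For anyone who wants to reproduce this without the attached files, the following is a rough sketch of a generator for a comparable input file. The file name and the columns (`id`, `amount`, `category`) are hypothetical and are not the data used in this report; they are only meant to support a simple numeric or string filter condition.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SampleInputGenerator {
    /** Writes a header line plus `rows` data lines; the column names are hypothetical. */
    static void write(Path target, int rows) throws IOException {
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write("id,amount,category");
            out.newLine();
            for (int i = 1; i <= rows; i++) {
                // amount cycles through 0..999, category alternates between B and A
                out.write(i + "," + (i % 1000) + "," + (i % 2 == 0 ? "A" : "B"));
                out.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Default file name is an assumption; pass a path as the first argument to override it
        Path target = Path.of(args.length > 0 ? args[0] : "filter-rows-input.csv");
        write(target, 100_000);
        System.out.println("Wrote " + target.toAbsolutePath());
    }
}
```

A condition such as `amount > 500` or `category = A` on this file should exercise the same kind of simple comparison described above.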
### Run configuration (local engine)
{
  "engineRunConfiguration": {
    "Local": {
      "feedback_size": "50000",
      "sample_size": "100",
      "sample_type_in_gui": "Last",
      "wait_time": "10",
      "rowset_size": "50000",
      "safe_mode": false,
      "show_feedback": false,
      "topo_sort": false,
      "gather_metrics": false,
      "transactional": false
    }
  },
  "defaultSelection": true,
  "configurationVariables": [],
  "name": "local",
  "description": "",
  "dataProfile": "",
  "executionInfoLocationName": ""
}
### Additional observation
If we replace the Filter Rows transform in Apache Hop with a Java Filter
transform, the execution time drops to roughly 2 seconds, almost identical to
Pentaho Kettle. This suggests the slowdown is specific to the Filter Rows
transform in Hop.
### Impact
This performance gap becomes significant when processing larger datasets and
when migrating existing Pentaho Kettle transformations to Apache Hop.
### Screenshots / Evidence
The screenshots below show the execution time of the Filter Rows transform:
- Apache Hop pipeline metrics showing ~6 seconds for Filter Rows
<img width="1907" height="697" alt="Image"
src="https://github.com/user-attachments/assets/0ebeb02d-34b1-4de0-954b-45fe71eff9ac"
/>
- Pentaho Kettle metrics showing ~2 seconds for the same step
<img width="1906" height="1015" alt="Image"
src="https://github.com/user-attachments/assets/eb9c3be5-e797-4c0a-925f-13f71bdd06a7"
/>
### Attachments
Apache Hop pipeline file using the Filter Rows step: [Google Drive
Link](https://drive.google.com/drive/folders/1KUz1ufSe_TiVgJlmcBppA5PKzl5tl4zB?usp=drive_link)
Apache Hop pipeline file using the Java Filter step: [Google Drive
Link](https://drive.google.com/drive/folders/1SEr9pHMLrR7X9occqgYOTE7d0lqmaBuW?usp=drive_link)
### Additional Notes
- Using Java Filter in Hop gives performance similar to Kettle.
- We suggest investigating potential inefficiencies in the Filter Rows
transform implementation in Hop.
### Issue Priority
Priority: 1
### Issue Component
Component: Transforms