GitHub user carloea2 added a comment to the discussion: Task ideas for the
dkNet-AI · Apache Texera Agent Hackathon
I implemented a prototype **Compiled C++ UDF Operator** for Texera.

The operator lets users write C++ directly inside the workflow UI, configure
input/output columns, compile the code, and run it as a workflow operator. The
MVP supports typed tuple/batch/table-style APIs, compiler error reporting,
retained input columns, configurable compiler flags, a timeout, and batching.

To demonstrate why this is useful, I benchmarked the same deterministic,
CPU-heavy matrix multiplication workload across a C++ UDF, a Java UDF, and a
Python UDF. Each runtime receives its own independent source with the same
seed and workload.
### Benchmark Results
| Runtime | Avg ms / row | Min ms | Max ms | Runs | Speedup vs Python |
|---|---:|---:|---:|---:|---:|
| C++ | 0.122 | 0.103 | 0.477 | 80 | 132.7x |
| Java | 1.073 | 0.388 | 5.673 | 80 | 15.1x |
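| Python | 16.220 | 9.235 | 34.053 | 80 | 1.0x |

On this workload, compiled C++ averaged roughly 133x faster per row than the
Python UDF and about 9x faster than the Java UDF. That is the motivation:
Python is great for usability, but compiled C++ can provide major speedups for
CPU-heavy UDFs while staying inside Texera’s visual workflow model.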
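
The speedup column can be recomputed from the average timings as a quick sanity check. The snippet below is a minimal sketch, not part of the operator: the helper name `speedup_vs` is hypothetical, and the only inputs are the rounded averages copied from the table.

```python
# Average ms per row, copied from the benchmark table (rounded values).
avg_ms = {"C++": 0.122, "Java": 1.073, "Python": 16.220}


def speedup_vs(baseline, averages):
    """Return baseline_avg / runtime_avg for every runtime (hypothetical helper)."""
    base = averages[baseline]
    return {name: base / avg for name, avg in averages.items()}


speedups = speedup_vs("Python", avg_ms)
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x")
```

Recomputing from the rounded averages gives about 133x for C++, slightly above the 132.7x in the table. The most likely explanation is that the table's speedups were computed from unrounded timings, but that is an assumption.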
### C++ UDF
```cpp
#include <chrono>

class MatrixMultiplyOperator : public texera::UDFOperator {
public:
    texera::TupleOutput process_tuple(const texera::Tuple& tuple, int port) override {
        int trial = tuple.get("trial").as_int();
        int n = tuple.get("matrix_size").as_int();
        long long seed = tuple.get("seed").as_long();

        // Time only the multiplication kernel, not field access or output construction.
        auto start = std::chrono::high_resolution_clock::now();
        double checksum = 0.0;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < n; j++) {
                double cell = 0.0;
                for (int k = 0; k < n; k++) {
                    // Matrix entries are generated on the fly from the seed;
                    // the LL literals keep the arithmetic in 64 bits.
                    double a = ((seed + trial * 97LL + i * 31LL + k * 17LL) % 1000LL) / 1000.0;
                    double b = ((seed + trial * 53LL + k * 13LL + j * 29LL) % 1000LL) / 1000.0;
                    cell += a * b;
                }
                checksum += cell;
            }
        }
        auto end = std::chrono::high_resolution_clock::now();
        double elapsed_ms = std::chrono::duration<double, std::milli>(end - start).count();

        // Output values in order: runtime label, checksum, elapsed time in ms.
        return { texera::TupleLike{
            texera::Value::string_value("cpp"),
            texera::Value::double_value(checksum),
            texera::Value::double_value(elapsed_ms)
        }};
    }
};

using TexeraUDFOperator = MatrixMultiplyOperator;
```
### Java UDF
```java
// Note: Arrays below is java.util.Arrays and must be imported by the enclosing class.
public TupleLike processTuple(Tuple tuple) {
    int trial = ((Number) tuple.getField("trial")).intValue();
    int n = ((Number) tuple.getField("matrix_size")).intValue();
    long seed = ((Number) tuple.getField("seed")).longValue();

    // Time only the multiplication kernel.
    long start = System.nanoTime();
    double checksum = 0.0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double cell = 0.0;
            for (int k = 0; k < n; k++) {
                double a = ((seed + trial * 97L + i * 31L + k * 17L) % 1000L) / 1000.0;
                double b = ((seed + trial * 53L + k * 13L + j * 29L) % 1000L) / 1000.0;
                cell += a * b;
            }
            checksum += cell;
        }
    }
    double elapsedMs = (System.nanoTime() - start) / 1_000_000.0;

    // Keep the input fields and append runtime label, checksum, and elapsed ms.
    Object[] inputFields = tuple.getFields();
    Object[] outputFields = Arrays.copyOf(inputFields, inputFields.length + 3);
    outputFields[inputFields.length] = "java";
    outputFields[inputFields.length + 1] = checksum;
    outputFields[inputFields.length + 2] = elapsedMs;
    return TupleLike$.MODULE$.apply(Arrays.asList(outputFields));
}
```
### Python UDF
```python
from pytexera import *
import time


class ProcessTupleOperator(UDFOperatorV2):
    @overrides
    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
        trial = int(tuple_["trial"])
        n = int(tuple_["matrix_size"])
        seed = int(tuple_["seed"])

        # Time only the multiplication kernel.
        start = time.perf_counter()
        checksum = 0.0
        for i in range(n):
            for j in range(n):
                cell = 0.0
                for k in range(n):
                    a = ((seed + trial * 97 + i * 31 + k * 17) % 1000) / 1000.0
                    b = ((seed + trial * 53 + k * 13 + j * 29) % 1000) / 1000.0
                    cell += a * b
                checksum += cell
        elapsed_ms = (time.perf_counter() - start) * 1000.0

        # Keep the input fields and append runtime label, checksum, and elapsed ms.
        output = tuple_.as_dict()
        output["runtime"] = "python"
        output["checksum"] = checksum
        output["elapsed_ms"] = elapsed_ms
        yield output
```
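Since all three UDFs emit a `checksum`, one way to confirm they really ran the same workload is to compare those values against a reference computed outside Texera. The following is a minimal sketch under stated assumptions: `matrix_checksum` and `checksums_agree` are hypothetical helper names, the kernel is copied from the UDFs above with the Texera types and timing removed, and `trial`, `n`, and `seed` are assumed to be non-negative.

```python
import math


def matrix_checksum(trial, n, seed):
    """Same kernel as the three UDFs, without Texera types or timing.

    Assumes non-negative inputs: Python's % never returns a negative result
    for a positive modulus, while C++ and Java truncate toward zero, so the
    runtimes would diverge on a negative seed.
    """
    checksum = 0.0
    for i in range(n):
        for j in range(n):
            cell = 0.0
            for k in range(n):
                a = ((seed + trial * 97 + i * 31 + k * 17) % 1000) / 1000.0
                b = ((seed + trial * 53 + k * 13 + j * 29) % 1000) / 1000.0
                cell += a * b
            checksum += cell
    return checksum


def checksums_agree(values, rel_tol=1e-9):
    """True if every checksum is within rel_tol of the first one."""
    first = values[0]
    return all(math.isclose(first, v, rel_tol=rel_tol) for v in values)


# With n=1, trial=0, seed=500 both entries are 0.5, so the checksum is 0.25.
print(matrix_checksum(0, 1, 500))  # 0.25
```

A tolerance is used rather than exact equality because compiler flags can change floating-point rounding in the C++ build (for example through fused multiply-add), even though the summation order is the same in all three kernels.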
https://github.com/user-attachments/assets/10fc6eee-8742-403e-9b11-11221edb1b9b

GitHub link:
https://github.com/apache/texera/discussions/5059#discussioncomment-16926806