aicam opened a new issue, #4198:
URL: https://github.com/apache/texera/issues/4198
### Feature Summary
## Description
We propose a standardized way for users to bring their own Machine Learning
(ML) models into the Texera platform. To achieve this, we need to adopt a
unified protocol covering the entire model lifecycle: saving, loading, and
execution.
After evaluating several standards, we recommend integrating **MLflow** as
the core protocol for model management in Texera.
## Motivation & User Personas
Texera serves two primary user groups with distinct needs:
1. **Students**, who use the platform to learn the fundamentals of Machine
Learning and Data Science.
2. **Biomedical engineers**, who require heavy computation for tasks such
as sequence alignment and "shallow" machine learning (e.g., Scikit-Learn,
classic statistical models).
Currently, there is no standardized way for these users to import and run
pre-trained models seamlessly. Implementing a standard protocol will streamline
this workflow and enhance Texera's extensibility.
## Evaluation of Alternatives
We explored several options before selecting MLflow:
* **Hugging Face:**
* *Pros:* Excellent standards and ease of use; industry standard for
LLMs.
* *Cons:* Primarily focused on LLMs and Deep Learning. It does not offer
a comprehensive solution for managing the full lifecycle (storage to loading)
of general-purpose or "shallow" ML models often used by our target audience.
* **ONNX (Open Neural Network Exchange):**
* *Pros:* Great interoperability for deep learning models.
* *Cons:* Heavily focused on Neural Networks, making it less suitable
for the broad range of general ML libraries (like Scikit-Learn) that our
biomedical users rely on.
* **MLflow (Selected):**
* *Pros:* Supports a wide variety of libraries including TensorFlow,
PyTorch, and Scikit-Learn. Crucially, it manages the *entire* lifecycle from
standardizing the storage format to loading the model for inference.
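The lifecycle MLflow standardizes can be illustrated with a small sketch. The two helpers below wrap the save and load halves of the roundtrip; they assume the `mlflow` package is installed (imports are deferred so the helpers can be defined without it), and the function names are ours, not Texera's or MLflow's:

```python
def save_model(sk_model, output_dir):
    """Serialize a fitted Scikit-Learn model in the standard MLflow format.

    The output directory will contain an `MLmodel` metadata file plus the
    serialized model -- the directory structure Texera would store in LakeFS.
    Assumes the `mlflow` package is installed.
    """
    import mlflow.sklearn
    mlflow.sklearn.save_model(sk_model, output_dir)


def load_model(model_uri):
    """Load any MLflow model (Scikit-Learn, PyTorch, TensorFlow, ...)
    behind the framework-agnostic `pyfunc` interface for inference."""
    import mlflow.pyfunc
    return mlflow.pyfunc.load_model(model_uri)
```

The `pyfunc` loading path is what makes MLflow attractive here: the consumer of the model does not need to know which library produced it.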
### Proposed Solution or Design
## Proposed Implementation
The integration will leverage two existing architectural features within
Texera:
### 1. Model Storage (via LakeFS)
* We will utilize our existing **LakeFS** integration to store MLflow
artifacts.
* Models will be stored similarly to how we handle datasets, but with a key
difference: we will enforce the MLflow protocol/structure on the files during
upload to ensure compatibility.
### 2. Model Execution (New Operator)
* We will introduce a new operator type: `MLflow`.
* This will be built upon our existing **Python Native Operator**
infrastructure.
* The operator will automatically handle loading the model using the
standard `mlflow` library and executing inference against the input data stream.
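As a rough sketch of the operator's core behavior, the class below loads a model through MLflow's framework-agnostic `pyfunc` interface and applies it to each incoming batch. The class and method names are illustrative only; the real implementation would extend Texera's Python native operator base class, whose interface differs:

```python
class MLflowInferenceOperator:
    """Illustrative sketch of the proposed `MLflow` operator (not the
    actual Texera UDF base class)."""

    def __init__(self, model_uri: str):
        # e.g. a LakeFS-backed artifact path, or a local model directory
        self.model_uri = model_uri
        self._model = None

    def open(self):
        # Deferred import: the operator can be constructed without mlflow.
        import mlflow.pyfunc
        self._model = mlflow.pyfunc.load_model(self.model_uri)

    def process(self, batch):
        # `batch` is assumed to be a pandas DataFrame from the input
        # stream; pyfunc models accept DataFrames regardless of the
        # library that produced the model.
        batch["prediction"] = self._model.predict(batch)
        return batch

    def close(self):
        self._model = None
```

Because `pyfunc` hides the underlying framework, a single operator implementation covers Scikit-Learn, PyTorch, and TensorFlow models alike.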


### Impact / Priority
(P2) Medium – useful enhancement
### Affected Area
Workflow Engine (Amber)