PG1204 opened a new issue, #5277:
URL: https://github.com/apache/texera/issues/5277
### Task Summary
### Feature Summary
The HuggingFace inference operator (#5041) needs to cover ~20 HF pipeline
tasks (text-generation, image-classification, ASR, text-to-image, …). To land
it cleanly and let the per-task work proceed in parallel, the operator is
introduced via a dispatcher + per-task codegen architecture: a thin
`HuggingFaceInferenceOpDesc` selects a `TaskCodegen` based on the configured
task, and the selected codegen contributes the per-task Python payload + parse
snippets. Shared infrastructure (provider fallback, HTTP loop, response-parsing
framework) lives in `PythonCodegenBase`.
This issue covers shipping the dispatcher pattern + the first task family
(text-generation) end-to-end. Subsequent child issues add the image, audio /
media-generation, and QA / ranking task families by introducing new `*Codegen`
objects and registering them in the dispatcher map. The architecture lets each
task-family PR stay focused: a new task family means one new file plus one
entry in the dispatcher map — no surgery on the shared infrastructure or other
codegens.
Concretely, landing this would enable:
- A working HuggingFace operator on the workspace for text-generation tasks
against HF Hub and any OpenAI-compatible third-party provider (Cerebras, Groq,
Sambanova, Together, …).
- A clean extension point for the image / audio / QA task families to plug
into via subsequent PRs without modifying the operator class or the shared
Python infrastructure.
### Proposed Solution or Design
1. New files under
`common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/`:
- `HuggingFaceInferenceOpDesc.scala` — thin (~180-line) dispatcher
holding the `@JsonProperty` fields and the `registeredCodegens` map.
- `codegen/TaskCodegen.scala` — trait + `CodegenContext` case class;
default `tasks: Set[String] = Set(task)` for single-task codegens, overridable
by multi-task codegens.
- `codegen/PythonCodegenBase.scala` — shared provider-fallback (HF router
+ OpenAI-compatible third-party providers), `process_table` loop,
`_parse_response` framework, with two holes for the per-task payload + parse
snippets.
- `codegen/TextGenCodegen.scala` — text-generation's chat-completions
payload and `body["choices"][0]["message"]["content"]` parse.
2. Register `HuggingFaceInferenceOpDesc` in `LogicalOp.scala`'s
`@JsonSubTypes`.
3. Design constraints baked into the codegen:
- **Safe codegen via `EncodableString` + `pyb"..."`:** user-input string
fields are typed as `EncodableString` (`String @EncodableStringAnnotation`);
the `pyb` macro emits them as `self.decode_python_template('<base64>')` runtime
expressions instead of raw Python literals, so they never appear in the
generated source as-is. This is what satisfies `PythonCodeRawInvalidTextSpec`'s
leakage check.
- **Constants in `open(self)`:** per-instance attributes
(`self.MODEL_ID`, `self.PROMPT_COLUMN`, …) are assigned in the lifecycle method
so `self` is in scope for the decode call.
- **Codegen totality:** `generatePythonCode` never throws on arbitrary
`@JsonProperty` values — unknown task strings fall back to `TextGenCodegen`,
and the generated Python's `else` branch produces a generic `{"inputs":
prompt_value}` payload, matching the original monolithic operator's behavior.
Required by the regression test contract.
- **Defensive `MODEL_ID` validation at runtime:** generated Python
rejects malformed model IDs (path-traversal segments, query strings, fragments,
control characters) with a clear `ValueError` before any HF URL is composed.
References:
- Parent issue: #5041
- Stacked on: #5124 (REST resource — issue #5134)
- HF Inference Providers API: https://huggingface.co/docs/inference-providers
### Impact / Priority
(P2) Medium — required for the HuggingFace inference operator (#5041) to
function. Does not affect existing functionality.
### Affected Area
Workflow Engine (Amber) — operator descriptor + Python codegen.
### Task Type
- [ ] Refactor / Cleanup
- [ ] DevOps / Deployment / CI
- [ ] Testing / QA
- [ ] Documentation
- [ ] Performance
- [x] Other
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]