The GitHub Actions job "Required Checks" on 
texera.git/gh-readonly-queue/main/pr-5278-3dab771a2fe3ea5bf97c4c69cfbd761f9cd01e54
 has succeeded.
Run started by GitHub user xuang7 (triggered by xuang7).

Head commit for run:
2b9add956c9e63c3c4f6e717221a0c5e33e54875 / Prateek Ganigi 
<[email protected]>
feat(huggingFace): refactor operator into per-task codegen + text-generation 
(#5278)

> ⚠️ This PR is stacked on #5124. Until that lands, the diff below
includes #5124's `HuggingFaceModelResource.scala` and the 1-line
registration in `TexeraWebApplication.scala`. The new code in this PR is
everything under
`common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/`
and the new test under
`common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala`.
Once #5124 merges, this diff will auto-clean to ~839 lines.

### What changes were proposed in this PR?

Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from
the team's feature branch into a dispatcher + per-task codegen
architecture and ships the first task family (text-generation):

- `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext`
that model per-task variation.
- `codegen/PythonCodegenBase.scala` emits the shared provider-fallback /
`process_table` / `_parse_response` infrastructure with two holes for
the per-task payload and parse snippets.
- `codegen/TextGenCodegen.scala` supplies text-generation's
chat-completions payload and the `body["choices"][0
["message"]["content"]` parse branch.
- `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line)
dispatcher holding the `@JsonProperty` fields and the
`registeredCodegens` map.

User-input string fields are typed `EncodableString` and emitted via the
`pyb"..."` macro so values reach Python as
`self.decode_python_template('<base64>')` rather than raw literals.
Class constants are assigned in `open(self)` so `self` is in scope for
the decode call. The generated `process_table` runs a defensive
`_HF_MODEL_ID_PATTERN` check at runtime before any HF URL is composed.

The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a
single codegen can register under multiple task strings, this becomes
relevant in PR 3 (image family).

### Any related issues, documentation, or discussions?

Tracked in #5277 & #5041(umbrella issue for the HuggingFace operator
end-to-end implementation).

Closes #5277 

Stacked on #5124 (PR 1 - REST resource).

This is PR 2 of a multi-PR series landing the HuggingFace operator
end-to-end. The full plan and umbrella issue live separately; this PR's
scope is exactly the dispatcher pattern + text-generation codegen.

### How was this PR tested?

- `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
- `sbt scalafmtCheck` clean.
- `sbt "WorkflowOperator/testOnly
org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"`
- 10/10 pass (operator info, validation, codegen wiring, MODEL_ID
runtime check, leak-prevention, clamping, schema).
- `sbt "WorkflowOperator/testOnly
org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117
descriptors `py_compile` cleanly, no raw-text leaks. The new operator is
included in this scan.
- Generated Python verified via `python3 -m py_compile` on a sample
output.

### Was this PR authored or co-authored using generative AI tooling?

Co-authored with Claude Opus 4.7

---------

Co-authored-by: Elliot Lin <[email protected]>
Co-authored-by: Claude Opus 4.7 (1M context) <[email protected]>
Co-authored-by: Xuan Gu <[email protected]>

Report URL: https://github.com/apache/texera/actions/runs/27580569890

With regards,
GitHub Actions via GitBox

Reply via email to