[PR] feat(huggingFace): refactor operator into per-task codegen + text-generation [texera]

via GitHub Thu, 28 May 2026 12:01:33 -0700


PG1204 opened a new pull request, #5278:
URL: https://github.com/apache/texera/pull/5278


   > ⚠️ This PR is stacked on #5124. Until that lands, the diff below includes 
#5124's `HuggingFaceModelResource.scala` and the 1-line registration in 
`TexeraWebApplication.scala`. The new code in this PR is everything under 
`common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/huggingFace/`
 and the new test under 
`common/workflow-operator/src/test/.../huggingFace/HuggingFaceInferenceOpDescSpec.scala`.
 Once #5124 merges, this diff will auto-clean to ~839 lines.
   
   ### What changes were proposed in this PR?
   
   Refactors the monolithic 1,278-line `HuggingFaceInferenceOpDesc` from the 
team's feature branch into a dispatcher + per-task codegen architecture and 
ships the first task family (text-generation):
   
   - `codegen/TaskCodegen.scala` introduces the trait + `CodegenContext`  that 
model per-task variation.
   - `codegen/PythonCodegenBase.scala` emits the shared provider-fallback / 
`process_table` / `_parse_response` infrastructure with two holes for the 
per-task payload and parse snippets.
   - `codegen/TextGenCodegen.scala` supplies text-generation's chat-completions 
payload and the `body["choices"][0 ["message"]["content"]` parse branch.
   - `HuggingFaceInferenceOpDesc.scala` becomes a thin (~180-line) dispatcher 
holding the `@JsonProperty` fields and the `registeredCodegens` map.
   
   User-input string fields are typed `EncodableString` and emitted via the 
`pyb"..."` macro so values reach Python as 
`self.decode_python_template('<base64>')` rather than raw literals. Class 
constants are assigned in `open(self)` so `self` is in scope for the decode 
call. The generated `process_table` runs a defensive `_HF_MODEL_ID_PATTERN` 
check at runtime before any HF URL is composed.
   
   The `TaskCodegen` trait also exposes a `tasks: Set[String]` default so a 
single codegen can register under multiple task strings, this becomes relevant 
in PR 3 (image family).
   
   ### Any related issues, documentation, or discussions?
   
   Tracked in #5277 & #5041(umbrella issue for the HuggingFace operator 
end-to-end implementation).
   
   Stacked on #5124 (PR 1 - REST resource).
   
   This is PR 2 of a multi-PR series landing the HuggingFace operator 
end-to-end. The full plan and umbrella issue live separately; this PR's scope 
is exactly the dispatcher pattern + text-generation codegen.
   
   ### How was this PR tested?
   
   - `sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
   - `sbt scalafmtCheck` clean.
   - `sbt "WorkflowOperator/testOnly 
org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec"` - 
10/10 pass (operator info, validation, codegen wiring, MODEL_ID runtime check, 
leak-prevention, clamping, schema).
   - `sbt "WorkflowOperator/testOnly 
org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"` - 117/117 
descriptors `py_compile` cleanly, no raw-text leaks. The new operator is 
included in this scan.
   - Generated Python verified via `python3 -m py_compile` on a sample output.
   
   ### Was this PR authored or co-authored using generative AI tooling?
   
   Co-authored with Claude Opus 4.7
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[PR] feat(huggingFace): refactor operator into per-task codegen + text-generation [texera]

Reply via email to