anishshiva7 opened a new pull request, #5574:
URL: https://github.com/apache/texera/pull/5574
⚠️ This PR is stacked on `hf/04-audio-mediagen`. Until that lands, the diff
below may also include earlier HuggingFace task-family changes depending on
which base GitHub is showing. The new code in this PR is
`codegen/QaRankingCodegen.scala`, the QA/ranking-related additions to
`codegen/PythonCodegenBase.scala`, the new QA/ranking fields on
`HuggingFaceInferenceOpDesc.scala`, and the QA/ranking task tests in
`HuggingFaceInferenceOpDescSpec.scala`. Once PR 4 merges and this PR is
retargeted to `main`, the diff should auto-clean to the PR 5 QA/ranking changes
only.
What changes were proposed in this PR?
--------------------------------------
Adds the QA/ranking/classification task family (5 HF pipeline tasks) as a
new `TaskCodegen` plugged into the dispatcher established by the
text-generation PR:
- QA tasks: `question-answering`, `table-question-answering`
- Classification/ranking tasks: `zero-shot-classification`,
  `sentence-similarity`, `text-ranking`
`codegen/QaRankingCodegen.scala` supplies the per-task payload + parse
Python branches for all 5 tasks.
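
As a rough illustration of what a per-task payload branch can look like on the Python side, here is a minimal sketch. It is not the code emitted by `QaRankingCodegen.scala`: the `PAYLOAD_BUILDERS` table, `build_payload`, and the `fields` keys are hypothetical names, the payload shapes are assumed from the commonly documented Hugging Face inference request formats, and `text-ranking` reuses the `sentence-similarity` shape purely as a placeholder assumption.

```python
from typing import Any, Callable, Dict

Payload = Dict[str, Any]
Fields = Dict[str, Any]  # hypothetical: values already resolved from the row


def _question_answering(f: Fields) -> Payload:
    return {"inputs": {"question": f["prompt"], "context": f["context"]}}


def _table_question_answering(f: Fields) -> Payload:
    return {"inputs": {"query": f["prompt"], "table": f["table"]}}


def _zero_shot_classification(f: Fields) -> Payload:
    return {
        "inputs": f["prompt"],
        "parameters": {"candidate_labels": list(f["candidate_labels"])},
    }


def _sentence_similarity(f: Fields) -> Payload:
    return {
        "inputs": {
            "source_sentence": f["prompt"],
            "sentences": list(f["sentences"]),
        }
    }


# One branch per task; the dispatch table is an illustrative stand-in.
PAYLOAD_BUILDERS: Dict[str, Callable[[Fields], Payload]] = {
    "question-answering": _question_answering,
    "table-question-answering": _table_question_answering,
    "zero-shot-classification": _zero_shot_classification,
    "sentence-similarity": _sentence_similarity,
    # Assumption: placeholder shape, the real text-ranking payload may differ.
    "text-ranking": _sentence_similarity,
}


def build_payload(task: str, fields: Fields) -> Payload:
    """Pick the builder for `task` and fail loudly on unknown tasks."""
    try:
        builder = PAYLOAD_BUILDERS[task]
    except KeyError:
        raise ValueError(f"unsupported task: {task}") from None
    return builder(fields)
```

Keeping one small function per task keeps each payload shape reviewable on its own, which is the same motivation as splitting task families into separate codegen files.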
`CodegenContext` is extended with `contextColumn`, `candidateLabels`, and
`sentencesColumn` (`EncodableString`).
`HuggingFaceInferenceOpDesc.scala` gains 3 new `@JsonProperty` fields and
registers `QaRankingCodegen` in the dispatcher.
`PythonCodegenBase.scala` grows to host the shared QA/ranking infrastructure:
- Per-row validation for the new column-named fields.
- `question-answering` payload handling with prompt + context.
- `table-question-answering` payload handling with table data.
- `zero-shot-classification` payload handling with candidate labels.
- `sentence-similarity` and `text-ranking` payload handling with sentence
  inputs.
- Response parsing for QA/ranking outputs.
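
The per-row validation item in the list above could look roughly like the following sketch. Everything here is an assumption made for illustration: the `REQUIRED_COLUMN_FIELDS` mapping, the `validate_row` helper, and the snake_case field names are hypothetical, and `table-question-answering` is left out because this description does not say which field carries the table.

```python
from typing import Any, Mapping

# Assumed mapping from task to the column-named fields it needs.
REQUIRED_COLUMN_FIELDS = {
    "question-answering": ("context_column",),
    "sentence-similarity": ("sentences_column",),
    "text-ranking": ("sentences_column",),
}


def validate_row(task: str, row: Mapping[str, Any], config: Mapping[str, str]) -> None:
    """Raise ValueError if a column the task depends on is unusable for this row."""
    for field in REQUIRED_COLUMN_FIELDS.get(task, ()):
        column = config.get(field, "")
        if not column:
            raise ValueError(f"{task}: '{field}' is not configured")
        if column not in row:
            raise ValueError(f"{task}: column '{column}' is missing from the row")
        value = row[column]
        if value is None or (isinstance(value, str) and not value.strip()):
            raise ValueError(f"{task}: column '{column}' is empty for this row")
```

Validating before the payload is built means a bad row fails with a message naming the task and column, rather than with a `KeyError` deep inside payload construction.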
User-input strings continue to flow through `pyb"..."` + `EncodableString`,
so they reach Python as `self.decode_python_template('<base64>')` rather than
as raw literals. `PythonCodeRawInvalidTextSpec` still passes: 117/117
descriptors compile cleanly under `py_compile`.
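
To show why routing user strings through base64 matters, here is a small self-contained sketch of the idea. The two helpers are hypothetical stand-ins: `encode_user_string` plays the role of the Scala side rendering an `EncodableString`, and the free function `decode_python_template` stands in for the real `self.decode_python_template` method, whose implementation is not shown in this PR description.

```python
import base64


def encode_user_string(value: str) -> str:
    """Render a user string as a decode call instead of a raw Python literal."""
    token = base64.b64encode(value.encode("utf-8")).decode("ascii")
    # The base64 alphabet has no quotes, backslashes, or newlines, so the
    # token cannot terminate the surrounding string literal.
    return f"decode_python_template('{token}')"


def decode_python_template(token: str) -> str:
    """Stand-in for the runtime decoder: base64 text back to the original string."""
    return base64.b64decode(token).decode("utf-8")


# A value that would break out of a naively quoted literal.
hostile = "'); import os  # \"\"\"\nnot a comment"
rendered = encode_user_string(hostile)

# Evaluating the rendered expression yields the original text unchanged.
roundtrip = eval(rendered, {"decode_python_template": decode_python_template})
```

Because the generated source only ever contains the encoded token, quotes and newlines in user input cannot change the structure of the generated Python.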
Any related issues, documentation, or discussions?
--------------------------------------------------
Tracking issue: Add HuggingFace question answering and ranking tasks
[apache#5292](https://github.com/apache/texera/issues/5292)
Closes Add HuggingFace question answering and ranking tasks
[apache#5292](https://github.com/apache/texera/issues/5292)
Stacked on: PR 4 audio/media generation tasks / `hf/04-audio-mediagen`
Parent issue: Add Hugging Face inference operator
[apache#5041](https://github.com/apache/texera/issues/5041)
Closed sibling issue: Add HuggingFaceModelResource REST endpoints for HF
operator UI [apache#5134](https://github.com/apache/texera/issues/5134)
How was this PR tested?
-----------------------
`sbt "WorkflowOperator/compile; WorkflowOperator/Test/compile"` clean.
`sbt "WorkflowOperator/testOnly
org.apache.texera.amber.operator.huggingFace.HuggingFaceInferenceOpDescSpec
org.apache.texera.amber.util.PythonCodeRawInvalidTextSpec"`: 31 focused
tests pass, including HuggingFace QA/ranking task coverage and the raw Python
descriptor scan.
`sbt "WorkflowOperator / scalafmtCheck"` clean.
`sbt "WorkflowOperator / Test / scalafmtCheck"` clean.
`PythonCodeRawInvalidTextSpec`: 117/117 descriptors compile cleanly under
`py_compile` with the new operator code paths, with no marker leaks.
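
For readers unfamiliar with this kind of check, the sketch below shows one way to test whether a piece of generated source compiles using Python's standard `py_compile` module. It is only an illustration of the technique: `compiles_cleanly` is a hypothetical helper, and it does not reproduce how `PythonCodeRawInvalidTextSpec` drives the check from Scala.

```python
import os
import py_compile
import tempfile


def compiles_cleanly(source: str) -> bool:
    """Return True if `source` byte-compiles without a syntax error."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "generated_operator.py")
        with open(path, "w", encoding="utf-8") as fh:
            fh.write(source)
        try:
            # doraise=True turns compile failures into PyCompileError.
            py_compile.compile(
                path,
                cfile=os.path.join(tmp, "generated_operator.pyc"),
                doraise=True,
            )
        except py_compile.PyCompileError:
            return False
    return True
```

A check like this only catches syntax-level problems, so it complements the task-level tests rather than replacing them.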
Was this PR authored or co-authored using generative AI tooling?
----------------------------------------------------------------
Yes, co-authored with generative AI tooling (Codex).