This is an automated email from the ASF dual-hosted git repository.
github-merge-queue[bot] pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/texera.git
The following commit(s) were added to refs/heads/main by this push:
new 94da3d9387 feat(python-notebook-migration): add LLM client for
notebook-to-workflow conversion (#5260)
94da3d9387 is described below
commit 94da3d93875f63179fa0ae92d4936155dffba68c
Author: Ryan Zhang <[email protected]>
AuthorDate: Thu Jun 25 11:42:58 2026 -0700
feat(python-notebook-migration): add LLM client for notebook-to-workflow
conversion (#5260)
### What changes were proposed in this PR?
Introduces the frontend LLM session class that converts a Jupyter
notebook into a Texera workflow JSON plus a bidirectional cell-to-operator
mapping, along with the prompt library it uses. The change adds two files under
`frontend/src/app/workspace/service/notebook-migration/`, totalling ~700
lines (~410 of which are prompt text).
**`migration-llm.ts`** — defines `NotebookMigrationLLM`, an
`@Injectable` class wrapping a Vercel AI SDK chat session against the
LiteLLM proxy already exposed on `main` at `/api/chat/completion`.
- `initialize(modelType, accessToken)` — builds an OpenAI-compatible chat
client via `createOpenAI({ baseURL: AppSettings.getApiEndpoint() })` and
seeds the message history with Texera documentation as `system`
messages.
- `verifyConnection()` — does a 10-token `ping` call to validate that
the API key works against the configured model.
- `convertNotebookToWorkflow(notebook)` — extracts code cells (each
tagged with a UUID in `metadata.uuid`), sends `WORKFLOW_PROMPT` + the
notebook to get a JSON of UDF operators / edges, then sends
`MAPPING_PROMPT` to get the cell↔operator mapping. Assembles a complete
Texera workflow JSON (`PythonUDFV2` operators with stub input/output
ports, links derived from the LLM's edge list, default settings) plus a
bidirectional `operator_to_cell` / `cell_to_operator` mapping. Returns
both as a JSON string.
- `close()` — clears the message history and the model reference.
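
A minimal sketch of how the bidirectional mapping described above can be assembled from the LLM's UDF-to-cell answer. The helper name `buildBidirectionalMapping` and the `udfToOperatorId` lookup are illustrative assumptions, not identifiers from this PR; only the `operator_to_cell` / `cell_to_operator` keys come from the description.

```typescript
// Hypothetical helper for illustration; not the code shipped in this PR.
type Mapping = Record<string, string[]>;

interface CombinedMapping {
  operator_to_cell: Mapping;
  cell_to_operator: Mapping;
}

function buildBidirectionalMapping(
  udfToCells: Mapping, // LLM answer: UDF id -> notebook cell UUIDs
  udfToOperatorId: Record<string, string> // assumed lookup: UDF id -> operator id
): CombinedMapping {
  const operator_to_cell: Mapping = {};
  const cell_to_operator: Mapping = {};

  for (const [udf, cells] of Object.entries(udfToCells)) {
    const operatorId = udfToOperatorId[udf];
    // Ignore UDF ids that were never turned into an operator.
    if (operatorId === undefined) {
      continue;
    }
    operator_to_cell[operatorId] = cells;
    for (const cell of cells) {
      if (!cell_to_operator[cell]) {
        cell_to_operator[cell] = [];
      }
      cell_to_operator[cell].push(operatorId);
    }
  }

  return { operator_to_cell, cell_to_operator };
}
```

One cell may map to several operators and one operator to several cells, so both directions hold arrays rather than single ids.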
**`migration-prompts.ts`** — string constants used by
`migration-llm.ts`: `TEXERA_OVERVIEW`, `TUPLE_DOCUMENTATION`,
`TABLE_DOCUMENTATION`, `OPERATOR_DOCUMENTATION`,
`UDF_INPUT_PORT_DOCUMENTATION`, `EXAMPLE_OF_GOOD_CONVERSION`,
`VISUALIZER_DOCUMENTATION`, `EXAMPLE_OF_MULTIPLE_UDF_CONVERSION`,
`WORKFLOW_PROMPT`, `MAPPING_PROMPT`.
### Any related issues, documentation, discussions?
Closes #5259
Parent issue #4301
### How was this PR tested?
No unit tests were included for these reasons:
- A large portion of the changes is prompt text, which is not
testable, only readable. However, the prompt text can be changed to
improve the performance of the LLM.
- Testing would require mocking a significant amount of logic that will
be introduced in later PRs, since the logic in `migration-llm.ts` is
parsing a response.
However, I am open to writing tests based on review feedback.
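
For reference, the response-parsing step mentioned above can be exercised in isolation. The following is a rough sketch under stated assumptions: `extractJson` is a hypothetical standalone function, and the fence marker is built with `repeat` only to avoid writing three literal backticks inside this message.

```typescript
// Hypothetical standalone parser for illustration; names are assumptions.
const FENCE = "`".repeat(3);
const FENCED_BLOCK = new RegExp(FENCE + "(?:[a-zA-Z]+)?\\s*([\\s\\S]*?)" + FENCE);

function extractJson(raw: string): unknown {
  let text = raw.trim();
  const fenced = text.match(FENCED_BLOCK);
  if (fenced) {
    // Prefer the body of a fenced block, even when wrapped in prose.
    text = fenced[1].trim();
  } else {
    // Otherwise fall back to the outermost {...} span.
    const first = text.indexOf("{");
    const last = text.lastIndexOf("}");
    if (first !== -1 && last > first) {
      text = text.slice(first, last + 1);
    }
  }
  // JSON.parse throws a SyntaxError on malformed input.
  return JSON.parse(text);
}
```

Because the function takes a plain string, it needs no LLM transport or Angular service to be mocked.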
### Was this PR authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Claude Opus 4.7)
---------
Co-authored-by: Meng Wang <[email protected]>
---
frontend/package.json | 1 +
.../notebook-migration/migration-llm.spec.ts | 306 +++++++++++++++
.../service/notebook-migration/migration-llm.ts | 367 ++++++++++++++++++
.../notebook-migration/migration-prompts.ts | 414 +++++++++++++++++++++
frontend/yarn.lock | 13 +
5 files changed, 1101 insertions(+)
diff --git a/frontend/package.json b/frontend/package.json
index 78f2d10355..418b166ee8 100644
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -21,6 +21,7 @@
"private": true,
"dependencies": {
"@abacritt/angularx-social-login": "2.3.0",
+ "@ai-sdk/openai": "2.0.67",
"@ali-hm/angular-tree-component": "12.0.5",
"@angular/animations": "21.2.10",
"@angular/cdk": "21.2.8",
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts b/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts
new file mode 100644
index 0000000000..58c17cdfc3
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts
@@ -0,0 +1,306 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import { NotebookMigrationLLM, Notebook } from "./migration-llm";
+import { GuiConfigService } from "../../../common/service/gui-config.service";
+import { WorkflowUtilService } from "../workflow-graph/util/workflow-util.service";
+import { generateText } from "ai";
+import type { Mock } from "vitest";
+
+// The LLM transport and OpenAI client are mocked so the tests exercise only the
+// deterministic transformation (parsing, operator/edge construction, cell<->operator mapping).
+vi.mock("ai", () => ({ generateText: vi.fn() }));
+vi.mock("@ai-sdk/openai", () => ({
+ createOpenAI: vi.fn(() => ({ chat: vi.fn(() => ({})) })),
+}));
+
+const mockGenerateText = generateText as unknown as Mock;
+
+describe("NotebookMigrationLLM", () => {
+ let opIdCounter = 0;
+ let stubUtil: WorkflowUtilService;
+
+ // Build a fresh, initialized session with stubbed dependencies. The stubbed
+ // getNewOperatorPredicate hands out deterministic ids (PythonUDFV2-0, -1, ...).
+ function makeLLM(): NotebookMigrationLLM {
+ const stubConfig = {
+ env: {
+ pythonNotebookMigrationEnabled: true,
+ defaultDataTransferBatchSize: 400,
+ defaultExecutionMode: "PIPELINED",
+ },
+ } as unknown as GuiConfigService;
+
+ stubUtil = {
+ getNewOperatorPredicate: vi.fn((operatorType: string, customDisplayName?: string) => ({
+ operatorID: `${operatorType}-${opIdCounter++}`,
+ operatorType,
+ operatorVersion: "test-version",
+ operatorProperties: { workers: 1, defaultEnv: true, envName: "" },
+ inputPorts: [{ portID: "input-0", disallowMultiInputs: false }],
+ outputPorts: [{ portID: "output-0" }],
+ showAdvanced: false,
+ isDisabled: false,
+ customDisplayName,
+ dynamicInputPorts: true,
+ dynamicOutputPorts: true,
+ })),
+ } as unknown as WorkflowUtilService;
+
+ const llm = new NotebookMigrationLLM(stubConfig, stubUtil);
+ // Pass an explicit token so tests don't depend on AuthService/localStorage state.
+ llm.initialize("gpt-5-mini", "test-token");
+ return llm;
+ }
+
+ function codeCell(uuid: string | undefined, source: string) {
+ return { cell_type: "code", metadata: uuid === undefined ? {} : { uuid }, source };
+ }
+
+ // Queue the two responses convertNotebookToWorkflow consumes, in order.
+ function mockResponses(workflowResponse: string, mappingResponse: string) {
+ mockGenerateText.mockResolvedValueOnce({ text: workflowResponse }).mockResolvedValueOnce({ text: mappingResponse });
+ }
+
+ beforeEach(() => {
+ opIdCounter = 0;
+ mockGenerateText.mockReset();
+ });
+
+ describe("convertNotebookToWorkflow", () => {
+ it("builds operators, links, positions, and a bidirectional mapping", async () => {
+ const notebook: Notebook = {
+ cells: [codeCell("CELL1", "print(1)"), codeCell("CELL2", "print(2)")],
+ };
+ mockResponses(
+ JSON.stringify({
+ code: { UDF1: "code1", UDF2: "code2" },
+ edges: [["UDF1", "UDF2"]],
+ outputs: { UDF1: ["a", "b"], UDF2: ["c"] },
+ }),
+ JSON.stringify({ UDF1: ["CELL1"], UDF2: ["CELL2"] })
+ );
+
+ const { workflowJSON, workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ expect(workflowJSON.operators.map((op: any) => op.operatorID)).toEqual(["PythonUDFV2-0", "PythonUDFV2-1"]);
+ expect(workflowJSON.operators[0].operatorProperties).toMatchObject({
+ code: "code1",
+ retainInputColumns: false,
+ });
+ expect(workflowJSON.operatorPositions).toEqual({
+ "PythonUDFV2-0": { x: 140, y: 0 },
+ "PythonUDFV2-1": { x: 280, y: 0 },
+ });
+ expect(workflowJSON.links).toHaveLength(1);
+ expect(workflowJSON.links[0].source).toEqual({ operatorID: "PythonUDFV2-0", portID: "output-0" });
+ expect(workflowJSON.links[0].target).toEqual({ operatorID: "PythonUDFV2-1", portID: "input-0" });
+ expect(workflowNotebookMapping.operator_to_cell).toEqual({
+ "PythonUDFV2-0": ["CELL1"],
+ "PythonUDFV2-1": ["CELL2"],
+ });
+ expect(workflowNotebookMapping.cell_to_operator).toEqual({
+ CELL1: ["PythonUDFV2-0"],
+ CELL2: ["PythonUDFV2-1"],
+ });
+ // Settings come from GUI config defaults, not hardcoded values.
+ expect(workflowJSON.settings).toEqual({ dataTransferBatchSize: 400, executionMode: "PIPELINED" });
+ });
+
+ // Intermediate UDFs (a source of some edge) keep "binary" for object passing; terminal
+ // UDFs (no outgoing edge) default to "string" so the result panel renders typed values.
+ it("types intermediate UDF outputs as binary and terminal UDF outputs as string", async () => {
+ const notebook: Notebook = { cells: [codeCell("CELL1", "a"), codeCell("CELL2", "b")] };
+ mockResponses(
+ JSON.stringify({
+ code: { UDF1: "code1", UDF2: "code2" },
+ edges: [["UDF1", "UDF2"]],
+ outputs: { UDF1: ["x"], UDF2: ["y"] },
+ }),
+ JSON.stringify({ UDF1: ["CELL1"], UDF2: ["CELL2"] })
+ );
+
+ const { workflowJSON } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ // UDF1 is a source (intermediate) -> binary; UDF2 is terminal -> string.
+ expect(workflowJSON.operators[0].operatorProperties.outputColumns).toEqual([
+ { attributeName: "x", attributeType: "binary" },
+ ]);
+ expect(workflowJSON.operators[1].operatorProperties.outputColumns).toEqual([
+ { attributeName: "y", attributeType: "string" },
+ ]);
+ });
+
+ it("maps multiple cells onto the same UDF, and one cell onto multiple UDFs", async () => {
+ const notebook: Notebook = {
+ cells: [codeCell("CELL1", "a"), codeCell("CELL2", "b")],
+ };
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "c1", UDF2: "c2" }, edges: [], outputs: {} }),
+ JSON.stringify({ UDF1: ["CELL1", "CELL2"], UDF2: ["CELL1"] })
+ );
+
+ const { workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ expect(workflowNotebookMapping.operator_to_cell).toEqual({
+ "PythonUDFV2-0": ["CELL1", "CELL2"],
+ "PythonUDFV2-1": ["CELL1"],
+ });
+ expect(workflowNotebookMapping.cell_to_operator).toEqual({
+ CELL1: ["PythonUDFV2-0", "PythonUDFV2-1"],
+ CELL2: ["PythonUDFV2-0"],
+ });
+ });
+
+ it("skips (with a warning) an edge that references an unknown UDF id", async () => {
+ const warn = vi.spyOn(console, "warn").mockImplementation(() => {});
+ const notebook: Notebook = { cells: [codeCell("CELL1", "a")] };
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "c1" }, edges: [["UDF1", "UDFX"]], outputs: {} }),
+ JSON.stringify({ UDF1: ["CELL1"] })
+ );
+
+ const { workflowJSON } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ // The dangling edge is dropped rather than producing an undefined endpoint.
+ expect(workflowJSON.links).toEqual([]);
+ expect(warn).toHaveBeenCalledWith(expect.stringContaining("UDFX"));
+ warn.mockRestore();
+ });
+
+ it("skips (with a warning) a mapping entry that references an unknown UDF id", async () => {
+ const warn = vi.spyOn(console, "warn").mockImplementation(() => {});
+ const notebook: Notebook = { cells: [codeCell("CELL1", "a")] };
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "c1" }, edges: [], outputs: {} }),
+ JSON.stringify({ UDF1: ["CELL1"], UDFTYPO: ["CELL1"] })
+ );
+
+ const { workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ // Only the valid UDF id survives in the mapping.
+ expect(workflowNotebookMapping.operator_to_cell).toEqual({ "PythonUDFV2-0": ["CELL1"] });
+ expect(workflowNotebookMapping.cell_to_operator).toEqual({ CELL1: ["PythonUDFV2-0"] });
+ expect(warn).toHaveBeenCalledWith(expect.stringContaining("UDFTYPO"));
+ warn.mockRestore();
+ });
+
+ it("handles empty code, edges, and outputs", async () => {
+ const notebook: Notebook = { cells: [] };
+ mockResponses(JSON.stringify({ code: {}, edges: [], outputs: {} }), JSON.stringify({}));
+
+ const { workflowJSON, workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+ expect(workflowJSON.operators).toEqual([]);
+ expect(workflowJSON.links).toEqual([]);
+ expect(workflowNotebookMapping.operator_to_cell).toEqual({});
+ expect(workflowNotebookMapping.cell_to_operator).toEqual({});
+ });
+
+ it("rejects when a code cell is missing metadata.uuid", async () => {
+ const notebook: Notebook = { cells: [codeCell(undefined, "print(1)")] };
+
+ await expect(makeLLM().convertNotebookToWorkflow(notebook)).rejects.toThrow(/metadata\.uuid/);
+ // It fails before prompting, so the LLM is never called.
+ expect(mockGenerateText).not.toHaveBeenCalled();
+ });
+
+ it("joins array-form cell source (nbformat lines) without inserting commas", async () => {
+ const notebook: Notebook = {
+ cells: [
+ {
+ cell_type: "code",
+ metadata: { uuid: "CELL1" },
+ source: ["import pandas as pd\n", "x = 1\n"],
+ },
+ ],
+ };
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "c1" }, edges: [], outputs: {} }),
+ JSON.stringify({ UDF1: ["CELL1"] })
+ );
+
+ await makeLLM().convertNotebookToWorkflow(notebook);
+
+ const allPromptContent = mockGenerateText.mock.calls
+ .flatMap(call => call[0].messages.map((m: any) => m.content))
+ .join("\n");
+ expect(allPromptContent).toContain("import pandas as pd\nx = 1\n");
+ expect(allPromptContent).not.toContain("import pandas as pd\n,");
+ });
+
+ it("resets conversation history between conversions so a prior notebook does not leak", async () => {
+ const llm = makeLLM();
+
+ // First conversion (notebook AAA) on the instance.
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "codeAAA" }, edges: [], outputs: {} }),
+ JSON.stringify({ UDF1: ["AAA"] })
+ );
+ await llm.convertNotebookToWorkflow({ cells: [codeCell("AAA", "a = 1")] });
+
+ // Second conversion (notebook BBB) on the SAME instance, no close()/initialize() between.
+ mockResponses(
+ JSON.stringify({ code: { UDF1: "codeBBB" }, edges: [], outputs: {} }),
+ JSON.stringify({ UDF1: ["BBB"] })
+ );
+ await llm.convertNotebookToWorkflow({ cells: [codeCell("BBB", "b = 2")] });
+
+ // The 3rd generateText call is the workflow prompt of the second conversion.
+ const secondConversionMessages = mockGenerateText.mock.calls[2][0].messages.map((m: any) => m.content).join("\n");
+
+ expect(secondConversionMessages).toContain("# START BBB");
+ expect(secondConversionMessages).not.toContain("AAA");
+ expect(secondConversionMessages).not.toContain("codeAAA");
+ });
+ });
+
+ describe("parseJsonResponse", () => {
+ // parseJsonResponse is private; cast to access it directly for focused coverage.
+ const parse = (raw: string) => (makeLLM() as any).parseJsonResponse(raw, "workflow");
+
+ it("parses bare JSON", () => {
+ expect(parse('{"a":1}')).toEqual({ a: 1 });
+ });
+
+ it("strips a ```json fence", () => {
+ expect(parse('```json\n{"a":1}\n```')).toEqual({ a: 1 });
+ });
+
+ it("strips a plain ``` fence", () => {
+ expect(parse('```\n{"a":1}\n```')).toEqual({ a: 1 });
+ });
+
+ it("tolerates surrounding whitespace and newlines around the fence", () => {
+ expect(parse('\n\n ```json\n{"a":1}\n``` \n\n')).toEqual({ a: 1 });
+ });
+
+ it("throws a contextual error on malformed JSON", () => {
+ expect(() => parse("not json")).toThrow("Failed to parse LLM workflow response as JSON");
+ });
+
+ it("extracts fenced JSON even when surrounded by prose", () => {
+ expect(parse('Here is the JSON: ```json\n{"a":1}\n```\nThanks!')).toEqual({ a: 1 });
+ });
+
+ it("extracts the outermost object from fence-less prose", () => {
+ expect(parse('Sure! {"a":1} hope that helps')).toEqual({ a: 1 });
+ });
+ });
+});
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts b/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts
new file mode 100644
index 0000000000..2922c3ee0e
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts
@@ -0,0 +1,367 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import { Injectable } from "@angular/core";
+import { GuiConfigService } from "../../../common/service/gui-config.service";
+import { AuthService } from "../../../common/service/user/auth.service";
+import { createOpenAI } from "@ai-sdk/openai";
+import { generateText, type ModelMessage } from "ai";
+import { AppSettings } from "../../../common/app-setting";
+import { v4 as uuidv4 } from "uuid";
+import { WorkflowUtilService } from "../workflow-graph/util/workflow-util.service";
+import { OperatorPredicate } from "../../types/workflow-common.interface";
+import { WorkflowSettings } from "../../../common/type/workflow";
+import {
+ TEXERA_OVERVIEW,
+ TUPLE_DOCUMENTATION,
+ TABLE_DOCUMENTATION,
+ OPERATOR_DOCUMENTATION,
+ UDF_INPUT_PORT_DOCUMENTATION,
+ EXAMPLE_OF_GOOD_CONVERSION,
+ VISUALIZER_DOCUMENTATION,
+ EXAMPLE_OF_MULTIPLE_UDF_CONVERSION,
+ WORKFLOW_PROMPT,
+ MAPPING_PROMPT,
+} from "./migration-prompts";
+
+interface Cell {
+ cell_type: string;
+ metadata: { [key: string]: any };
+ // nbformat stores source as either a single string or an array of line strings.
+ source: string | string[];
+}
+
+export interface Notebook {
+ cells: Cell[];
+}
+
+interface WorkflowJSON {
+ operators: OperatorPredicate[];
+ operatorPositions: Record<string, { x: number; y: number }>;
+ links: any[];
+ commentBoxes: any[];
+ settings: WorkflowSettings;
+}
+
+interface CombinedMapping {
+ operator_to_cell: Record<string, string[]>;
+ cell_to_operator: Record<string, string[]>;
+}
+
+/**
+ * Wraps a single LLM chat session that converts a Jupyter notebook into a Texera
+ * workflow plus a cell<->operator mapping.
+ *
+ * Lifecycle: `initialize()` -> `verifyConnection()` (optional) ->
+ * `convertNotebookToWorkflow()` -> `close()`. The session keeps a running `messages`
+ * history shared by the prompts within one conversion. `convertNotebookToWorkflow()`
+ * resets that history to the documentation prelude at its start, so the same instance
+ * can convert multiple notebooks without leaking one conversion's context into the next.
+ *
+ * Output column types: intermediate UDFs declare their output columns as `binary` so rich
+ * Python objects (DataFrames, arrays, models) round-trip between operators via pickle.
+ * Terminal UDFs (no outgoing edge) declare their outputs as `string` so the result panel
+ * renders viewable values rather than opaque binary blobs.
+ */
+@Injectable()
+export class NotebookMigrationLLM {
+ private model: any;
+ private messages: ModelMessage[] = [];
+ private initialized = false;
+
+ private static readonly DOCUMENTATION: string[] = [
+ TEXERA_OVERVIEW,
+ TUPLE_DOCUMENTATION,
+ TABLE_DOCUMENTATION,
+ OPERATOR_DOCUMENTATION,
+ EXAMPLE_OF_GOOD_CONVERSION,
+ VISUALIZER_DOCUMENTATION,
+ UDF_INPUT_PORT_DOCUMENTATION,
+ EXAMPLE_OF_MULTIPLE_UDF_CONVERSION,
+ ];
+
+ constructor(
+ private config: GuiConfigService,
+ private workflowUtilService: WorkflowUtilService
+ ) {}
+
+ private get enabled(): boolean {
+ return this.config.env.pythonNotebookMigrationEnabled;
+ }
+
+ private assertEnabled(): void {
+ if (!this.enabled) {
+ throw new Error("Notebook migration feature is disabled");
+ }
+ }
+
+ /**
+ * Seed the conversation with the Texera documentation prelude, discarding any
+ * prior conversation. Used by initialize() and at the start of each conversion.
+ */
+ private seedDocumentation(): void {
+ this.messages = NotebookMigrationLLM.DOCUMENTATION.map(
+ (doc): ModelMessage => ({
+ role: "system",
+ content: doc,
+ })
+ );
+ }
+
+ private parseJsonResponse(raw: string, context: string): any {
+ let text = raw.trim();
+
+ // Prefer the contents of a fenced code block if present (```json ... ``` or ``` ... ```),
+ // even when wrapped in prose. Otherwise fall back to the outermost {...} object.
+ const fenced = text.match(/```(?:[a-zA-Z]+)?\s*([\s\S]*?)```/);
+ if (fenced) {
+ text = fenced[1].trim();
+ } else {
+ const firstBrace = text.indexOf("{");
+ const lastBrace = text.lastIndexOf("}");
+ if (firstBrace !== -1 && lastBrace > firstBrace) {
+ text = text.slice(firstBrace, lastBrace + 1);
+ }
+ }
+
+ try {
+ return JSON.parse(text);
+ } catch (err) {
+ throw new Error(`Failed to parse LLM ${context} response as JSON: ${(err as Error).message}`);
+ }
+ }
+
+ /**
+ * Initialize a new LLM session with Texera documentation
+ */
+ public initialize(modelType: string = "gpt-5-mini", accessToken: string = AuthService.getAccessToken() ?? ""): void {
+ this.assertEnabled();
+ this.model = createOpenAI({
+ baseURL: new URL(`${AppSettings.getApiEndpoint()}`, document.baseURI).toString(),
+ // The /api/chat/* LiteLLM proxy authenticates the caller with the Texera JWT. The AI SDK
+ // sends this value verbatim as `Authorization: Bearer <token>`, so we pass the user's
+ // access token; the backend validates it, then substitutes the LiteLLM master key upstream.
+ apiKey: accessToken,
+ }).chat(modelType);
+
+ this.seedDocumentation();
+
+ this.initialized = true;
+ }
+
+ /**
+ * Verify the connection to the LLM using the current access token
+ */
+ public async verifyConnection(): Promise<boolean> {
+ if (!this.enabled) return false;
+ if (!this.initialized) {
+ throw new Error("LLM session not initialized");
+ }
+
+ try {
+ await generateText({
+ model: this.model,
+ messages: [
+ {
+ role: "user",
+ content: "ping",
+ },
+ ],
+ maxOutputTokens: 10,
+ });
+
+ return true;
+ } catch (err) {
+ console.error("API key verification failed:", err);
+ return false;
+ }
+ }
+
+ /**
+ * Send a prompt and receive a response.
+ * All prior documentation and conversation is preserved.
+ */
+ private async sendPrompt(prompt: string): Promise<string> {
+ if (!this.initialized) {
+ throw new Error("LLM session not initialized");
+ }
+
+ this.messages.push({
+ role: "user",
+ content: prompt,
+ });
+
+ const result = await generateText({
+ model: this.model,
+ messages: this.messages,
+ });
+
+ this.messages.push({
+ role: "assistant",
+ content: result.text,
+ });
+
+ return result.text;
+ }
+
+ /**
+ * Send a Jupyter Notebook to be converted into a workflow and mapping.
+ */
+ public async convertNotebookToWorkflow(notebook: Notebook): Promise<string> {
+ this.assertEnabled();
+ if (!this.initialized) {
+ throw new Error("LLM session not initialized");
+ }
+
+ // Reset to the documentation prelude so a prior conversion's prompts/responses
+ // don't leak into this one. The two sendPrompt calls below still share history.
+ this.seedDocumentation();
+
+ const codeCells = notebook.cells.filter(cell => cell.cell_type === "code");
+
+ // Every code cell must carry a unique metadata.uuid; it is the join key for the
+ // cell<->operator mapping. Without it, untagged cells collide on the "undefined" marker.
+ const untagged = codeCells.find(cell => cell.metadata?.uuid == null || String(cell.metadata.uuid).trim() === "");
+ if (untagged) {
+ throw new Error("Notebook code cells must each have a metadata.uuid before conversion");
+ }
+
+ const notebookString = codeCells
+ .map(cell => {
+ const uuid = String(cell.metadata.uuid);
+ // nbformat line arrays already include trailing newlines, so join with "".
+ const source = Array.isArray(cell.source) ? cell.source.join("") : cell.source;
+ return `# START ${uuid}\n${source}\n# END ${uuid}`;
+ })
+ .join("\n\n");
+
+ const workflow = await this.sendPrompt(`${WORKFLOW_PROMPT}\n${notebookString}`);
+ const mapping = await this.sendPrompt(MAPPING_PROMPT);
+
+ // Remove ```json blocks and parse
+ const udfLLMResponse = this.parseJsonResponse(workflow, "workflow");
+
+ const workflowJSON: WorkflowJSON = {
+ operators: [],
+ operatorPositions: {},
+ links: [],
+ commentBoxes: [],
+ settings: {
+ dataTransferBatchSize: this.config.env.defaultDataTransferBatchSize,
+ executionMode: this.config.env.defaultExecutionMode,
+ },
+ };
+
+ const udfMappingToUUID: Record<string, string> = {};
+
+ // UDFs that are never the source of an edge are terminal (result-facing). Their outputs
+ // default to "string" so the result panel renders typed values; intermediate UDFs keep
+ // "binary" so rich objects (DataFrames, arrays, models) round-trip between operators via pickle.
+ const edgeSources = new Set<string>((udfLLMResponse.edges || []).map(([source]: [string, string]) => source));
+
+ Object.entries(udfLLMResponse.code).forEach(([udfId, udfCode], i) => {
+ let udfOutputColumns: { attributeName: string; attributeType: string }[] = [];
+ if (udfLLMResponse.outputs && udfLLMResponse.outputs[udfId]) {
+ const attributeType = edgeSources.has(udfId) ? "binary" : "string";
+ udfOutputColumns = udfLLMResponse.outputs[udfId].map((attr: string) => ({
+ attributeName: attr,
+ attributeType,
+ }));
+ }
+
+ // Build the operator from the live PythonUDFV2 schema so the operatorVersion, ports, and
+ // property defaults track the backend definition, then overlay the generated code/outputs.
+ const base = this.workflowUtilService.getNewOperatorPredicate("PythonUDFV2", udfId);
+ const operator: OperatorPredicate = {
+ ...base,
+ operatorProperties: {
+ ...base.operatorProperties,
+ code: udfCode,
+ retainInputColumns: false,
+ outputColumns: udfOutputColumns,
+ },
+ };
+
+ udfMappingToUUID[udfId] = operator.operatorID;
+ workflowJSON.operators.push(operator);
+ workflowJSON.operatorPositions[operator.operatorID] = { x: 140 * (i + 1), y: 0 };
+ });
+
+ const knownUdfIds = new Set(Object.keys(udfMappingToUUID));
+
+ // Add links/edges. Skip (with a warning) any edge that references a UDF id the LLM
+ // never defined in `code`, rather than emitting a link with an undefined endpoint.
+ (udfLLMResponse.edges || []).forEach(([source, target]: [string, string]) => {
+ if (!knownUdfIds.has(source) || !knownUdfIds.has(target)) {
+ console.warn(`Skipping edge with unknown UDF id: ${source} -> ${target}`);
+ return;
+ }
+ workflowJSON.links.push({
+ linkID: `link-${uuidv4()}`,
+ source: {
+ operatorID: udfMappingToUUID[source],
+ portID: "output-0",
+ },
+ target: {
+ operatorID: udfMappingToUUID[target],
+ portID: "input-0",
+ },
+ });
+ });
+
+ // Parse mapping
+ const parsedMapping: Record<string, string[]> = this.parseJsonResponse(mapping, "mapping");
+
+ const udfToCell: Record<string, string[]> = {};
+ const cellToUdf: Record<string, string[]> = {};
+
+ Object.entries(parsedMapping).forEach(([udf, cells]) => {
+ if (!knownUdfIds.has(udf)) {
+ console.warn(`Skipping mapping entry with unknown UDF id: ${udf}`);
+ return;
+ }
+ const udfUUID = udfMappingToUUID[udf];
+ udfToCell[udfUUID] = cells;
+ cells.forEach(cell => {
+ if (!cellToUdf[cell]) {
+ cellToUdf[cell] = [udfUUID];
+ } else {
+ cellToUdf[cell].push(udfUUID);
+ }
+ });
+ });
+
+ const workflowNotebookMapping: CombinedMapping = {
+ operator_to_cell: udfToCell,
+ cell_to_operator: cellToUdf,
+ };
+
+ return JSON.stringify({ workflowJSON, workflowNotebookMapping });
+ }
+
+ /**
+ * Closes the session.
+ * Clears all context and releases references.
+ */
+ public close(): void {
+ this.messages = [];
+ this.model = null;
+ this.initialized = false;
+ }
+}
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts b/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts
new file mode 100644
index 0000000000..2594d2d58e
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts
@@ -0,0 +1,414 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied. See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+// TEXERA DOCUMENTATION
+
+// https://github.com/apache/texera/wiki/Guide-to-Use-a-Python-UDF
+export const TEXERA_OVERVIEW = `
+You are a robust compiler that takes python code and translates it to our personal workflow environment Texera that uses python.
+
+ Texera is a data analytics tool that uses workflows to do machine learning and data analytics computation. Users are able to drag and drop operators and connect their inputs and outputs in a workflow graphical user interface, which the code we are going to create.
+
+Texera is able to use Python user defined functions. Documentation of a Python UDF in Texera follows:
+ Process Data APIs
+
+There are three APIs to process the data in different units.
+
+ Tuple API.
+
+ class ProcessTupleOperator(UDFOperatorV2):
+
+def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+yield tuple_
+
+Tuple API takes one input tuple from a port at a time. It returns an iterator of optional TupleLike instances. A TupleLike is any data structure that supports key-value pairs, such as pytexera.Tuple, dict, defaultdict, NamedTuple, etc.
+
+ Tuple API is useful for implementing functional operations which are applied to tuples one by one, such as map, reduce, and filter.
+
+ Table API.
+
+ class ProcessTableOperator(UDFTableOperator):
+
+def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+yield table
+
+Table API consumes a Table at a time, which consists of all the tuples from a port. It returns an iterator of optional TableLike instances. A TableLike is a collection of TupleLike, and currently, we support pytexera.Table and pandas.DataFrame as a TableLike instance. More flexible types will be supported down the road.
+
+ Table API is useful for implementing blocking operations that will consume all the data from one port, such as join, sort, and machine learning training.
+
+ Batch API.
+
+ class ProcessBatchOperator(UDFBatchOperator):
+
+BATCH_SIZE = 10
+
+def process_batch(self, batch: Batch, port: int) ->
Iterator[Optional[BatchLike]]:
+yield batch
+
+Batch API consumes a batch of tuples at a time. Similar to Table, a Batch is
also a collection of Tuples; however, its size is defined by the BATCH_SIZE,
and one port can have multiple batches. It returns an iterator of optional
BatchLike instances. A BatchLike is a collection of TupleLike, and currently,
we support pytexera.Batch and pandas.DataFrame as a BatchLike instance. More
flexible types will be supported down the road.
+
+ The Batch API serves as a hybrid API combining the features of both the
Tuple and Table APIs. It is particularly valuable for striking a balance
between time and space considerations, offering a trade-off that optimizes
efficiency.
+
+ All three APIs can return an empty iterator by yielding None.
+
+ The template code for a Python UDF follows: MAKE SURE TO USE THE CLASS NAMES
AND FUNCTIONS DEFINED, THIS IS A MUST FOR THE PROGRAM TO WORK. SELECT 1 OUT OF
THE 3 PROCESSING OPERATOR FUNCTIONS TO BUILD DEPENDING ON THE CONTEXT OF CODE
TRANSLATION.
+# Choose from the following templates:
+#
+# from pytexera import *
+#
+# class ProcessTupleOperator(UDFOperatorV2):
+#
+#     @overrides
+#     def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+#         yield tuple_
+#
+# class ProcessBatchOperator(UDFBatchOperator):
+#     BATCH_SIZE = 10 # must be a positive integer
+#
+#     @overrides
+#     def process_batch(self, batch: Batch, port: int) -> Iterator[Optional[BatchLike]]:
+#         yield batch
+#
+# class ProcessTableOperator(UDFTableOperator):
+#
+#     @overrides
+#     def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+#         yield table
+`;
+
+//
https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/tuple.py
+export const TUPLE_DOCUMENTATION = `
+### **<code>Tuple</code> Class Overview**
+
+The \`Tuple\` class is a **lazy-evaluated** data structure designed for
efficient field storage and access. It provides:
+
+ 1. **Support for Multiple Data Sources**:
+* Can be initialized from a \`TupleLike\` object, such as a \`pandas.Series\`,
\`OrderedDict\`, or another \`Tuple\` instance.
+* Works with \`ArrowTableTupleProvider\` to access \`pyarrow.Table\` data.
+2. **Lazy Field Evaluation**:
+* Field values can be either **directly stored values** or **lazy accessors**
(\`field_accessor\`).
+* If a field is accessed and is an accessor, it is evaluated and cached.
+3. **Schema (<code>Schema</code>) Enforcement**:
+ * A \`Tuple\` can be created without a schema but can be **finalized** with
one using \`finalize(schema)\`, which:
+* **Casts field values** (e.g., \`NaN → None\`, \`Object → Bytes\`).
+* **Validates field completeness**, ensuring all fields match the \`Schema\`.
+4. **Pythonic Access Patterns**:
+* **Index-based access**: \`tuple["field_name"]\` or \`tuple[index]\`
retrieves field values.
+* **Dictionary-like operations**: \`tuple.as_dict()\` returns an
\`OrderedDict\`, and \`tuple.as_series()\` converts to a \`pandas.Series\`.
+* **Iterable support**: \`for field in tuple\` iterates over field values.
+5. **Hashing and Comparisons**:
+* Implements \`__hash__\` using a Java-like hashing algorithm, allowing usage
as dictionary keys.
+* Implements \`__eq__\`, supporting equality checks based on field contents.
+6. **Partial Data Extraction**:
+* \`tuple.get_partial_tuple(attribute_names)\` returns a new \`Tuple\`
instance containing only the specified fields.
+`;
+
+//
https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/table.py
+export const TABLE_DOCUMENTATION = `### **<code>Table</code> Class Overview**
+
+The \`Table\` class extends \`pandas.DataFrame\`, providing **structured
Tuple-based data management**. It is designed to integrate seamlessly with
\`Tuple\` objects.
+
+#### **Key Features:**
+
+1. **Flexible Construction:**
+* Can be initialized from various sources:
+* Another \`Table\` (\`from_table(table)\`)
+* A \`pandas.DataFrame\` (\`from_data_frame(df)\`)
+* A list/iterator of \`TupleLike\` objects (\`from_tuple_likes(tuple_likes)\`)
+* Ensures all \`Tuple\` objects in a \`Table\` have **consistent field names**.
+2. **Tuple Conversion:**
+* \`as_tuples()\`: Converts the table rows into an **iterator of
<code>Tuple</code> instances**, preserving the row order.
+3. **Equality Comparison (<code>__eq__</code>):**
+* Supports **row-wise equality checks** by comparing the underlying \`Tuple\`
objects.
+4. **Universal Tuple Output (<code>all_output_to_tuple</code>):**
+* A helper function to convert **various data types** into \`Tuple\`
iterators, supporting:
+* \`None\` → \`[None]\`
+* \`Table\` → \`as_tuples()\`
+* \`pandas.DataFrame\` → Converted into a \`Table\`, then to Tuples
+* \`List[TupleLike]\` → Converted to \`Tuple\` instances
+* A single \`TupleLike\` or \`Tuple\` → Wrapped in an iterator
+
+#### **Relation to <code>Tuple</code>:**
+
+* \`Table\` **stores multiple <code>Tuple</code> objects** and ensures schema
consistency across rows.
+* Provides an **efficient bridge** between \`Tuple\`-based data and
\`pandas.DataFrame\`, enabling compatibility with Python's data analysis tools.
+`;
+
+//
https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/operator.py
+export const OPERATOR_DOCUMENTATION = `### **Operator Class Overview**
+
+The \`Operator\` class is an **abstract base class (ABC)** for all operators,
defining the fundamental structure for processing \`Tuple\`, \`Batch\`, and
\`Table\` data in a workflow.
+
+#### **Key Features & Hierarchy**
+
+1. **Base <code>Operator</code> Class**:
+* Defines lifecycle methods: \`open()\` and \`close()\`.
+* Supports a **source flag (<code>is_source</code>)** to distinguish source
operators from others.
+2. **Tuple-Based Processing (<code>TupleOperatorV2</code>)**:
+* Processes individual \`Tuple\` objects through \`process_tuple(tuple_,
port)\`.
+* Calls \`on_finish(port)\` when an input port is exhausted.
+3. **Types of Operators**:
+* **SourceOperator**:
+* Produces data via \`produce()\`, yielding \`TupleLike\` or \`TableLike\`
objects.
+* Overrides \`on_finish(port)\` to output produced data.
+* **BatchOperator**:
+* Collects tuples into batches (\`BATCH_SIZE\`) before processing via
\`process_batch(batch, port)\`.
+* Converts processed batches (typically \`pandas.DataFrame\`) into \`Tuple\`
output.
+* **TableOperator**:
+* Collects tuples into a \`Table\` before processing via
\`process_table(table, port)\`.
+* Converts processed \`Table\` output back into tuples.
+4. **Data Flow & Processing**:
+* Operators receive data **tuple-by-tuple**, **batch-by-batch**, or
**table-by-table** depending on the type.
+* Results are **iterators** of transformed data (\`TupleLike\`, \`BatchLike\`,
or \`TableLike\`).
+5. **Deprecated <code>TupleOperator</code>**:
+* The older version of \`TupleOperator\` is deprecated in favor of
\`TupleOperatorV2\`.
+
+#### Relation to <code>Tuple</code> and <code>Table</code>
+
+* Operators **consume and transform** \`Tuple\` and \`Table\` data within a
workflow.
+* **Tuple-based operators** process row-wise, while **Table operators** handle
structured table transformations.
+* **Source operators** initiate the data flow by generating tuples or tables.`;
+
+export const UDF_INPUT_PORT_DOCUMENTATION = `
+Python UDF operators support multiple input and output ports, allowing a
single operator to receive different types of data from various upstream
operators. In the process_tuple(self, tuple_: Tuple, port: int) function in
ProcessTupleOperator and the process_table(self, table: Table, port: int)
function in ProcessTableOperator, the port parameter indicates the input port.
The port numbers are assigned in order, starting from 0 to N, from top to
bottom. When input data have different sche [...]
+
+Using this knowledge, for situations where multiple upstream UDFs act as input
to a single UDF, we can introduce an intermediary UDF that collects all of the
input data and reformats it into a single table, which is then passed as input
to the original next downstream UDF. When it is necessary for this to occur in
your translation from notebook to UDFs, include the intermediary UDF and make
sure that it and the next operator that uses its output is formatted correctly
and handles the dat [...]
+`;
+
+export const EXAMPLE_OF_GOOD_CONVERSION = `
+Here is an example of Python code translated into a compatible Texera UDF whose
output abides by the output schema expected by Texera workflow operators for
tuples. Other operators do not always follow this strict format, but the
structure of the yielded output is important.
+
+Python code (high-level idea): given some input data, limit the number of
tuples that are output.
+
+Texera Operator code:
+from pytexera import *
+
+class ProcessTupleOperator(UDFOperatorV2):
+    def __init__(self):
+        self.limit = 10
+        self.count = 0
+
+    @overrides
+    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+        if self.count < self.limit:
+            self.count += 1
+            yield tuple_
+
+`;
+
+export const VISUALIZER_DOCUMENTATION = `
+Texera requires a specific way of generating visualizations from ML libraries:
+1. Ensure one yield per operator (per Texera’s UDF constraints).
+2. Use Plotly for visualization and output results as embeddable HTML.
+3. Build in error handling to notify users when data is missing.
+`;
+
+export const EXAMPLE_OF_MULTIPLE_UDF_CONVERSION = `
+Here is an example of breaking up Python code into multiple Texera UDFs.
Format your response structure exactly like the given example. The "code" key
contains a dictionary of the UDF IDs with their respective code. The "edges"
key contains a list of pairs that represent the connections between UDFs. The
"outputs" key contains a dictionary of the UDF IDs with a list of the output
column names of the DataFrame that the UDF yields. The UDFs can branch and
merge, it does not have to be a l [...]
+
+Original Code:
+\`\`\`python
+# START CELL1
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.svm import SVC
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+# END CELL1
+
+# START CELL2
+# Load the dataset
+file_path = 'diabetes.csv'
+data = pd.read_csv(file_path)
+# END CELL2
+
+# START CELL3
+# Remove duplicate rows
+data = data.drop_duplicates()
+
+# Remove rows with null values
+data = data.dropna()
+# END CELL3
+
+# START CELL4
+# Print the minimum, maximum, and mean for all fields
+print("Minimum values:\n", data.min())
+print("\nMaximum values:\n", data.max())
+print("\nMean values:\n", data.mean())
+# END CELL4
+
+# START CELL5
+# Create a boxplot for the 'Pregnancies' field
+plt.figure(figsize=(8, 6))
+plt.boxplot(data['Pregnancies'], vert=False, patch_artist=True)
+plt.title('Boxplot of Pregnancies')
+plt.xlabel('Number of Pregnancies')
+plt.show()
+# END CELL5
+
+# START CELL6
+# Separate features and target variable
+X = data.drop('Outcome', axis=1)
+y = data['Outcome']
+# END CELL6
+
+# START CELL7
+# Split data into training and testing sets (80% train, 20% test)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=42)
+
+scaler = StandardScaler()
+X_train = scaler.fit_transform(X_train)
+X_test = scaler.transform(X_test)
+# END CELL7
+
+# START CELL8
+# Train Random Forest model
+rf_model = RandomForestClassifier(random_state=42)
+rf_model.fit(X_train, y_train)
+rf_pred = rf_model.predict(X_test)
+rf_accuracy = accuracy_score(y_test, rf_pred)
+print(f"Random Forest Accuracy: {rf_accuracy:.2%}")
+# END CELL8
+
+# START CELL9
+# Train SVM model
+svm_model = SVC(random_state=42)
+svm_model.fit(X_train, y_train)
+svm_pred = svm_model.predict(X_test)
+svm_accuracy = accuracy_score(y_test, svm_pred)
+print(f"SVM Accuracy: {svm_accuracy:.2%}")
+# END CELL9
+\`\`\`
+
+Texera UDF conversion:
+\`\`\`json
+{
+ "code": {
+ "UDF1": "# UDF1\nfrom pytexera import *\nimport pandas as pd\nfrom
typing import Iterator, Optional\n\nclass
ProcessTableOperator(UDFTableOperator):\n\n @overrides\n def
process_table(self, table: Table, port: int) ->
Iterator[Optional[TableLike]]:\n # Remove duplicate rows\n data =
table.drop_duplicates()\n\n # Remove rows with null values\n data
= data.dropna()\n\n # Calculate statistics\n min_values =
data.min()\n max_valu [...]
+ "UDF2": "# UDF2\nfrom pytexera import *\nimport pandas as pd\nimport
plotly.express as px\nimport plotly.io\nfrom typing import Iterator,
Optional\n\nclass ProcessTableOperator(UDFTableOperator):\n def
render_error(self, error_msg):\n return '''<h1>Boxplot is not
available.</h1>\n <p>Reason is: {} </p>\n
'''.format(error_msg)\n\n @overrides\n def process_table(self, table:
Table, port: int) -> Iterator[Optional[TableLike]]:\n [...]
+ "UDF3": "# UDF3\nfrom pytexera import *\nimport pandas as pd\nfrom
sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing
import StandardScaler\nfrom typing import Iterator, Optional\n\nclass
ProcessTableOperator(UDFTableOperator):\n\n @overrides\n def
process_table(self, table: Table, port: int) ->
Iterator[Optional[TableLike]]:\n data = table['data'].iloc[0]\n\n
# Separate features and target variable\n X = data.drop('Outcome', ax
[...]
+ "UDF4": "# UDF4\nfrom pytexera import *\nimport pandas as pd\nfrom
sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import
accuracy_score\nfrom typing import Iterator, Optional\n\nclass
ProcessTableOperator(UDFTableOperator):\n\n @overrides\n def
process_table(self, table: Table, port: int) ->
Iterator[Optional[TableLike]]:\n X_train = table['X_train'].iloc[0]\n
y_train = table['y_train'].iloc[0]\n X_test =
table['X_test'].iloc[0]\n [...]
+ "UDF5": "# UDF5\nfrom pytexera import *\nimport pandas as pd\nfrom
sklearn.svm import SVC\nfrom sklearn.metrics import accuracy_score\nfrom typing
import Iterator, Optional\n\nclass ProcessTableOperator(UDFTableOperator):\n\n
@overrides\n def process_table(self, table: Table, port: int) ->
Iterator[Optional[TableLike]]:\n X_train = table['X_train'].iloc[0]\n
y_train = table['y_train'].iloc[0]\n X_test =
table['X_test'].iloc[0]\n y_test = table['y [...]
+ },
+ "edges": [
+ ["UDF1", "UDF2"],
+ ["UDF1", "UDF3"],
+ ["UDF3", "UDF4"],
+ ["UDF3", "UDF5"]
+ ],
+ "outputs": {
+ "UDF1": ["min_values", "max_values", "mean_values", "data"],
+ "UDF2": ["html-content"],
+ "UDF3": ["X_train", "X_test", "y_train", "y_test"],
+ "UDF4": ["rf_model", "rf_accuracy", "X_test", "y_test"],
+ "UDF5": ["svm_model", "svm_accuracy", "X_test", "y_test"]
+ }
+}
+\`\`\`
+`;
+
+export const WORKFLOW_PROMPT = `You are an expert in Python coding and
workflow systems.
+Many users of the Texera system are non-technical, but the notebooks they
provide are written by technical people.
+They want to convert their notebooks to Texera workflows.
+Your goal is to help convert these notebooks into a Texera workflow that
non-technical users can use directly.
+Do not remove or modify any classes or functions; preserve their names and
structure as they are.
+Ensure that all essential logic remains intact.
+Create multiple Texera UDFs from the provided Python code.
+Number the UDFs sequentially starting at 1, and begin each UDF with a comment
that states its UDF number.
+
+Use the class and function names as shown in ProcessTupleOperator,
ProcessTableOperator, and ProcessBatchOperator.
+Do not change the class names, function names, or input parameters.
+Use the ones that make sense and split the code meaningfully as instructed.
+
+Use the starter code provided for Python UDFs.
+
+Use the documentation of Table, Tuple, or Batch to work with parameters within
Texera UDF.
+Do not import other libraries to define these types.
+
+There is no need for an __init__ function. Assume all inputs are valid pandas
DataFrames,
+so do not use .to_pandas(), .to_dataframe(), etc. Do not load data from a file
in the first UDF;
+the workflow's source operator supplies the initial data, so assume it is
already given to you in the
+table parameter. Replacing file-loading code with this input is the one
exception to preserving all
+original code (see below).
+Ensure proper data flow between functions. Separate operators as if they will
run in different files.
+
+Current UDF operators can only have one output. Build a dataframe to yield all
necessary variables
+and data. Ensure proper data flow for each UDF and all information is yielded
(including training
+and testing data) if subsequent UDFs need them.
+
+Ensure all necessary imports are included in each UDF code block.
+
+Each UDF operator should be in its own Python code block. Do not combine them
into a single block.
+Ensure import statements cover all used functions and separate them as
necessary.
+
+It is VERY important that all of the original code in the Jupyter notebook is
represented in the generated workflow.
+Make sure that nothing in the original is removed and that the semantic
meaning of what the original code was doing is retained.
+The only exception is data-loading code (e.g. pd.read_csv); it is represented
by the workflow's input/source operator rather than copied into a UDF.
+If there are user-defined Python classes, include the entire class definition
in the appropriate UDF(s) that use that class.
+Always include the code that defines the class inside every distinct UDF that
uses the class or constructs an object of it.
+Python classes are allowed in Texera UDFs and follow the same semantics as
standard Python.
+They can be defined outside of ProcessTableOperator, ProcessTupleOperator, and
ProcessBatchOperator.
+
+Return only the JSON formatted response, do not give any explanation.
+Do not wrap the JSON in markdown code fences. Output raw JSON only.
+Make sure the response is a valid JSON structure, including closing all braces
and not including commas after the last element.
+Follow this JSON format (do not reuse the values; this is just the format).
'code', 'edges', and 'outputs' are each their own top-level key. Do not nest
any of them inside another, and make sure to close their braces:
+{
+"code": {
+"UDF1": "code for UDF1 goes here",
+"UDF2": "code for UDF2 goes here"
+},
+"edges": [
+["UDF1", "UDF2"]
+],
+"outputs": {
+"UDF1": ["min_values", "max_values", "mean_values", "data"],
+"UDF2": ["html-content"]
+}
+}
+Make sure only the keys in the code section appear in the edges and outputs
sections. Do not include any extraneous fields.
+Do not include extraneous UDFs whose code is an empty string in the code
field.
+Give ALL of the code, do not omit anything or use placeholders for code. Make
sure ALL code in the original is translated over.
+The value of each UDF must be a valid JSON string: escape newlines, quotes,
and backslashes correctly so that the decoded string is runnable Python. Use
whichever quotes the Python code requires.
+Convert following the instructions and examples given. Here is the code:
+`;
+
+export const MAPPING_PROMPT = `
+Here is an example of a mapping generated between the given example Python
code and the Texera UDFs using their CELL and UDF IDs. Cell IDs are designated
by the UUID following '# START'. The format should be kept the same.
+{
+"UDF1": [
+"CELL3",
+"CELL4"
+],
+"UDF2": [
+"CELL5"
+],
+"UDF3": [
+"CELL6",
+"CELL7"
+],
+"UDF4": [
+"CELL8"
+],
+"UDF5": [
+"CELL9"
+]
+}
+Now create a mapping for the UDFs and the original code. Link the code blocks
marked by 'START <cell-uuid>' and 'END <cell-uuid>' with the UDF IDs. The
code between them should be equivalent. Multiple cells can be mapped to the
same UDF when that UDF implements the logic of those cells. There could be any
number of cells and UDFs, so only create the correct number in the mapping.
Only give the mapping.
+`;
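
The prompts above constrain the model's reply in several ways: raw JSON only, three top-level keys (`code`, `edges`, `outputs`), no empty UDFs, and no UDF IDs in `edges` or `outputs` that are missing from `code`. The following is a minimal sketch, not part of this commit, of how a caller might check those constraints before assembling a workflow, and how the UDF-to-cell mapping could be inverted. The helper names `validateWorkflowResponse` and `invertMapping` are hypothetical, and the one-to-many shape of the inverted mapping is an assumption.

```typescript
// Hypothetical helper (not from the commit): checks a raw model reply against
// the constraints stated in WORKFLOW_PROMPT. Returns a list of problems;
// an empty list means every check passed.
function validateWorkflowResponse(raw: string): string[] {
  let parsed: any;
  try {
    parsed = JSON.parse(raw);
  } catch (_err) {
    return ["response is not valid JSON"];
  }
  const isPlainObject = (v: unknown): boolean =>
    typeof v === "object" && v !== null && !Array.isArray(v);
  if (!isPlainObject(parsed)) {
    return ["response is not a JSON object"];
  }
  if (!isPlainObject(parsed.code) || !Array.isArray(parsed.edges) || !isPlainObject(parsed.outputs)) {
    return ["'code', 'edges', and 'outputs' must all be top-level keys"];
  }

  const problems: string[] = [];
  const ids = Object.keys(parsed.code);

  // Every UDF must carry non-empty code.
  for (const id of ids) {
    const src = parsed.code[id];
    if (typeof src !== "string" || src.trim() === "") {
      problems.push(id + " has empty code");
    }
  }
  // Every edge must be a pair of known UDF IDs.
  for (const edge of parsed.edges) {
    if (!Array.isArray(edge) || edge.length !== 2) {
      problems.push("edge is not a pair");
      continue;
    }
    for (const id of edge) {
      if (ids.indexOf(id) === -1) {
        problems.push("edge references unknown UDF " + String(id));
      }
    }
  }
  // Outputs may only describe UDFs that exist in 'code'.
  for (const id of Object.keys(parsed.outputs)) {
    if (ids.indexOf(id) === -1) {
      problems.push("outputs references unknown UDF " + id);
    }
  }
  return problems;
}

// Hypothetical helper: inverts a UDF -> cells mapping (the MAPPING_PROMPT
// format) into cell -> UDFs. Assumes a cell may map to more than one UDF.
function invertMapping(operatorToCell: Record<string, string[]>): Record<string, string[]> {
  const cellToOperator: Record<string, string[]> = {};
  for (const udf of Object.keys(operatorToCell)) {
    for (const cell of operatorToCell[udf]) {
      if (!cellToOperator[cell]) {
        cellToOperator[cell] = [];
      }
      cellToOperator[cell].push(udf);
    }
  }
  return cellToOperator;
}
```

Returning a list of problems instead of throwing on the first one lets a caller report every violation at once, or feed the list back to the model in a retry prompt.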
diff --git a/frontend/yarn.lock b/frontend/yarn.lock
index 694ac59382..36ee0fa3cb 100644
--- a/frontend/yarn.lock
+++ b/frontend/yarn.lock
@@ -30,6 +30,18 @@ __metadata:
languageName: node
linkType: hard
+"@ai-sdk/openai@npm:2.0.67":
+ version: 2.0.67
+ resolution: "@ai-sdk/openai@npm:2.0.67"
+ dependencies:
+ "@ai-sdk/provider": "npm:2.0.0"
+ "@ai-sdk/provider-utils": "npm:3.0.17"
+ peerDependencies:
+ zod: ^3.25.76 || ^4.1.8
+ checksum:
10c0/7e5c407504d7902c17c816aaccd83f642a3b82012cd8467c8f58aef5f08a49b6c31fff775439d541d40b0c8b5b94cc384f18096d1968e23670e22a56fe82d8bd
+ languageName: node
+ linkType: hard
+
"@ai-sdk/provider-utils@npm:3.0.17":
version: 3.0.17
resolution: "@ai-sdk/provider-utils@npm:3.0.17"
@@ -10617,6 +10629,7 @@ __metadata:
resolution: "gui@workspace:."
dependencies:
"@abacritt/angularx-social-login": "npm:2.3.0"
+ "@ai-sdk/openai": "npm:2.0.67"
"@ali-hm/angular-tree-component": "npm:12.0.5"
"@angular-builders/custom-webpack": "npm:21.0.3"
"@angular-devkit/build-angular": "npm:21.2.8"