(texera) branch main updated: feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion (#5260)

github-bot Thu, 25 Jun 2026 11:55:30 -0700

This is an automated email from the ASF dual-hosted git repository.

github-merge-queue[bot] pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/texera.git



The following commit(s) were added to refs/heads/main by this push:
     new 94da3d9387 feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion (#5260)
94da3d9387 is described below

commit 94da3d93875f63179fa0ae92d4936155dffba68c
Author: Ryan Zhang <[email protected]>
AuthorDate: Thu Jun 25 11:42:58 2026 -0700

    feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion (#5260)
    
    ### What changes were proposed in this PR?
    Introduces the frontend LLM session class that converts a Jupyter
    notebook into a Texera workflow JSON plus a bidirectional
    cell-to-operator mapping, along with the prompt library it uses. Both
    files live under `frontend/src/app/workspace/service/notebook-migration/`
    and total ~700 lines (~410 of which are prompt text).
    
    **`migration-llm.ts`** — defines `NotebookMigrationLLM`, an
    `@Injectable` class wrapping a Vercel AI SDK chat session against the
    LiteLLM proxy already exposed on `main` at `/api/chat/completion`.
    - `initialize(modelType, apiKey)` — builds an OpenAI-compatible chat
    client via `createOpenAI({ baseURL: AppSettings.getApiEndpoint() })`
    and seeds the message history with Texera documentation as `system`
    messages.
    - `verifyConnection()` — does a 10-token `ping` call to validate that
    the API key works against the configured model.
    - `convertNotebookToWorkflow(notebook)` — extracts code cells (each
    tagged with a UUID in `metadata.uuid`), sends `WORKFLOW_PROMPT` + the
    notebook to get a JSON of UDF operators / edges, then sends
    `MAPPING_PROMPT` to get the cell↔operator mapping. Assembles a complete
    Texera workflow JSON (`PythonUDFV2` operators with stub input/output
    ports, links derived from the LLM's edge list, default settings) plus a
    bidirectional `operator_to_cell` / `cell_to_operator` mapping. Returns
    both as a JSON string.
    - `close()` — clears the message history and the model reference.
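    
    The cell-to-operator bookkeeping described above can be sketched in
    isolation. This is a minimal illustration, not the code in this PR:
    `buildBidirectionalMapping` and its parameter names are hypothetical,
    and it assumes the LLM mapping response has already been parsed into a
    plain object keyed by UDF id.

```typescript
// Hypothetical helper for illustration only; names are not taken from the PR.
type IdListMap = Record<string, string[]>;

interface CombinedMapping {
  operator_to_cell: IdListMap;
  cell_to_operator: IdListMap;
}

// udfToCells: parsed LLM mapping, e.g. { UDF1: ["CELL1"] }.
// udfIdToOperatorId: lookup from the LLM's UDF id to the generated operator id.
function buildBidirectionalMapping(
  udfToCells: IdListMap,
  udfIdToOperatorId: Record<string, string>
): CombinedMapping {
  const operator_to_cell: IdListMap = {};
  const cell_to_operator: IdListMap = {};

  for (const udfId of Object.keys(udfToCells)) {
    const operatorId = udfIdToOperatorId[udfId];
    if (operatorId === undefined) {
      // Unknown UDF id: skip it rather than emit an "undefined" key.
      continue;
    }
    const cells = udfToCells[udfId];
    operator_to_cell[operatorId] = cells.slice();
    for (const cell of cells) {
      // Invert the relation: one cell may feed several operators.
      if (!cell_to_operator[cell]) {
        cell_to_operator[cell] = [];
      }
      cell_to_operator[cell].push(operatorId);
    }
  }

  return { operator_to_cell, cell_to_operator };
}

// Usage: two cells feed UDF1, and CELL1 also feeds UDF2.
const example = buildBidirectionalMapping(
  { UDF1: ["CELL1", "CELL2"], UDF2: ["CELL1"] },
  { UDF1: "PythonUDFV2-0", UDF2: "PythonUDFV2-1" }
);
console.log(JSON.stringify(example, null, 2));
```

    Skipping unknown UDF ids keeps a typo in the LLM response from
    producing a mapping entry that points at no operator.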
    
    **`migration-prompts.ts`** — string constants used by
    `migration-llm.ts`: `TEXERA_OVERVIEW`, `TUPLE_DOCUMENTATION`,
    `TABLE_DOCUMENTATION`, `OPERATOR_DOCUMENTATION`,
    `UDF_INPUT_PORT_DOCUMENTATION`, `EXAMPLE_OF_GOOD_CONVERSION`,
    `VISUALIZER_DOCUMENTATION`, `EXAMPLE_OF_MULTIPLE_UDF_CONVERSION`,
    `WORKFLOW_PROMPT`, `MAPPING_PROMPT`.
    
    ### Any related issues, documentation, discussions?
    Closes #5259
    Parent issue #4301
    
    
    ### How was this PR tested?
    No unit tests were included for these reasons:
    - A large portion of the changes is prompt text, which is not
    testable, only readable. However, the prompt text can be changed to
    improve the performance of the LLM.
    - Testing would require mocking a significant amount of logic that will
    be introduced in later PRs, since the logic in `migration-llm.ts` is
    parsing a response.
    
    However, I am open to writing tests based on review feedback.
    
    
    ### Was this PR authored or co-authored using generative AI tooling?
    Generated-by: Claude Code (Claude Opus 4.7)
    
    ---------
    
    Co-authored-by: Meng Wang <[email protected]>
---
 frontend/package.json                              |   1 +
 .../notebook-migration/migration-llm.spec.ts       | 306 +++++++++++++++
 .../service/notebook-migration/migration-llm.ts    | 367 ++++++++++++++++++
 .../notebook-migration/migration-prompts.ts        | 414 +++++++++++++++++++++
 frontend/yarn.lock                                 |  13 +
 5 files changed, 1101 insertions(+)

diff --git a/frontend/package.json b/frontend/package.json
index 78f2d10355..418b166ee8 100644
--- a/frontend/package.json
+++ b/frontend/package.json
@@ -21,6 +21,7 @@
   "private": true,
   "dependencies": {
     "@abacritt/angularx-social-login": "2.3.0",
+    "@ai-sdk/openai": "2.0.67",
     "@ali-hm/angular-tree-component": "12.0.5",
     "@angular/animations": "21.2.10",
     "@angular/cdk": "21.2.8",
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts b/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts
new file mode 100644
index 0000000000..58c17cdfc3
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-llm.spec.ts
@@ -0,0 +1,306 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import { NotebookMigrationLLM, Notebook } from "./migration-llm";
+import { GuiConfigService } from "../../../common/service/gui-config.service";
+import { WorkflowUtilService } from "../workflow-graph/util/workflow-util.service";
+import { generateText } from "ai";
+import type { Mock } from "vitest";
+
+// The LLM transport and OpenAI client are mocked so the tests exercise only the
+// deterministic transformation (parsing, operator/edge construction, cell<->operator mapping).
+vi.mock("ai", () => ({ generateText: vi.fn() }));
+vi.mock("@ai-sdk/openai", () => ({
+  createOpenAI: vi.fn(() => ({ chat: vi.fn(() => ({})) })),
+}));
+
+const mockGenerateText = generateText as unknown as Mock;
+
+describe("NotebookMigrationLLM", () => {
+  let opIdCounter = 0;
+  let stubUtil: WorkflowUtilService;
+
+  // Build a fresh, initialized session with stubbed dependencies. The stubbed
+  // getNewOperatorPredicate hands out deterministic ids (PythonUDFV2-0, -1, ...).
+  function makeLLM(): NotebookMigrationLLM {
+    const stubConfig = {
+      env: {
+        pythonNotebookMigrationEnabled: true,
+        defaultDataTransferBatchSize: 400,
+        defaultExecutionMode: "PIPELINED",
+      },
+    } as unknown as GuiConfigService;
+
+    stubUtil = {
+      getNewOperatorPredicate: vi.fn((operatorType: string, customDisplayName?: string) => ({
+        operatorID: `${operatorType}-${opIdCounter++}`,
+        operatorType,
+        operatorVersion: "test-version",
+        operatorProperties: { workers: 1, defaultEnv: true, envName: "" },
+        inputPorts: [{ portID: "input-0", disallowMultiInputs: false }],
+        outputPorts: [{ portID: "output-0" }],
+        showAdvanced: false,
+        isDisabled: false,
+        customDisplayName,
+        dynamicInputPorts: true,
+        dynamicOutputPorts: true,
+      })),
+    } as unknown as WorkflowUtilService;
+
+    const llm = new NotebookMigrationLLM(stubConfig, stubUtil);
+    // Pass an explicit token so tests don't depend on AuthService/localStorage state.
+    llm.initialize("gpt-5-mini", "test-token");
+    return llm;
+  }
+
+  function codeCell(uuid: string | undefined, source: string) {
+    return { cell_type: "code", metadata: uuid === undefined ? {} : { uuid }, source };
+  }
+
+  // Queue the two responses convertNotebookToWorkflow consumes, in order.
+  function mockResponses(workflowResponse: string, mappingResponse: string) {
+    mockGenerateText.mockResolvedValueOnce({ text: workflowResponse }).mockResolvedValueOnce({ text: mappingResponse });
+  }
+
+  beforeEach(() => {
+    opIdCounter = 0;
+    mockGenerateText.mockReset();
+  });
+
+  describe("convertNotebookToWorkflow", () => {
+    it("builds operators, links, positions, and a bidirectional mapping", async () => {
+      const notebook: Notebook = {
+        cells: [codeCell("CELL1", "print(1)"), codeCell("CELL2", "print(2)")],
+      };
+      mockResponses(
+        JSON.stringify({
+          code: { UDF1: "code1", UDF2: "code2" },
+          edges: [["UDF1", "UDF2"]],
+          outputs: { UDF1: ["a", "b"], UDF2: ["c"] },
+        }),
+        JSON.stringify({ UDF1: ["CELL1"], UDF2: ["CELL2"] })
+      );
+
+      const { workflowJSON, workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      expect(workflowJSON.operators.map((op: any) => op.operatorID)).toEqual(["PythonUDFV2-0", "PythonUDFV2-1"]);
+      expect(workflowJSON.operators[0].operatorProperties).toMatchObject({
+        code: "code1",
+        retainInputColumns: false,
+      });
+      expect(workflowJSON.operatorPositions).toEqual({
+        "PythonUDFV2-0": { x: 140, y: 0 },
+        "PythonUDFV2-1": { x: 280, y: 0 },
+      });
+      expect(workflowJSON.links).toHaveLength(1);
+      expect(workflowJSON.links[0].source).toEqual({ operatorID: "PythonUDFV2-0", portID: "output-0" });
+      expect(workflowJSON.links[0].target).toEqual({ operatorID: "PythonUDFV2-1", portID: "input-0" });
+      expect(workflowNotebookMapping.operator_to_cell).toEqual({
+        "PythonUDFV2-0": ["CELL1"],
+        "PythonUDFV2-1": ["CELL2"],
+      });
+      expect(workflowNotebookMapping.cell_to_operator).toEqual({
+        CELL1: ["PythonUDFV2-0"],
+        CELL2: ["PythonUDFV2-1"],
+      });
+      // Settings come from GUI config defaults, not hardcoded values.
+      expect(workflowJSON.settings).toEqual({ dataTransferBatchSize: 400, executionMode: "PIPELINED" });
+    });
+
+    // Intermediate UDFs (a source of some edge) keep "binary" for object passing; terminal
+    // UDFs (no outgoing edge) default to "string" so the result panel renders typed values.
+    it("types intermediate UDF outputs as binary and terminal UDF outputs as string", async () => {
+      const notebook: Notebook = { cells: [codeCell("CELL1", "a"), codeCell("CELL2", "b")] };
+      mockResponses(
+        JSON.stringify({
+          code: { UDF1: "code1", UDF2: "code2" },
+          edges: [["UDF1", "UDF2"]],
+          outputs: { UDF1: ["x"], UDF2: ["y"] },
+        }),
+        JSON.stringify({ UDF1: ["CELL1"], UDF2: ["CELL2"] })
+      );
+
+      const { workflowJSON } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      // UDF1 is a source (intermediate) -> binary; UDF2 is terminal -> string.
+      expect(workflowJSON.operators[0].operatorProperties.outputColumns).toEqual([
+        { attributeName: "x", attributeType: "binary" },
+      ]);
+      expect(workflowJSON.operators[1].operatorProperties.outputColumns).toEqual([
+        { attributeName: "y", attributeType: "string" },
+      ]);
+    });
+
+    it("maps multiple cells onto the same UDF, and one cell onto multiple UDFs", async () => {
+      const notebook: Notebook = {
+        cells: [codeCell("CELL1", "a"), codeCell("CELL2", "b")],
+      };
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "c1", UDF2: "c2" }, edges: [], outputs: {} }),
+        JSON.stringify({ UDF1: ["CELL1", "CELL2"], UDF2: ["CELL1"] })
+      );
+
+      const { workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      expect(workflowNotebookMapping.operator_to_cell).toEqual({
+        "PythonUDFV2-0": ["CELL1", "CELL2"],
+        "PythonUDFV2-1": ["CELL1"],
+      });
+      expect(workflowNotebookMapping.cell_to_operator).toEqual({
+        CELL1: ["PythonUDFV2-0", "PythonUDFV2-1"],
+        CELL2: ["PythonUDFV2-0"],
+      });
+    });
+
+    it("skips (with a warning) an edge that references an unknown UDF id", async () => {
+      const warn = vi.spyOn(console, "warn").mockImplementation(() => {});
+      const notebook: Notebook = { cells: [codeCell("CELL1", "a")] };
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "c1" }, edges: [["UDF1", "UDFX"]], outputs: {} }),
+        JSON.stringify({ UDF1: ["CELL1"] })
+      );
+
+      const { workflowJSON } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      // The dangling edge is dropped rather than producing an undefined endpoint.
+      expect(workflowJSON.links).toEqual([]);
+      expect(warn).toHaveBeenCalledWith(expect.stringContaining("UDFX"));
+      warn.mockRestore();
+    });
+
+    it("skips (with a warning) a mapping entry that references an unknown UDF id", async () => {
+      const warn = vi.spyOn(console, "warn").mockImplementation(() => {});
+      const notebook: Notebook = { cells: [codeCell("CELL1", "a")] };
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "c1" }, edges: [], outputs: {} }),
+        JSON.stringify({ UDF1: ["CELL1"], UDFTYPO: ["CELL1"] })
+      );
+
+      const { workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      // Only the valid UDF id survives in the mapping.
+      expect(workflowNotebookMapping.operator_to_cell).toEqual({ "PythonUDFV2-0": ["CELL1"] });
+      expect(workflowNotebookMapping.cell_to_operator).toEqual({ CELL1: ["PythonUDFV2-0"] });
+      expect(warn).toHaveBeenCalledWith(expect.stringContaining("UDFTYPO"));
+      warn.mockRestore();
+    });
+
+    it("handles empty code, edges, and outputs", async () => {
+      const notebook: Notebook = { cells: [] };
+      mockResponses(JSON.stringify({ code: {}, edges: [], outputs: {} }), JSON.stringify({}));
+
+      const { workflowJSON, workflowNotebookMapping } = JSON.parse(await makeLLM().convertNotebookToWorkflow(notebook));
+
+      expect(workflowJSON.operators).toEqual([]);
+      expect(workflowJSON.links).toEqual([]);
+      expect(workflowNotebookMapping.operator_to_cell).toEqual({});
+      expect(workflowNotebookMapping.cell_to_operator).toEqual({});
+    });
+
+    it("rejects when a code cell is missing metadata.uuid", async () => {
+      const notebook: Notebook = { cells: [codeCell(undefined, "print(1)")] };
+
+      await expect(makeLLM().convertNotebookToWorkflow(notebook)).rejects.toThrow(/metadata\.uuid/);
+      // It fails before prompting, so the LLM is never called.
+      expect(mockGenerateText).not.toHaveBeenCalled();
+    });
+
+    it("joins array-form cell source (nbformat lines) without inserting commas", async () => {
+      const notebook: Notebook = {
+        cells: [
+          {
+            cell_type: "code",
+            metadata: { uuid: "CELL1" },
+            source: ["import pandas as pd\n", "x = 1\n"],
+          },
+        ],
+      };
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "c1" }, edges: [], outputs: {} }),
+        JSON.stringify({ UDF1: ["CELL1"] })
+      );
+
+      await makeLLM().convertNotebookToWorkflow(notebook);
+
+      const allPromptContent = mockGenerateText.mock.calls
+        .flatMap(call => call[0].messages.map((m: any) => m.content))
+        .join("\n");
+      expect(allPromptContent).toContain("import pandas as pd\nx = 1\n");
+      expect(allPromptContent).not.toContain("import pandas as pd\n,");
+    });
+
+    it("resets conversation history between conversions so a prior notebook does not leak", async () => {
+      const llm = makeLLM();
+
+      // First conversion (notebook AAA) on the instance.
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "codeAAA" }, edges: [], outputs: {} }),
+        JSON.stringify({ UDF1: ["AAA"] })
+      );
+      await llm.convertNotebookToWorkflow({ cells: [codeCell("AAA", "a = 1")] });
+
+      // Second conversion (notebook BBB) on the SAME instance, no close()/initialize() between.
+      mockResponses(
+        JSON.stringify({ code: { UDF1: "codeBBB" }, edges: [], outputs: {} }),
+        JSON.stringify({ UDF1: ["BBB"] })
+      );
+      await llm.convertNotebookToWorkflow({ cells: [codeCell("BBB", "b = 2")] });
+
+      // The 3rd generateText call is the workflow prompt of the second conversion.
+      const secondConversionMessages = mockGenerateText.mock.calls[2][0].messages.map((m: any) => m.content).join("\n");
+
+      expect(secondConversionMessages).toContain("# START BBB");
+      expect(secondConversionMessages).not.toContain("AAA");
+      expect(secondConversionMessages).not.toContain("codeAAA");
+    });
+  });
+
+  describe("parseJsonResponse", () => {
+    // parseJsonResponse is private; cast to access it directly for focused coverage.
+    const parse = (raw: string) => (makeLLM() as any).parseJsonResponse(raw, "workflow");
+
+    it("parses bare JSON", () => {
+      expect(parse('{"a":1}')).toEqual({ a: 1 });
+    });
+
+    it("strips a ```json fence", () => {
+      expect(parse('```json\n{"a":1}\n```')).toEqual({ a: 1 });
+    });
+
+    it("strips a plain ``` fence", () => {
+      expect(parse('```\n{"a":1}\n```')).toEqual({ a: 1 });
+    });
+
+    it("tolerates surrounding whitespace and newlines around the fence", () => {
+      expect(parse('\n\n  ```json\n{"a":1}\n```  \n\n')).toEqual({ a: 1 });
+    });
+
+    it("throws a contextual error on malformed JSON", () => {
+      expect(() => parse("not json")).toThrow("Failed to parse LLM workflow response as JSON");
+    });
+
+    it("extracts fenced JSON even when surrounded by prose", () => {
+      expect(parse('Here is the JSON: ```json\n{"a":1}\n```\nThanks!')).toEqual({ a: 1 });
+    });
+
+    it("extracts the outermost object from fence-less prose", () => {
+      expect(parse('Sure! {"a":1} hope that helps')).toEqual({ a: 1 });
+    });
+  });
+});
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts b/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts
new file mode 100644
index 0000000000..2922c3ee0e
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-llm.ts
@@ -0,0 +1,367 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+import { Injectable } from "@angular/core";
+import { GuiConfigService } from "../../../common/service/gui-config.service";
+import { AuthService } from "../../../common/service/user/auth.service";
+import { createOpenAI } from "@ai-sdk/openai";
+import { generateText, type ModelMessage } from "ai";
+import { AppSettings } from "../../../common/app-setting";
+import { v4 as uuidv4 } from "uuid";
+import { WorkflowUtilService } from "../workflow-graph/util/workflow-util.service";
+import { OperatorPredicate } from "../../types/workflow-common.interface";
+import { WorkflowSettings } from "../../../common/type/workflow";
+import {
+  TEXERA_OVERVIEW,
+  TUPLE_DOCUMENTATION,
+  TABLE_DOCUMENTATION,
+  OPERATOR_DOCUMENTATION,
+  UDF_INPUT_PORT_DOCUMENTATION,
+  EXAMPLE_OF_GOOD_CONVERSION,
+  VISUALIZER_DOCUMENTATION,
+  EXAMPLE_OF_MULTIPLE_UDF_CONVERSION,
+  WORKFLOW_PROMPT,
+  MAPPING_PROMPT,
+} from "./migration-prompts";
+
+interface Cell {
+  cell_type: string;
+  metadata: { [key: string]: any };
+  // nbformat stores source as either a single string or an array of line strings.
+  source: string | string[];
+}
+
+export interface Notebook {
+  cells: Cell[];
+}
+
+interface WorkflowJSON {
+  operators: OperatorPredicate[];
+  operatorPositions: Record<string, { x: number; y: number }>;
+  links: any[];
+  commentBoxes: any[];
+  settings: WorkflowSettings;
+}
+
+interface CombinedMapping {
+  operator_to_cell: Record<string, string[]>;
+  cell_to_operator: Record<string, string[]>;
+}
+
+/**
+ * Wraps a single LLM chat session that converts a Jupyter notebook into a Texera
+ * workflow plus a cell<->operator mapping.
+ *
+ * Lifecycle: `initialize()` -> `verifyConnection()` (optional) ->
+ * `convertNotebookToWorkflow()` -> `close()`. The session keeps a running `messages`
+ * history shared by the prompts within one conversion. `convertNotebookToWorkflow()`
+ * resets that history to the documentation prelude at its start, so the same instance
+ * can convert multiple notebooks without leaking one conversion's context into the next.
+ *
+ * Output column types: intermediate UDFs declare their output columns as `binary` so rich
+ * Python objects (DataFrames, arrays, models) round-trip between operators via pickle.
+ * Terminal UDFs (no outgoing edge) declare their outputs as `string` so the result panel
+ * renders viewable values rather than opaque binary blobs.
+ */
+@Injectable()
+export class NotebookMigrationLLM {
+  private model: any;
+  private messages: ModelMessage[] = [];
+  private initialized = false;
+
+  private static readonly DOCUMENTATION: string[] = [
+    TEXERA_OVERVIEW,
+    TUPLE_DOCUMENTATION,
+    TABLE_DOCUMENTATION,
+    OPERATOR_DOCUMENTATION,
+    EXAMPLE_OF_GOOD_CONVERSION,
+    VISUALIZER_DOCUMENTATION,
+    UDF_INPUT_PORT_DOCUMENTATION,
+    EXAMPLE_OF_MULTIPLE_UDF_CONVERSION,
+  ];
+
+  constructor(
+    private config: GuiConfigService,
+    private workflowUtilService: WorkflowUtilService
+  ) {}
+
+  private get enabled(): boolean {
+    return this.config.env.pythonNotebookMigrationEnabled;
+  }
+
+  private assertEnabled(): void {
+    if (!this.enabled) {
+      throw new Error("Notebook migration feature is disabled");
+    }
+  }
+
+  /**
+   * Seed the conversation with the Texera documentation prelude, discarding any
+   * prior conversation. Used by initialize() and at the start of each conversion.
+   */
+  private seedDocumentation(): void {
+    this.messages = NotebookMigrationLLM.DOCUMENTATION.map(
+      (doc): ModelMessage => ({
+        role: "system",
+        content: doc,
+      })
+    );
+  }
+
+  private parseJsonResponse(raw: string, context: string): any {
+    let text = raw.trim();
+
+    // Prefer the contents of a fenced code block if present (```json ... ``` or ``` ... ```),
+    // even when wrapped in prose. Otherwise fall back to the outermost {...} object.
+    const fenced = text.match(/```(?:[a-zA-Z]+)?\s*([\s\S]*?)```/);
+    if (fenced) {
+      text = fenced[1].trim();
+    } else {
+      const firstBrace = text.indexOf("{");
+      const lastBrace = text.lastIndexOf("}");
+      if (firstBrace !== -1 && lastBrace > firstBrace) {
+        text = text.slice(firstBrace, lastBrace + 1);
+      }
+    }
+
+    try {
+      return JSON.parse(text);
+    } catch (err) {
+      throw new Error(`Failed to parse LLM ${context} response as JSON: ${(err as Error).message}`);
+    }
+  }
+
+  /**
+   * Initialize a new LLM session with Texera documentation
+   */
+  public initialize(modelType: string = "gpt-5-mini", accessToken: string = AuthService.getAccessToken() ?? ""): void {
+    this.assertEnabled();
+    this.model = createOpenAI({
+      baseURL: new URL(`${AppSettings.getApiEndpoint()}`, document.baseURI).toString(),
+      // The /api/chat/* LiteLLM proxy authenticates the caller with the Texera JWT. The AI SDK
+      // sends this value verbatim as `Authorization: Bearer <token>`, so we pass the user's
+      // access token; the backend validates it, then substitutes the LiteLLM master key upstream.
+      apiKey: accessToken,
+    }).chat(modelType);
+
+    this.seedDocumentation();
+
+    this.initialized = true;
+  }
+
+  /**
+   * Verify the connection to the LLM using the current access token
+   */
+  public async verifyConnection(): Promise<boolean> {
+    if (!this.enabled) return false;
+    if (!this.initialized) {
+      throw new Error("LLM session not initialized");
+    }
+
+    try {
+      await generateText({
+        model: this.model,
+        messages: [
+          {
+            role: "user",
+            content: "ping",
+          },
+        ],
+        maxOutputTokens: 10,
+      });
+
+      return true;
+    } catch (err) {
+      console.error("API key verification failed:", err);
+      return false;
+    }
+  }
+
+  /**
+   * Send a prompt and receive a response.
+   * All prior documentation and conversation is preserved.
+   */
+  private async sendPrompt(prompt: string): Promise<string> {
+    if (!this.initialized) {
+      throw new Error("LLM session not initialized");
+    }
+
+    this.messages.push({
+      role: "user",
+      content: prompt,
+    });
+
+    const result = await generateText({
+      model: this.model,
+      messages: this.messages,
+    });
+
+    this.messages.push({
+      role: "assistant",
+      content: result.text,
+    });
+
+    return result.text;
+  }
+
+  /**
+   * Send a Jupyter Notebook to be converted into a workflow and mapping.
+   */
+  public async convertNotebookToWorkflow(notebook: Notebook): Promise<string> {
+    this.assertEnabled();
+    if (!this.initialized) {
+      throw new Error("LLM session not initialized");
+    }
+
+    // Reset to the documentation prelude so a prior conversion's prompts/responses
+    // don't leak into this one. The two sendPrompt calls below still share history.
+    this.seedDocumentation();
+
+    const codeCells = notebook.cells.filter(cell => cell.cell_type === "code");
+
+    // Every code cell must carry a unique metadata.uuid; it is the join key for the
+    // cell<->operator mapping. Without it, untagged cells collide on the "undefined" marker.
+    const untagged = codeCells.find(cell => cell.metadata?.uuid == null || String(cell.metadata.uuid).trim() === "");
+    if (untagged) {
+      throw new Error("Notebook code cells must each have a metadata.uuid before conversion");
+    }
+
+    const notebookString = codeCells
+      .map(cell => {
+        const uuid = String(cell.metadata.uuid);
+        // nbformat line arrays already include trailing newlines, so join with "".
+        const source = Array.isArray(cell.source) ? cell.source.join("") : cell.source;
+        return `# START ${uuid}\n${source}\n# END ${uuid}`;
+      })
+      .join("\n\n");
+
+    const workflow = await this.sendPrompt(`${WORKFLOW_PROMPT}\n${notebookString}`);
+    const mapping = await this.sendPrompt(MAPPING_PROMPT);
+
+    // Remove ```json blocks and parse
+    const udfLLMResponse = this.parseJsonResponse(workflow, "workflow");
+
+    const workflowJSON: WorkflowJSON = {
+      operators: [],
+      operatorPositions: {},
+      links: [],
+      commentBoxes: [],
+      settings: {
+        dataTransferBatchSize: this.config.env.defaultDataTransferBatchSize,
+        executionMode: this.config.env.defaultExecutionMode,
+      },
+    };
+
+    const udfMappingToUUID: Record<string, string> = {};
+
+    // UDFs that are never the source of an edge are terminal (result-facing). Their outputs
+    // default to "string" so the result panel renders typed values; intermediate UDFs keep
+    // "binary" so rich objects (DataFrames, arrays, models) round-trip between operators via pickle.
+    const edgeSources = new Set<string>((udfLLMResponse.edges || []).map(([source]: [string, string]) => source));
+
+    Object.entries(udfLLMResponse.code).forEach(([udfId, udfCode], i) => {
+      let udfOutputColumns: { attributeName: string; attributeType: string }[] = [];
+      if (udfLLMResponse.outputs && udfLLMResponse.outputs[udfId]) {
+        const attributeType = edgeSources.has(udfId) ? "binary" : "string";
+        udfOutputColumns = udfLLMResponse.outputs[udfId].map((attr: string) => ({
+          attributeName: attr,
+          attributeType,
+        }));
+      }
+
+      // Build the operator from the live PythonUDFV2 schema so the operatorVersion, ports, and
+      // property defaults track the backend definition, then overlay the generated code/outputs.
+      const base = this.workflowUtilService.getNewOperatorPredicate("PythonUDFV2", udfId);
+      const operator: OperatorPredicate = {
+        ...base,
+        operatorProperties: {
+          ...base.operatorProperties,
+          code: udfCode,
+          retainInputColumns: false,
+          outputColumns: udfOutputColumns,
+        },
+      };
+
+      udfMappingToUUID[udfId] = operator.operatorID;
+      workflowJSON.operators.push(operator);
+      workflowJSON.operatorPositions[operator.operatorID] = { x: 140 * (i + 1), y: 0 };
+    });
+
+    const knownUdfIds = new Set(Object.keys(udfMappingToUUID));
+
+    // Add links/edges. Skip (with a warning) any edge that references a UDF id the LLM
+    // never defined in `code`, rather than emitting a link with an undefined endpoint.
+    (udfLLMResponse.edges || []).forEach(([source, target]: [string, string]) => {
+      if (!knownUdfIds.has(source) || !knownUdfIds.has(target)) {
+        console.warn(`Skipping edge with unknown UDF id: ${source} -> ${target}`);
+        return;
+      }
+      workflowJSON.links.push({
+        linkID: `link-${uuidv4()}`,
+        source: {
+          operatorID: udfMappingToUUID[source],
+          portID: "output-0",
+        },
+        target: {
+          operatorID: udfMappingToUUID[target],
+          portID: "input-0",
+        },
+      });
+    });
+
+    // Parse mapping
+    const parsedMapping: Record<string, string[]> = this.parseJsonResponse(mapping, "mapping");
+
+    const udfToCell: Record<string, string[]> = {};
+    const cellToUdf: Record<string, string[]> = {};
+
+    Object.entries(parsedMapping).forEach(([udf, cells]) => {
+      if (!knownUdfIds.has(udf)) {
+        console.warn(`Skipping mapping entry with unknown UDF id: ${udf}`);
+        return;
+      }
+      const udfUUID = udfMappingToUUID[udf];
+      udfToCell[udfUUID] = cells;
+      cells.forEach(cell => {
+        if (!cellToUdf[cell]) {
+          cellToUdf[cell] = [udfUUID];
+        } else {
+          cellToUdf[cell].push(udfUUID);
+        }
+      });
+    });
+
+    const workflowNotebookMapping: CombinedMapping = {
+      operator_to_cell: udfToCell,
+      cell_to_operator: cellToUdf,
+    };
+
+    return JSON.stringify({ workflowJSON, workflowNotebookMapping });
+  }
+
+  /**
+   * Closes the session.
+   * Clears all context and releases references.
+   */
+  public close(): void {
+    this.messages = [];
+    this.model = null;
+    this.initialized = false;
+  }
+}
diff --git a/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts b/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts
new file mode 100644
index 0000000000..2594d2d58e
--- /dev/null
+++ b/frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts
@@ -0,0 +1,414 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+// TEXERA DOCUMENTATION
+
+// https://github.com/apache/texera/wiki/Guide-to-Use-a-Python-UDF
+export const TEXERA_OVERVIEW = `
+You are a robust compiler that takes python code and translates it to our 
personal workflow environment Texera that uses python.
+
+  Texera is a data analytics tool that uses workflows to do machine learning and data analytics computation. Users are able to drag and drop operators and connect their inputs and outputs in a workflow graphical user interface, which is where the code we are going to create will run.
+
+Texera is able to use Python user defined functions. Documentation of a Python 
UDF in Texera follows:
+  Process Data APIs
+
+There are three APIs to process the data in different units.
+
+  Tuple API.
+
+class ProcessTupleOperator(UDFOperatorV2):
+
+    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+        yield tuple_
+
+Tuple API takes one input tuple from a port at a time. It returns an iterator 
of optional TupleLike instances. A TupleLike is any data structure that 
supports key-value pairs, such as pytexera.Tuple, dict, defaultdict, 
NamedTuple, etc.
+
+  Tuple API is useful for implementing functional operations which are applied 
to tuples one by one, such as map, reduce, and filter.
+
+  Table API.
+
+class ProcessTableOperator(UDFTableOperator):
+
+    def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+        yield table
+
+Table API consumes a Table at a time, which consists of all the tuples from a 
port. It returns an iterator of optional TableLike instances. A TableLike is a 
collection of TupleLike, and currently, we support pytexera.Table and 
pandas.DataFrame as a TableLike instance. More flexible types will be supported 
down the road.
+
+  Table API is useful for implementing blocking operations that will consume 
all the data from one port, such as join, sort, and machine learning training.
+
+  Batch API.
+
+class ProcessBatchOperator(UDFBatchOperator):
+
+    BATCH_SIZE = 10
+
+    def process_batch(self, batch: Batch, port: int) -> Iterator[Optional[BatchLike]]:
+        yield batch
+
+Batch API consumes a batch of tuples at a time. Similar to Table, a Batch is 
also a collection of Tuples; however, its size is defined by the BATCH_SIZE, 
and one port can have multiple batches. It returns an iterator of optional 
BatchLike instances. A BatchLike is a collection of TupleLike, and currently, 
we support pytexera.Batch and pandas.DataFrame as a BatchLike instance. More 
flexible types will be supported down the road.
+
+  The Batch API serves as a hybrid API combining the features of both the 
Tuple and Table APIs. It is particularly valuable for striking a balance 
between time and space considerations, offering a trade-off that optimizes 
efficiency.
+
+  All three APIs can return an empty iterator by yielding None.
+
+  The template code for a Python UDF follows: MAKE SURE TO USE THE CLASS NAMES 
AND FUNCTIONS DEFINED, THIS IS A MUST FOR THE PROGRAM TO WORK. SELECT 1 OUT OF 
THE 3 PROCESSING OPERATOR FUNCTIONS TO BUILD DEPENDING ON THE CONTEXT OF CODE 
TRANSLATION.
+# Choose from the following templates:
+  #
+# from pytexera import *
+#
+# class ProcessTupleOperator(UDFOperatorV2):
+#
+#     @overrides
+#     def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+#         yield tuple_
+#
+# class ProcessBatchOperator(UDFBatchOperator):
+#     BATCH_SIZE = 10 # must be a positive integer
+#
+#     @overrides
+#     def process_batch(self, batch: Batch, port: int) -> Iterator[Optional[BatchLike]]:
+#         yield batch
+#
+# class ProcessTableOperator(UDFTableOperator):
+#
+#     @overrides
+#     def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+#         yield table
+`;
+
+// https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/tuple.py
+export const TUPLE_DOCUMENTATION = `
+### **<code>Tuple</code> Class Overview**
+
+The \`Tuple\` class is a **lazy-evaluated** data structure designed for 
efficient field storage and access. It provides:
+
+  1. **Support for Multiple Data Sources**:
+* Can be initialized from a \`TupleLike\` object, such as a \`pandas.Series\`, 
\`OrderedDict\`, or another \`Tuple\` instance.
+* Works with \`ArrowTableTupleProvider\` to access \`pyarrow.Table\` data.
+2. **Lazy Field Evaluation**:
+* Field values can be either **directly stored values** or **lazy accessors** 
(\`field_accessor\`).
+* If a field is accessed and is an accessor, it is evaluated and cached.
+3. **Schema (<code>Schema</code>) Enforcement**:
+  * A \`Tuple\` can be created without a schema but can be **finalized** with 
one using \`finalize(schema)\`, which:
+* **Casts field values** (e.g., \`NaN → None\`, \`Object → Bytes\`).
+* **Validates field completeness**, ensuring all fields match the \`Schema\`.
+4. **Pythonic Access Patterns**:
+* **Index-based access**: \`tuple["field_name"]\` or \`tuple[index]\` 
retrieves field values.
+* **Dictionary-like operations**: \`tuple.as_dict()\` returns an 
\`OrderedDict\`, and \`tuple.as_series()\` converts to a \`pandas.Series\`.
+* **Iterable support**: \`for field in tuple\` iterates over field values.
+5. **Hashing and Comparisons**:
+* Implements \`__hash__\` using a Java-like hashing algorithm, allowing usage 
as dictionary keys.
+* Implements \`__eq__\`, supporting equality checks based on field contents.
+6. **Partial Data Extraction**:
+* \`tuple.get_partial_tuple(attribute_names)\` returns a new \`Tuple\` 
instance containing only the specified fields.
+`;
+
+// https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/table.py
+export const TABLE_DOCUMENTATION = `### **<code>Table</code> Class Overview**
+
+The \`Table\` class extends \`pandas.DataFrame\`, providing **structured 
Tuple-based data management**. It is designed to integrate seamlessly with 
\`Tuple\` objects.
+
+#### **Key Features:**
+
+1. **Flexible Construction:**
+* Can be initialized from various sources:
+* Another \`Table\` (\`from_table(table)\`)
+* A \`pandas.DataFrame\` (\`from_data_frame(df)\`)
+* A list/iterator of \`TupleLike\` objects (\`from_tuple_likes(tuple_likes)\`)
+* Ensures all \`Tuple\` objects in a \`Table\` have **consistent field names**.
+2. **Tuple Conversion:**
+* \`as_tuples()\`: Converts the table rows into an **iterator of 
<code>Tuple</code> instances**, preserving the row order.
+3. **Equality Comparison (<code>__eq__</code>):**
+* Supports **row-wise equality checks** by comparing the underlying \`Tuple\` 
objects.
+4. **Universal Tuple Output (<code>all_output_to_tuple</code>):**
+* A helper function to convert **various data types** into \`Tuple\` 
iterators, supporting:
+* \`None\` → \`[None]\`
+* \`Table\` → \`as_tuples()\`
+* \`pandas.DataFrame\` → Converted into a \`Table\`, then to Tuples
+* \`List[TupleLike]\` → Converted to \`Tuple\` instances
+* A single \`TupleLike\` or \`Tuple\` → Wrapped in an iterator
+
+#### **Relation to <code>Tuple</code>:**
+
+* \`Table\` **stores multiple <code>Tuple</code> objects** and ensures schema 
consistency across rows.
+* Provides an **efficient bridge** between \`Tuple\`-based data and 
\`pandas.DataFrame\`, enabling compatibility with Python's data analysis tools.
+`;
+
+// https://github.com/apache/texera/blob/main/amber/src/main/python/core/models/operator.py
+export const OPERATOR_DOCUMENTATION = `### **Operator Class Overview**
+
+The \`Operator\` class is an **abstract base class (ABC)** for all operators, 
defining the fundamental structure for processing \`Tuple\`, \`Batch\`, and 
\`Table\` data in a workflow.
+
+#### **Key Features & Hierarchy**
+
+1. **Base <code>Operator</code> Class**:
+* Defines lifecycle methods: \`open()\` and \`close()\`.
+* Supports a **source flag (<code>is_source</code>)** to distinguish source 
operators from others.
+2. **Tuple-Based Processing (<code>TupleOperatorV2</code>)**:
+* Processes individual \`Tuple\` objects through \`process_tuple(tuple_, 
port)\`.
+* Calls \`on_finish(port)\` when an input port is exhausted.
+3. **Types of Operators**:
+* **SourceOperator**:
+* Produces data via \`produce()\`, yielding \`TupleLike\` or \`TableLike\` 
objects.
+* Overrides \`on_finish(port)\` to output produced data.
+* **BatchOperator**:
+* Collects tuples into batches (\`BATCH_SIZE\`) before processing via 
\`process_batch(batch, port)\`.
+* Converts processed batches (typically \`pandas.DataFrame\`) into \`Tuple\` 
output.
+* **TableOperator**:
+* Collects tuples into a \`Table\` before processing via 
\`process_table(table, port)\`.
+* Converts processed \`Table\` output back into tuples.
+4. **Data Flow & Processing**:
+* Operators receive data **tuple-by-tuple**, **batch-by-batch**, or 
**table-by-table** depending on the type.
+* Results are **iterators** of transformed data (\`TupleLike\`, \`BatchLike\`, 
or \`TableLike\`).
+5. **Deprecated <code>TupleOperator</code>**:
+* The older version of \`TupleOperator\` is deprecated in favor of 
\`TupleOperatorV2\`.
+
+#### Relation to <code>Tuple</code> and <code>Table</code>
+
+* Operators **consume and transform** \`Tuple\` and \`Table\` data within a 
workflow.
+* **Tuple-based operators** process row-wise, while **Table operators** handle 
structured table transformations.
+* **Source operators** initiate the data flow by generating tuples or tables.`;
+
+export const UDF_INPUT_PORT_DOCUMENTATION = `
+Python UDF operators support multiple input and output ports, allowing a 
single operator to receive different types of data from various upstream 
operators. In the process_tuple(self, tuple_: Tuple, port: int) function in 
ProcessTupleOperator and the process_table(self, table: Table, port: int) 
function in ProcessTableOperator, the port parameter indicates the input port. 
The port numbers are assigned in order, starting from 0 to N, from top to 
bottom. When input data have different sche [...]
+
+Using this knowledge, for situations where multiple upstream UDFs act as input 
to a single UDF, we can introduce an intermediary UDF that collects all of the 
input data and reformats it into a single table, which is then passed as input 
to the original next downstream UDF. When it is necessary for this to occur in 
your translation from notebook to UDFs, include the intermediary UDF and make 
sure that it and the next operator that uses its output is formatted correctly 
and handles the dat [...]
+`;
+
+export const EXAMPLE_OF_GOOD_CONVERSION = `
+Here is an example of Python code translated into a compatible Texera UDF that gives output that abides by the output schema compatible with the Texera workflow operators for tuples. Other operators do not always follow this strict format, but the yielding output structure is important.
+
+Python Code (high level idea): Given some data, the code limits the number of tuples that are passed through.
+
+Texera Operator code:
+from pytexera import *
+
+class ProcessTupleOperator(UDFOperatorV2):
+    def __init__(self):
+        self.limit = 10
+        self.count = 0
+    @overrides
+    def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+        if self.count < self.limit:
+            self.count += 1
+            yield tuple_
+
+`;
+
+export const VISUALIZER_DOCUMENTATION = `
+Texera requires a unique way of generating visualizations from ML libraries:
+1. Ensures one yield per operator (per Texera’s UDF constraints).
+2. Uses Plotly for visualization and outputs results as embeddable HTML.
+3. Error handling is built-in to notify users when data is missing.
+`;
+
+export const EXAMPLE_OF_MULTIPLE_UDF_CONVERSION = `
+Here is an example of breaking up python code into multiple Texera UDFs. 
Format your response structure exactly like the given example. The "code" key 
contains a dictionary of the UDF ID's with their respective code. The "edges" 
key contains a list of pairs that contains the connections between UDFs. The 
"outputs" key contains a dictionary of the UDF ID's with a list of the output 
column names of the DataFrame that the UDF yields. The UDFs can branch and 
merge, it does not have to be a l [...]
+
+Original Code:
+\`\`\`python
+# START CELL1
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.svm import SVC
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+# END CELL1
+
+# START CELL2
+# Load the dataset
+file_path = 'diabetes.csv'
+data = pd.read_csv(file_path)
+# END CELL2
+
+# START CELL3
+# Remove duplicate rows
+data = data.drop_duplicates()
+
+# Remove rows with null values
+data = data.dropna()
+# END CELL3
+
+# START CELL4
+# Print the minimum, maximum, and mean for all fields
+print("Minimum values:\n", data.min())
+print("\nMaximum values:\n", data.max())
+print("\nMean values:\n", data.mean())
+# END CELL4
+
+# START CELL5
+# Create a boxplot for the 'Pregnancies' field
+plt.figure(figsize=(8, 6))
+plt.boxplot(data['Pregnancies'], vert=False, patch_artist=True)
+plt.title('Boxplot of Pregnancies')
+plt.xlabel('Number of Pregnancies')
+plt.show()
+# END CELL5
+
+# START CELL6
+# Separate features and target variable
+X = data.drop('Outcome', axis=1)
+y = data['Outcome']
+# END CELL6
+
+# START CELL7
+# Split data into training and testing sets (80% train, 20% test)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+
+scaler = StandardScaler()
+X_train = scaler.fit_transform(X_train)
+X_test = scaler.transform(X_test)
+# END CELL7
+
+# START CELL8
+# Train Random Forest model
+rf_model = RandomForestClassifier(random_state=42)
+rf_model.fit(X_train, y_train)
+rf_pred = rf_model.predict(X_test)
+rf_accuracy = accuracy_score(y_test, rf_pred)
+print(f"Random Forest Accuracy: {rf_accuracy:.2%}")
+# END CELL8
+
+# START CELL9
+# Train SVM model
+svm_model = SVC(random_state=42)
+svm_model.fit(X_train, y_train)
+svm_pred = svm_model.predict(X_test)
+svm_accuracy = accuracy_score(y_test, svm_pred)
+print(f"SVM Accuracy: {svm_accuracy:.2%}")
+# END CELL9
+\`\`\`
+
+Texera UDF conversion:
+\`\`\`json
+{
+    "code": {
+        "UDF1": "# UDF1\nfrom pytexera import *\nimport pandas as pd\nfrom 
typing import Iterator, Optional\n\nclass 
ProcessTableOperator(UDFTableOperator):\n\n    @overrides\n    def 
process_table(self, table: Table, port: int) -> 
Iterator[Optional[TableLike]]:\n        # Remove duplicate rows\n        data = 
table.drop_duplicates()\n\n        # Remove rows with null values\n        data 
= data.dropna()\n\n        # Calculate statistics\n        min_values = 
data.min()\n        max_valu [...]
+        "UDF2": "# UDF2\nfrom pytexera import *\nimport pandas as pd\nimport 
plotly.express as px\nimport plotly.io\nfrom typing import Iterator, 
Optional\n\nclass ProcessTableOperator(UDFTableOperator):\n    def 
render_error(self, error_msg):\n        return '''<h1>Boxplot is not 
available.</h1>\n                  <p>Reason is: {} </p>\n               
'''.format(error_msg)\n\n    @overrides\n    def process_table(self, table: 
Table, port: int) -> Iterator[Optional[TableLike]]:\n         [...]
+        "UDF3": "# UDF3\nfrom pytexera import *\nimport pandas as pd\nfrom 
sklearn.model_selection import train_test_split\nfrom sklearn.preprocessing 
import StandardScaler\nfrom typing import Iterator, Optional\n\nclass 
ProcessTableOperator(UDFTableOperator):\n\n    @overrides\n    def 
process_table(self, table: Table, port: int) -> 
Iterator[Optional[TableLike]]:\n        data = table['data'].iloc[0]\n\n        
# Separate features and target variable\n        X = data.drop('Outcome', ax 
[...]
+        "UDF4": "# UDF4\nfrom pytexera import *\nimport pandas as pd\nfrom 
sklearn.ensemble import RandomForestClassifier\nfrom sklearn.metrics import 
accuracy_score\nfrom typing import Iterator, Optional\n\nclass 
ProcessTableOperator(UDFTableOperator):\n\n    @overrides\n    def 
process_table(self, table: Table, port: int) -> 
Iterator[Optional[TableLike]]:\n        X_train = table['X_train'].iloc[0]\n    
    y_train = table['y_train'].iloc[0]\n        X_test = 
table['X_test'].iloc[0]\n  [...]
+        "UDF5": "# UDF5\nfrom pytexera import *\nimport pandas as pd\nfrom 
sklearn.svm import SVC\nfrom sklearn.metrics import accuracy_score\nfrom typing 
import Iterator, Optional\n\nclass ProcessTableOperator(UDFTableOperator):\n\n  
  @overrides\n    def process_table(self, table: Table, port: int) -> 
Iterator[Optional[TableLike]]:\n        X_train = table['X_train'].iloc[0]\n    
    y_train = table['y_train'].iloc[0]\n        X_test = 
table['X_test'].iloc[0]\n        y_test = table['y [...]
+    },
+    "edges": [
+        ["UDF1", "UDF2"],
+        ["UDF1", "UDF3"],
+        ["UDF3", "UDF4"],
+        ["UDF3", "UDF5"]
+    ],
+    "outputs": {
+        "UDF1": ["min_values", "max_values", "mean_values", "data"],
+        "UDF2": ["html-content"],
+        "UDF3": ["X_train", "X_test", "y_train", "y_test"],
+        "UDF4": ["rf_model", "rf_accuracy", "X_test", "y_test"],
+        "UDF5": ["svm_model", "svm_accuracy", "X_test", "y_test"]
+    }
+}
+\`\`\`
+`;
+
+export const WORKFLOW_PROMPT = `You are an expert in Python coding and 
workflow systems.
+Many users of the Texera system are non-technical, but the notebooks they provide are written by technical people.
+They want to convert their notebooks to Texera workflows.
+Your goal is to help convert these notebooks into a Texera workflow that 
non-technical users can use directly.
+So do not remove or modify any classes or functions, preserve their names and 
structure as they are.
+Ensure that all essential logic remains intact.
+Create multiple Texera UDF codes using the provided Python code.
+Number each UDF, starting at 1 and incrementing, by starting with a comment 
that states that UDF number.
+
+Use the class and function names as shown in ProcessTupleOperator, 
ProcessTableOperator, and ProcessBatchOperator.
+Do not change the class names, function names, or input parameters.
+Use the ones that make sense and split the code meaningfully as instructed.
+
+Use the starter code provided for Python UDFs.
+
+Use the documentation of Table, Tuple, or Batch to work with parameters within 
Texera UDF.
+Do not import other libraries to define these types.
+
+There is no need for an __init__ function. Assume all inputs are valid pandas 
DataFrames,
+so do not use .to_pandas(), .to_dataframe(), etc. Do not load data from a file 
in the first UDF;
+the workflow's source operator supplies the initial data, so assume it is 
already given to you in the
+table parameter. Replacing file-loading code with this input is the one 
exception to preserving all
+original code (see below).
+Ensure proper data flow between functions. Separate operators as if they will 
run in different files.
+
+Current UDF operators can only have one output. Build a dataframe to yield all 
necessary variables
+and data. Ensure proper data flow for each UDF and all information is yielded 
(including training
+and testing data) if subsequent UDFs need them.
+
+Ensure all necessary imports are included in each UDF code block.
+
+Each UDF operator should be in its own Python code block. Do not combine them 
into a single block.
+Ensure import statements cover all used functions and separate them as 
necessary.
+
+It is VERY important that all of the original code in the Jupyter notebook is 
represented in the generated workflow.
+Make sure that nothing in the original is removed and that the semantic 
meaning of what the original code was doing is retained.
+The only exception is data-loading code (e.g. pd.read_csv); it is represented 
by the workflow's input/source operator rather than copied into a UDF.
+If there are user-defined Python classes, include the entire class definition 
in the appropriate UDF(s) that use that class.
+Always include the code that defines the class inside every distinct UDF that constructs an object of that class.
+Python classes are allowed in Texera UDFs and follow the same semantics as 
standard Python.
+They can be defined outside of ProcessTableOperator, ProcessTupleOperator, and 
ProcessBatchOperator.
+
+Return only the JSON formatted response, do not give any explanation.
+Do not wrap the JSON in markdown code fences. Output raw JSON only.
+Make sure the response is a valid JSON structure, including closing all braces 
and not including commas after the last element.
+Follow this JSON format (don't reuse the values, this is just the format). 'code', 'edges', and 'outputs' are each their own top-level key; do not nest any of them inside another, and make sure to close their braces:
+{
+"code": {
+"UDF1": "code for UDF1 goes here",
+"UDF2": "code for UDF2 goes here"
+},
+"edges": [
+["UDF1", "UDF2"]
+],
+"outputs": {
+"UDF1": ["min_values", "max_values", "mean_values", "data"],
+"UDF2": ["html-content"]
+}
+}
+Make sure only the keys in the code section appear in the edges and outputs 
sections. Do not include any extraneous fields.
+Do not include any extraneous UDFs in the code field whose code is an empty string.
+Give ALL of the code, do not omit anything or use placeholders for code. Make 
sure ALL code in the original is translated over.
+The value of each UDF must be a valid JSON string: escape newlines, quotes, 
and backslashes correctly so that the decoded string is runnable Python. Use 
whichever quotes the Python code requires.
+Convert following the instructions and examples given. Here is the code:
+`;
+
+export const MAPPING_PROMPT = `
+Here is an example of a mapping generated between the given example Python 
code and the Texera UDFs using their CELL and UDF IDs. Cell IDs are designated 
by the UUID following '# START'. The format should be kept the same.
+{
+"UDF1": [
+"CELL3",
+"CELL4"
+],
+"UDF2": [
+"CELL5"
+],
+"UDF3": [
+"CELL6",
+"CELL7"
+],
+"UDF4": [
+"CELL8"
+]
+}
+Now create a mapping for the UDFs and the original code. Link the code blocks marked by 'START <cell-uuid>' and 'END <cell-uuid>' with the UDF IDs. The code between them should be equivalent. Multiple cells can be mapped to the same UDF when that UDF implements the logic of those cells. There could be any number of cells and UDFs, so only create the correct number in the mapping. Only give the mapping.
+`;
diff --git a/frontend/yarn.lock b/frontend/yarn.lock
index 694ac59382..36ee0fa3cb 100644
--- a/frontend/yarn.lock
+++ b/frontend/yarn.lock
@@ -30,6 +30,18 @@ __metadata:
   languageName: node
   linkType: hard
 
+"@ai-sdk/openai@npm:2.0.67":
+  version: 2.0.67
+  resolution: "@ai-sdk/openai@npm:2.0.67"
+  dependencies:
+    "@ai-sdk/provider": "npm:2.0.0"
+    "@ai-sdk/provider-utils": "npm:3.0.17"
+  peerDependencies:
+    zod: ^3.25.76 || ^4.1.8
+  checksum: 10c0/7e5c407504d7902c17c816aaccd83f642a3b82012cd8467c8f58aef5f08a49b6c31fff775439d541d40b0c8b5b94cc384f18096d1968e23670e22a56fe82d8bd
+  languageName: node
+  linkType: hard
+
 "@ai-sdk/provider-utils@npm:3.0.17":
   version: 3.0.17
   resolution: "@ai-sdk/provider-utils@npm:3.0.17"
@@ -10617,6 +10629,7 @@ __metadata:
   resolution: "gui@workspace:."
   dependencies:
     "@abacritt/angularx-social-login": "npm:2.3.0"
+    "@ai-sdk/openai": "npm:2.0.67"
     "@ali-hm/angular-tree-component": "npm:12.0.5"
     "@angular-builders/custom-webpack": "npm:21.0.3"
     "@angular-devkit/build-angular": "npm:21.2.8"
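
WORKFLOW_PROMPT above asks the model for raw JSON with top-level `code`, `edges`, and `outputs` keys, where `edges` and `outputs` reference only UDF ids present in `code` and no UDF has empty code. A caller could check a response against those rules roughly as sketched below. `validateWorkflowResponse` and `isRecord` are hypothetical names, not part of the commit, and the commit's own `parseJsonResponse` may well behave differently.

```typescript
// Hypothetical validator (not in the commit) for the JSON shape WORKFLOW_PROMPT requests.
function isRecord(v: unknown): v is Record<string, unknown> {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Returns a list of human-readable problems; an empty list means the shape looks valid.
function validateWorkflowResponse(raw: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return ["response is not valid JSON"];
  }
  if (!isRecord(parsed)) {
    return ["response is not a JSON object"];
  }
  const { code, edges, outputs } = parsed;
  if (!isRecord(code) || !Array.isArray(edges) || !isRecord(outputs)) {
    return ["'code', 'edges', and 'outputs' must all be top-level keys"];
  }
  const problems: string[] = [];
  const ids = new Set(Object.keys(code));
  for (const [id, src] of Object.entries(code)) {
    if (typeof src !== "string" || src.trim() === "") {
      problems.push(`${id} has empty code`);
    }
  }
  for (const edge of edges) {
    if (!Array.isArray(edge) || edge.length !== 2) {
      problems.push("edge is not a [from, to] pair");
      continue;
    }
    for (const id of edge) {
      if (!ids.has(id)) {
        problems.push(`edge references unknown UDF ${String(id)}`);
      }
    }
  }
  for (const id of Object.keys(outputs)) {
    if (!ids.has(id)) {
      problems.push(`outputs references unknown UDF ${id}`);
    }
  }
  return problems;
}
```

Returning a list of problems rather than throwing on the first one would let a caller report every violation at once, or feed them back to the model in a retry.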
