[PR] [Hackathon] feat: add LLM file source [texera]

via GitHub Sat, 16 May 2026 10:36:22 -0700


tanishqgandhi1908 opened a new pull request, #5105:
URL: https://github.com/apache/texera/pull/5105


   ### What changes were proposed in this PR?
   
   This PR adds an `LLM File Source` operator that turns irregular files into 
usable no-code workflow inputs.
   
   **Story / motivation**
   
   `Smart Source` works well when a file already has a known machine-readable 
structure. But many real datasets arrive as PDFs, vendor reports, and 
semi-structured exports where the user can see the tables, yet Texera has no 
ready-made parser to begin with.
   
   The goal here is to keep that user inside the workflow canvas instead of 
sending them out to write preprocessing code first.
   
   | User task | Before | After |
   | --- | --- | --- |
   | Start from an irregular PDF/report | Manually extract tables or hand-write a parser outside the workflow | Select the file/folder and let Texera generate a parser tailored to that input |
   | Reuse the extracted data | Build custom branching logic by hand | See detected logical tables and create `Filter + Projection` branches directly from the property panel |
   | Process repeated reports | Handle each report separately | Point the same source at a folder of similarly structured reports and parse them together |
   
   **Main changes**
   
   1. Add `LLM File Source`, a new source operator for irregular files such as 
PDFs, reports, and similarly structured folders.
   2. Add a generation flow that samples the input, asks the LLM for logical 
table definitions plus parser code, validates the generated code with a syntax 
check and a dry run, and retries with a repair attempt when validation fails.
   3. Represent multi-table outputs through a `__table__` discriminator and a 
dense union schema so downstream operators can branch safely by detected table.
   4. Add frontend support to show detected tables and create `Filter + 
Projection` branches from the operator property panel.
   5. Extend folder handling so folder-backed datasets are materialized locally 
for parser execution and can combine repeated reports into one workflow source.
   6. Capture Python worker stdout/stderr so generated parser failures are 
easier to diagnose from workflow execution.
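
The generation flow in item 2 can be illustrated with a minimal sketch. It is not the code in this PR: the function names, the `parse(text)` entry point, the `ask_llm` callback, and the retry limit are all assumptions made for illustration. It only shows the shape of a "syntax check, dry run, then repair" loop.

```python
import ast
from typing import Callable, Optional

# Hypothetical callback: (sampled input, previous error or None) -> parser source.
AskLLM = Callable[[str, Optional[str]], str]


def validate_parser(source: str, sample_text: str) -> Optional[str]:
    """Return None if the generated parser passes both checks, else an error message."""
    # Syntax check: parse the source without executing it.
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"syntax error: {exc}"

    # Dry run: execute the code and call its parse() on the sampled input.
    # Assumption: a real system would run untrusted code in an isolated worker.
    namespace: dict = {}
    try:
        exec(source, namespace)
        rows = namespace["parse"](sample_text)
    except Exception as exc:
        return f"dry run failed: {exc!r}"
    if not isinstance(rows, list):
        return "dry run failed: parse() did not return a list of rows"
    return None


def generate_with_repair(ask_llm: AskLLM, sample_text: str, max_attempts: int = 3) -> str:
    """Ask for parser code, feeding the last validation error back until it passes."""
    error: Optional[str] = None
    for _ in range(max_attempts):
        source = ask_llm(sample_text, error)
        error = validate_parser(source, sample_text)
        if error is None:
            return source
    raise RuntimeError(f"no valid parser after {max_attempts} attempts: {error}")


def fake_llm(sample_text: str, previous_error: Optional[str]) -> str:
    """Stand-in for the LLM: broken code first, a working parser once it sees an error."""
    if previous_error is None:
        return "def parse(text):\n    return text.split(\n"
    return "def parse(text):\n    return [{'line': l} for l in text.splitlines()]\n"
```

With the stand-in above, `generate_with_repair(fake_llm, "a\nb")` rejects the first response at the syntax check and accepts the second after the dry run.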
   
   ### Any related issues, documentation, discussions?
   
   - Related to #5059.
   - This PR is stacked on #5094 because it builds on the Smart Source 
folder-input groundwork from that PR.
   
   ### How was this PR tested?
   
   ```bash
   # /usr/libexec/java_home is macOS-specific; here it selects the JDK 17 install.
   JAVA_HOME=$(/usr/libexec/java_home -v 17) sbt "testOnly org.apache.texera.web.resource.LLMSourceResourceSpec org.apache.texera.amber.operator.source.scan.FolderInputResolverSpec org.apache.texera.amber.operator.source.llm.LLMFileSourceOpDescSpec"
   ```
   
   Manual verification:
   
   1. Ran a single PDF through `LLM File Source`; it produced 17 source rows 
and split them into 12 `revenue_by_region` rows and 5 `headcount_by_department` 
rows.
   2. Ran a folder with two similarly structured PDF reports; it produced 34 
source rows and split them into 24 `revenue_by_region` rows and 10 
`headcount_by_department` rows.
   3. Verified the property panel shows detected tables and that the `Filter + 
Projection` action creates the expected downstream branches.
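
The per-table split checked above relies on the `__table__` discriminator. The sketch below shows one way such a layout could work, under assumptions: the union schema is taken to be the discriminator plus every table's columns, with `None` where a column does not apply, and the column names (`region`, `revenue`, `department`, `headcount`) and helper functions are hypothetical. Only the table names come from the verification runs.

```python
from typing import Any

TABLE_KEY = "__table__"  # discriminator column named in this PR

Row = dict[str, Any]


def union_columns(tables: dict[str, list[str]]) -> list[str]:
    """Discriminator first, then every table's columns in first-seen order."""
    columns = [TABLE_KEY]
    for table_columns in tables.values():
        for column in table_columns:
            if column not in columns:
                columns.append(column)
    return columns


def to_union_row(table: str, record: Row, columns: list[str]) -> Row:
    """Widen one table's record to the union schema; absent columns become None."""
    row = {column: record.get(column) for column in columns}
    row[TABLE_KEY] = table
    return row


def branch(rows: list[Row], table: str, keep: list[str]) -> list[Row]:
    """Filter on the discriminator, then project down to that table's columns."""
    return [
        {column: row[column] for column in keep}
        for row in rows
        if row[TABLE_KEY] == table
    ]


# Hypothetical column layouts for the two detected tables.
TABLES = {
    "revenue_by_region": ["region", "revenue"],
    "headcount_by_department": ["department", "headcount"],
}
```

Each `Filter + Projection` branch then corresponds to one `branch(...)` call: the filter keeps rows whose `__table__` matches, and the projection drops the columns that belong to other tables.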
   


