tanishqgandhi1908 opened a new pull request, #5105: URL: https://github.com/apache/texera/pull/5105
### What changes were proposed in this PR?

This PR adds an `LLM File Source` operator that turns irregular files into usable no-code workflow inputs.

**Story / motivation**

`Smart Source` works well when a file already has a known machine-readable structure. But many real datasets arrive as PDFs, vendor reports, and semi-structured exports where the user can see the tables, yet Texera has no ready-made parser to begin with. The goal here is to keep that user inside the workflow canvas instead of sending them out to write preprocessing code first.

| User task | Before | After |
| --- | --- | --- |
| Start from an irregular PDF/report | Manually extract tables or hand-write a parser outside the workflow | Select the file/folder and let Texera generate a parser tailored to that input |
| Reuse the extracted data | Build custom branching logic by hand | See detected logical tables and create `Filter + Projection` branches directly from the property panel |
| Process repeated reports | Handle each report separately | Point the same source at a folder of similarly structured reports and parse them together |

**Main changes**

1. Add `LLM File Source`, a new source operator for irregular files such as PDFs, reports, and similarly structured folders.
2. Add a generation flow that samples the input, asks the LLM for logical table definitions plus parser code, validates the generated code with syntax checks and a dry run, and retries repairs when needed.
3. Represent multi-table outputs through a `__table__` discriminator and a dense union schema so downstream operators can branch safely by detected table.
4. Add frontend support to show detected tables and create `Filter + Projection` branches from the operator property panel.
5. Extend folder handling so folder-backed datasets are materialized locally for parser execution and can combine repeated reports into one workflow source.
6. Capture Python worker stdout/stderr so generated parser failures are easier to diagnose from workflow execution.

### Any related issues, documentation, discussions?

- Related to #5059.
- This PR is stacked on #5094 because it builds on the Smart Source folder-input groundwork from that PR.

### How was this PR tested?

```bash
JAVA_HOME=$(/usr/libexec/java_home -v 17) sbt "testOnly org.apache.texera.web.resource.LLMSourceResourceSpec org.apache.texera.amber.operator.source.scan.FolderInputResolverSpec org.apache.texera.amber.operator.source.llm.LLMFileSourceOpDescSpec"
```

Manual verification:

1. Ran a single PDF through `LLM File Source`; it produced 17 source rows and split them into 12 `revenue_by_region` rows and 5 `headcount_by_department` rows.
2. Ran a folder with two similarly structured PDF reports; it produced 34 source rows and split them into 24 `revenue_by_region` rows and 10 `headcount_by_department` rows.
3. Verified the property panel shows detected tables and that the `Filter + Projection` action creates the expected downstream branches.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
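
For readers unfamiliar with the validate-and-repair flow listed under the main changes (syntax check, dry run, retry), here is a minimal Python sketch of one way such a loop could look. It is not the code in this PR: `validate_parser`, `generate_with_repair`, the `generate` callback standing in for the LLM call, and the `parse(text)` entry point are all hypothetical names chosen for illustration, and the sketch does not sandbox the generated code.

```python
from typing import Callable, Optional


def validate_parser(source: str, sample_text: str) -> Optional[str]:
    """Return None if the generated parser passes both checks, else an error message."""
    try:
        # Syntax check: compile without running anything.
        code = compile(source, "<generated_parser>", "exec")
    except SyntaxError as exc:
        return f"syntax error: {exc}"

    namespace: dict = {}
    try:
        # Dry run: define the (hypothetical) parse() entry point and call it on the sample.
        exec(code, namespace)
        rows = namespace["parse"](sample_text)
    except Exception as exc:
        return f"dry run failed: {exc!r}"

    if not isinstance(rows, list):
        return "dry run failed: parse() did not return a list"
    return None


def generate_with_repair(
    generate: Callable[[Optional[str], Optional[str]], str],
    sample_text: str,
    max_attempts: int = 3,
) -> str:
    """Ask for parser code, feeding the previous attempt and its error back on failure."""
    source: Optional[str] = None
    error: Optional[str] = None
    for _ in range(max_attempts):
        # generate() is an assumed stand-in for the LLM request.
        source = generate(source, error)
        error = validate_parser(source, sample_text)
        if error is None:
            return source
    raise RuntimeError(f"no valid parser after {max_attempts} attempts: {error}")
```

Passing the previous source and its error message back into `generate` is what lets a repair attempt differ from a blind retry; the attempt cap keeps a persistently failing input from looping forever.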

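
The `__table__` discriminator and dense union schema from the main changes can be illustrated with a short Python sketch. It assumes "dense union" means every output row carries every column from every detected table, with `None` where a column does not apply; that reading, the helper names, and the column names (`region`, `revenue`, `department`, `headcount`) are assumptions for illustration, while the two table names come from the manual verification above.

```python
from typing import Any

TABLE_COLUMN = "__table__"  # discriminator column name from the PR description

Row = dict[str, Any]


def union_schema(tables: dict[str, list[Row]]) -> list[str]:
    """Collect the discriminator plus every column seen in any table, in first-seen order."""
    columns = [TABLE_COLUMN]
    for rows in tables.values():
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    return columns


def to_union_rows(tables: dict[str, list[Row]]) -> list[Row]:
    """Flatten several logical tables into one stream of rows sharing one schema."""
    columns = union_schema(tables)
    out = []
    for table_name, rows in tables.items():
        for row in rows:
            # Dense: every column is present; missing ones become None.
            dense = {name: row.get(name) for name in columns}
            dense[TABLE_COLUMN] = table_name
            out.append(dense)
    return out


def select_table(rows: list[Row], table_name: str, keep: list[str]) -> list[Row]:
    """Filter on the discriminator, then project to the table's own columns."""
    return [
        {name: row[name] for name in keep}
        for row in rows
        if row[TABLE_COLUMN] == table_name
    ]


# Hypothetical parser output with two detected tables.
parsed = {
    "revenue_by_region": [
        {"region": "EMEA", "revenue": 120.5},
        {"region": "APAC", "revenue": 98.0},
    ],
    "headcount_by_department": [{"department": "Sales", "headcount": 42}],
}
rows = to_union_rows(parsed)
revenue = select_table(rows, "revenue_by_region", ["region", "revenue"])
```

Here `select_table` plays the role of one `Filter + Projection` branch: because all rows share one schema, a downstream branch only needs an equality test on `__table__` and a list of columns to keep.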