GitHub user kunwp1 added a comment to the discussion: Supporting Directory-Based Dataset Access in UDFs
Following our offline discussion, we've decided to integrate the new operators being developed by @aglinxinyuan with the Python UDF. The proposed workflow is as follows:

- Dataset Selector (Source): This operator accepts a dataset URI (e.g., /ownerEmail/datasetName/versionName) and flattens its structure into a table of file URIs, one string row per file. For instance, a nested structure containing /folder1/file1 and /folder1/folder2/file2 is returned as two separate rows.
- Text to File Scan: This downstream operator resolves each URI into the file's contents. Users can toggle an option to include the original URI as an attribute, resulting in tuples of (file_uri, file_content).
- Python UDF: This operator consumes these tuples, giving users both the raw paths and the data.

We initially considered using `io.BytesIO` to provide a folder-like interface within the UDF. However, an `io.BytesIO` object is a single in-memory byte stream, so it can only mimic one file, not a folder-like file system. The responsibility for reconstructing or mimicking a file-tree structure (e.g., with `tempfile`) will therefore rest with the UDF logic itself.

Please feel free to add your comments or suggestions on this approach.

CC: @aglinxinyuan @chenlica @xuang7

GitHub link: https://github.com/apache/texera/discussions/4352#discussioncomment-16495961
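
To make the last point of the proposal concrete, here is a minimal sketch of how UDF logic might rebuild a file tree from (file_uri, file_content) tuples using `tempfile` and `pathlib`. It is plain Python and deliberately does not use the Texera UDF API. The helper name `materialize_tree`, the sample tuples, and the assumption that `file_content` arrives as `bytes` or `str` are all hypothetical, chosen only for illustration.

```python
import tempfile
from pathlib import Path, PurePosixPath


def materialize_tree(tuples, root):
    """Hypothetical helper: write (file_uri, file_content) pairs under
    `root`, preserving the folder nesting encoded in each URI."""
    root = Path(root)
    written = []
    for file_uri, file_content in tuples:
        # Drop the leading "/" (and any "..") so the URI stays inside `root`.
        parts = [p for p in PurePosixPath(file_uri).parts if p not in ("/", "..")]
        target = root.joinpath(*parts)
        target.parent.mkdir(parents=True, exist_ok=True)
        # Assumption: content is bytes, or text that can be UTF-8 encoded.
        if isinstance(file_content, bytes):
            data = file_content
        else:
            data = str(file_content).encode("utf-8")
        target.write_bytes(data)
        written.append(target)
    return written


# Hypothetical input, shaped like the tuples described in the proposal.
sample = [
    ("/folder1/file1", b"hello"),
    ("/folder1/folder2/file2", b"world"),
]

with tempfile.TemporaryDirectory() as tmp:
    materialize_tree(sample, tmp)
    # Inside this block, ordinary folder-oriented code can operate on `tmp`.
    listing = sorted(
        p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*") if p.is_file()
    )
    print(listing)  # ['folder1/file1', 'folder1/folder2/file2']
```

The temporary directory is deleted when the `with` block exits, so any library that expects a real directory path has to run inside that block. Whether the tuples for a whole dataset can be collected before writing depends on how the UDF receives them, which this sketch leaves open.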
