eugenegujing opened a new issue, #5634:
URL: https://github.com/apache/texera/issues/5634
### Task Summary
## Background
Sub-task of #5242 (external data source import and export). Per the
discussion in #4240, the import direction is handled as a separate effort from
the Google Drive export work (#5250 / #5251 / #5252 by @Sentiaus).
This task adds the **import** direction for the first provider, Google
Drive, along with a minimal provider abstraction so additional providers
(Dropbox — see sibling sub-issue; Box later) can plug in, as suggested by
@xuang7.
## Design principles (following @aicam's review decisions on the export flow in #4240)
1. **No token persistence** — the user authorizes each import; the frontend
obtains a one-time OAuth access token, passes it to the backend with the import
request, and the backend discards it after the transfer. Nothing is ever stored
in the DB.
2. **Backend streaming** — the backend streams the file directly from the
Google Drive API into dataset storage (LakeFS/S3), reusing the existing
multipart upload pipeline in `DatasetResource`. The file never round-trips
through the browser.
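
The streaming principle can be sketched as follows. This is only an illustration under stated assumptions, not the existing `DatasetResource` pipeline: the class name `PartAccumulator` and the part size are hypothetical, and the real backend is not written in TypeScript. The sketch regroups incoming download chunks into fixed-size parts so that only one partially filled part is buffered at a time, instead of the whole file.

```typescript
// Hypothetical helper: regroups arbitrary-sized download chunks into
// fixed-size parts suitable for a multipart upload.
class PartAccumulator {
  private readonly partSize: number;
  private buffer: Uint8Array;
  private filled = 0;

  constructor(partSize: number) {
    if (partSize <= 0) throw new Error("partSize must be positive");
    this.partSize = partSize;
    this.buffer = new Uint8Array(partSize);
  }

  // Accepts one chunk and returns every part that became full as a result.
  push(chunk: Uint8Array): Uint8Array[] {
    const parts: Uint8Array[] = [];
    let offset = 0;
    while (offset < chunk.length) {
      const n = Math.min(this.partSize - this.filled, chunk.length - offset);
      this.buffer.set(chunk.subarray(offset, offset + n), this.filled);
      this.filled += n;
      offset += n;
      if (this.filled === this.partSize) {
        parts.push(this.buffer);
        // Start a fresh buffer; the full one now belongs to the caller.
        this.buffer = new Uint8Array(this.partSize);
        this.filled = 0;
      }
    }
    return parts;
  }

  // Returns the final, possibly short, part, or null if nothing is pending.
  flush(): Uint8Array | null {
    if (this.filled === 0) return null;
    const last = this.buffer.slice(0, this.filled);
    this.filled = 0;
    return last;
  }
}
```

A caller would feed each chunk read from the provider's download stream into `push`, upload the returned parts immediately, and call `flush` once the stream ends.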
## Scope choice: Google Picker + `drive.file` only
The frontend uses the official **Google Picker** for file selection,
requesting only the **`drive.file`** scope:
- `drive.file` is a **non-sensitive** scope: deployments need **no Google
restricted-scope security verification**, whether the OAuth app's publishing
status is Testing or Production.
- Google enforces at the permission layer that the app can only access files
the user explicitly picked in the Picker — the app never lists or sees the rest
of the user's Drive.
- The consent screen reduces to a single grant ("access only the specific
files you use with this app"), which also avoids partial-consent failure modes
(e.g., an in-app file browser turning up empty when a user denies the broad
scope).
The alternative — rendering a Drive file tree inside Texera — would require
the restricted `drive.readonly` scope ("see and download all your Google Drive
files") and a weeks-long Google security review per public deployment, for a UX
that is effectively identical in the single-file import case. The provider
interface still exposes a `listFiles` capability so in-app browsing can be
added later if the community accepts that cost (the Dropbox provider will use
it, since Dropbox has no comparable review process).
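
One possible shape for such a provider interface is sketched below, purely as an illustration. Only the name `CloudStorageImportProvider`, the `listFiles` capability, and the Drive path `/drive/v3/files/{fileId}?alt=media` come from this proposal; the field names, the `supportsListing` flag, the `google-drive` identifier, and the `DownloadRequest` type are assumptions, and the actual backend would define this in its own language. The host `https://www.googleapis.com` is the standard Drive API v3 endpoint.

```typescript
interface CloudFile {
  id: string;
  name: string;
  mimeType: string;
}

// Hypothetical description of the HTTP request that opens the download stream.
interface DownloadRequest {
  url: string;
  headers: Record<string, string>;
}

interface CloudStorageImportProvider {
  readonly id: string;
  // Lets callers check whether in-app browsing is available for this provider.
  readonly supportsListing: boolean;
  listFiles(accessToken: string, folderId?: string): Promise<CloudFile[]>;
  downloadRequest(accessToken: string, fileId: string): DownloadRequest;
}

class GoogleDriveImportProvider implements CloudStorageImportProvider {
  readonly id = "google-drive";
  // Assumption: the drive.file-only MVP relies on the Picker, so listing is off.
  readonly supportsListing = false;

  listFiles(): Promise<CloudFile[]> {
    return Promise.reject(
      new Error("in-app listing is not part of the drive.file-only MVP")
    );
  }

  downloadRequest(accessToken: string, fileId: string): DownloadRequest {
    return {
      url: `https://www.googleapis.com/drive/v3/files/${encodeURIComponent(fileId)}?alt=media`,
      // The one-time token is used for this request only and never stored.
      headers: { Authorization: `Bearer ${accessToken}` },
    };
  }
}
```

A Dropbox provider could implement the same interface with `supportsListing` set to `true`, which would let the frontend decide between a picker flow and an in-app browser without provider-specific branches.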
## Proposed changes
- **Backend (file-service):**
  - A small provider interface (e.g. `CloudStorageImportProvider`: list
    files / open download stream) with a Google Drive implementation (`GET
    /drive/v3/files/{fileId}?alt=media` streaming download using the one-time
    bearer token).
  - A new endpoint on `DatasetResource`, e.g. `POST
    /dataset/{did}/import-from-cloud`, taking `{provider, accessToken, fileId,
    fileName}`; streams the file into the dataset's LakeFS repo and stages/commits
    it like a normal upload, with the same dataset write-permission checks as the
    existing upload endpoint.
  - Provider configuration (OAuth client ID, Picker API key) via env vars,
    mirroring how Google login is configured today
    (`UserSystemConfig.googleClientId`).
- **Frontend:**
  - An "Import from cloud" entry next to the existing file uploader on the
    dataset page.
  - One-time authorization via the Google Identity Services token client
    (`drive.file` scope), then the Google Picker for selection, restricted with
    `setSelectableMimeTypes` to formats Texera datasets accept (csv, json, parquet,
    text, etc.), so unsupported types (videos, native Google Docs) are filtered out
    at selection time instead of failing after import.
  - A simple in-progress indicator while the backend streams the file into
    storage.
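
To make the proposed request body concrete, here is a minimal validation sketch. The four field names come from the endpoint description above; everything else (the function names, the `google-drive` provider identifier, the error messages) is hypothetical, and the real endpoint would live in the backend's own language. The logging helper illustrates one way to keep the one-time token out of logs, in line with the no-persistence principle.

```typescript
interface CloudImportRequest {
  provider: string;
  accessToken: string;
  fileId: string;
  fileName: string;
}

// Assumption: provider identifiers are plain strings; only Drive exists in the MVP.
const KNOWN_PROVIDERS: string[] = ["google-drive"];

function parseCloudImportRequest(body: unknown): CloudImportRequest {
  if (typeof body !== "object" || body === null) {
    throw new Error("request body must be a JSON object");
  }
  const fields = body as Record<string, unknown>;
  // Reads one required field and rejects missing, non-string, or blank values.
  const read = (key: string): string => {
    const value = fields[key];
    if (typeof value !== "string" || value.trim() === "") {
      throw new Error(`missing or empty field: ${key}`);
    }
    return value;
  };
  const request: CloudImportRequest = {
    provider: read("provider"),
    accessToken: read("accessToken"),
    fileId: read("fileId"),
    fileName: read("fileName"),
  };
  if (KNOWN_PROVIDERS.indexOf(request.provider) === -1) {
    throw new Error(`unsupported provider: ${request.provider}`);
  }
  return request;
}

// Summary that is safe to log: it deliberately omits the access token.
function describeForLog(request: CloudImportRequest): string {
  return `provider=${request.provider} fileId=${request.fileId} fileName=${request.fileName}`;
}
```

Validation would run before the dataset write-permission check and before any call to the provider, so a malformed request never triggers an outbound download.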
## Out of scope (future iterations)
- **Folder / bulk import** — folder-level access would require the
restricted `drive.readonly` scope and Google's security verification
(weeks-to-months review); deferred until the need is validated.
- **In-app Drive file browsing** — same `drive.readonly` requirement; see
scope-choice section above.
- **Background/async job handling for very large files** (the 100 GB–1 TB
scale discussed in #4240); the MVP imports synchronously within the request.
- **Native Google Docs/Sheets/Slides export conversion** — only
binary/regular files are imported in the MVP.
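
The MIME-type restriction for the MVP could be expressed roughly as below. This is a sketch under assumptions: the accepted list is illustrative and not Texera's authoritative set of formats, and the function names are hypothetical. It relies on two facts: native Google Workspace files carry MIME types prefixed with `application/vnd.google-apps.`, and the Picker's `setSelectableMimeTypes` takes a comma-separated string.

```typescript
// Illustrative subset only; the real list should match what Texera datasets accept.
const ACCEPTED_MIME_TYPES: string[] = ["text/csv", "application/json", "text/plain"];

const GOOGLE_NATIVE_PREFIX = "application/vnd.google-apps.";

// Native Docs/Sheets/Slides have no binary content to download with alt=media.
function isNativeGoogleType(mimeType: string): boolean {
  return mimeType.indexOf(GOOGLE_NATIVE_PREFIX) === 0;
}

// Check that the backend could repeat, since the Picker filter is client-side only.
function isImportable(mimeType: string): boolean {
  return !isNativeGoogleType(mimeType) && ACCEPTED_MIME_TYPES.indexOf(mimeType) !== -1;
}

// Value to hand to the Picker builder's setSelectableMimeTypes call.
function selectableMimeTypes(): string {
  return ACCEPTED_MIME_TYPES.join(",");
}
```

Repeating the check on the backend keeps the import safe even if a client bypasses the Picker and posts a file ID directly.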
### Task Type
- [ ] Refactor / Cleanup
- [ ] DevOps / Deployment / CI
- [ ] Testing / QA
- [ ] Documentation
- [ ] Performance
- [x] Other