shyjsarah opened a new pull request, #7902:
URL: https://github.com/apache/paimon/pull/7902
### Purpose
Before this PR, `PaimonVirtualFileSystem` (PVFS) backed every OSS `open()` with
`ossfs`, whose fsspec buffered file flushes each block via OSS `AppendObject`.
When a written result spans more than one block, a later `AppendObject` can fail
with `PositionNotEqualToLength` (409) on the OSS data-acceleration endpoint due
to file-length cache lag.
This PR makes the PVFS OSS backend selectable via `fs.oss.impl` — the same
option `PyArrowFileIO` already honors:
- `jindo` (default): native JindoSDK, writes via `PutObject` / multipart
upload, so `AppendObject` is never used.
- `legacy`: `ossfs` (the previous behavior).
When `fs.oss.impl=jindo` is set but `pyjindosdk` is not installed, PVFS falls
back to `ossfs`, consistent with `PyArrowFileIO`.
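
The selection rule described above can be sketched roughly as follows. This is an illustration only, not the code in this PR: `select_oss_backend` and its `jindo_available` flag are hypothetical names, and raising `ValueError` for an unknown value is an assumption about how invalid input is handled.

```python
from typing import Mapping

OSS_IMPL_KEY = "fs.oss.impl"
VALID_IMPLS = ("jindo", "legacy")


def select_oss_backend(options: Mapping[str, str], jindo_available: bool) -> str:
    """Return "jindo" or "legacy" for the given options (hypothetical helper)."""
    # Default to jindo when the option is not set.
    impl = options.get(OSS_IMPL_KEY, "jindo")
    if impl not in VALID_IMPLS:
        # Assumption: an unknown value is rejected rather than silently ignored.
        raise ValueError(f"Unsupported {OSS_IMPL_KEY}: {impl!r}")
    if impl == "jindo" and not jindo_available:
        # Fallback: JindoSDK is not installed, so use ossfs instead.
        return "legacy"
    return impl
```

Passing availability in as a flag keeps the sketch independent of how the real code detects `pyjindosdk`, and makes each branch easy to stub in a unit test.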
Notes:
- `build_jindo_config` is now shared by `JindoFileSystemHandler` (the
PyArrow FileIO path) and the new `create_jindo_oss_filesystem` (the PVFS path),
so both jindo entry points consume identical OSS credential options.
- `pyjindo`'s fsspec filesystem needs the `oss://` scheme on paths while
`ossfs` needs it stripped, so `_strip_storage_protocol` keeps or strips the
scheme according to the active backend.
- No storage format change and no user-facing API change: `fs.oss.impl`
already exists; this PR only makes PVFS honor it.
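
As a rough illustration of the scheme handling mentioned in the notes, a backend-aware helper might look like the sketch below. The function name, the `backend` string argument, and the exact stripped form (`bucket/key`) are assumptions for illustration; the real `_strip_storage_protocol` may differ.

```python
OSS_SCHEME = "oss://"


def strip_storage_protocol(path: str, backend: str) -> str:
    """Keep oss:// for the jindo backend, strip it for ossfs (illustrative only)."""
    if backend == "jindo":
        # The jindo fsspec filesystem needs the scheme, so leave the path as-is.
        return path
    # ossfs needs the scheme removed, leaving "bucket/key".
    if path.startswith(OSS_SCHEME):
        return path[len(OSS_SCHEME):]
    return path
```

For example, `strip_storage_protocol("oss://bucket/db/t/file.parquet", "legacy")` yields `bucket/db/t/file.parquet`, while the same call with `"jindo"` returns the path unchanged.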
### Tests
- New `pvfs_oss_filesystem_test.py` (CI unit tests; no DLF / OSS /
pyjindosdk required, all stubbed):
  - `fs.oss.impl` dispatch: `jindo` / `legacy` / default / fallback when
  pyjindosdk is absent / invalid value
  - `_get_filesystem` forwards the table's OSS storage path to the backend
  - `_strip_storage_protocol` keeps `oss://` for jindo, strips it for ossfs
  - `_extract_oss_bucket` parsing and error cases
- Verified end-to-end against a real DLF REST catalog with both backends
(`legacy` and `jindo`) on pyjindo 6.10.2: open/read/write (including an 8 MB
multipart write), `info`, `ls`, `cat_file`, `cp_file`, `mv`, `rm`, `get_file`
all succeed on the jindo backend.
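
For readers unfamiliar with the bucket-parsing cases listed in the unit tests, the kind of logic under test could resemble this minimal sketch. It assumes paths of the form `oss://<bucket>/<key>`; the function name and the specific error conditions are hypothetical and not taken from the PR.

```python
from urllib.parse import urlparse


def extract_oss_bucket(location: str) -> str:
    """Return the bucket from an oss://bucket/key location (illustrative only)."""
    parsed = urlparse(location)
    if parsed.scheme != "oss":
        # Assumption: non-OSS locations are treated as an error.
        raise ValueError(f"Not an OSS location: {location!r}")
    if not parsed.netloc:
        # Covers inputs such as "oss:///key" that carry no bucket.
        raise ValueError(f"Missing bucket in OSS location: {location!r}")
    return parsed.netloc
```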