Kontinuation opened a new pull request, #2138: URL: https://github.com/apache/datafusion-comet/pull/2138
## Which issue does this PR close?

Closes #.

## Rationale for this change

The S3 object store support in the native Parquet reader incorrectly url-decodes the path. The path it receives has already been url-decoded, so decoding it again corrupts the original path. Paths without escape sequences are unaffected. If the S3 path contains escape sequences, however, the extra decode changes the path, and we end up either getting an error or silently reading the wrong data.

I found S3 paths containing escape sequences when reading a partitioned table. The partition key contains a '#' character, and the S3 paths for files in the partitioned table look like this:

```
s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet
```

Note that `Brand%2321` is part of the original S3 path, not a url-encoded form of it. The partition key is `Brand#21`; the directory names of partitioned tables are url-encoded by design to support arbitrary character sequences. If we url-decode this path again, the result is `s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet`, which is different from the original path.

## What changes are included in this PR?

This PR fixes the repeated url-decoding of S3 paths. The native Parquet reader can now correctly handle S3 paths containing escape sequences.

## How are these changes tested?

Added a new Scala test that writes a partitioned table whose partition key contains a '#' character and reads it back using Comet.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
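The over-decoding problem described in the rationale can be reproduced in isolation. The following is a minimal sketch in Python, not Comet's actual Rust code: it uses the standard library's `urllib.parse.unquote` as a stand-in for the reader's url-decoding step, and `partition_value` is a hypothetical helper added only for illustration.

```python
from urllib.parse import unquote

# Object key exactly as it is stored in S3. The "%23" is literal text in the
# key, because the partition directory name is escaped when it is written.
stored_key = "path/to/data/p_brand=Brand%2321/part-xxxx.parquet"


def partition_value(key: str) -> str:
    """Hypothetical helper: recover the partition value from the p_brand segment."""
    segment = next(s for s in key.split("/") if s.startswith("p_brand="))
    return unquote(segment.split("=", 1)[1])


# An extra url-decode of the whole key produces a different key, which does
# not name the stored object.
over_decoded_key = unquote(stored_key)

# A key without escape sequences survives the extra decode unchanged, which is
# why the bug only shows up for some paths.
plain_key = "path/to/data/part-xxxx.parquet"

print(over_decoded_key)             # p_brand=Brand#21 instead of Brand%2321
print(over_decoded_key == stored_key)
print(unquote(plain_key) == plain_key)
print(partition_value(stored_key))  # the logical partition value
```

Decoding is appropriate only when extracting the logical partition value from a directory name; the key used to fetch the object has to be passed through untouched.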
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org