Kontinuation opened a new pull request, #2138:
URL: https://github.com/apache/datafusion-comet/pull/2138

   ## Which issue does this PR close?
   
   Closes #.
   
   ## Rationale for this change
   
   The S3 object store support for the native parquet reader incorrectly 
url-decodes the path. The path has already been url-decoded by the time it 
reaches the reader, so decoding it again corrupts it. A path without escape 
sequences is unaffected. However, if the S3 path contains escape sequences, 
the extra decode changes the path and we end up either getting an error or 
silently reading the wrong data.
   
   I found S3 paths containing escape sequences when reading a partitioned 
table. The partition key contains a '#' character and the S3 paths for files in 
the partitioned table are something like this:
   
   ```
   s3://bucket_name/path/to/data/p_brand=Brand%2321/part-xxxx.parquet
   ```
   
   Note that `Brand%2321` is part of the original S3 path, not a url-encoded 
form of it. The partition key is `Brand#21`; the directory names of partitioned 
tables are url-encoded by design so that they can hold arbitrary character 
sequences.
   
   If we url-decode this path again, the resulting path is 
`s3://bucket_name/path/to/data/p_brand=Brand#21/part-xxxx.parquet`, which is 
different from the original path.
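
The effect can be reproduced outside Comet. The snippet below is only an illustrative sketch in Python, not the code touched by this PR: `urllib.parse.quote` stands in for the partition-directory escaping done by the writer (an assumption, since the writer uses its own escaping routine), and the key names are hypothetical ones modeled on the example above.

```python
from urllib.parse import quote, unquote

# Hypothetical partition value, modeled on the example above.
partition_value = "Brand#21"

# Partition directory names escape special characters, so the object key
# stored in S3 literally contains "%23". quote() is a stand-in for the
# writer's escaping routine (assumption for illustration only).
dir_name = "p_brand=" + quote(partition_value, safe="")
original_key = f"path/to/data/{dir_name}/part-xxxx.parquet"

# Decoding a key that is already in its final form rewrites "%23" to "#",
# yielding a key that differs from the one that was actually written.
decoded_again = unquote(original_key)

# A key without escape sequences is unchanged by the extra decode, which is
# why the problem only shows up for some paths.
plain_key = "path/to/data/p_brand=Brand21/part-xxxx.parquet"

print(original_key)
print(decoded_again)
print(unquote(plain_key) == plain_key)
```

The first printed key keeps `%23`, while the second contains a literal `#` and therefore names a different object.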
   
   ## What changes are included in this PR?
   
   This PR removes the repeated url-decoding of S3 paths. The native parquet 
reader can now correctly handle S3 paths containing escape sequences.
   
   ## How are these changes tested?
   
   Added a new Scala test which writes a partitioned table whose partition key 
contains a '#' character and reads it back using Comet.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

