tustvold opened a new issue, #8009:
URL: https://github.com/apache/arrow-datafusion/issues/8009

   ### Describe the bug
   
   The behaviour of ListingTableUrl with respect to paths containing percent 
characters is rather confusing, and I suspect not entirely intentional.
   
   Consider a filesystem containing a file named `bar%2Ffoo`: there is actually 
no obvious way to address this file.
   
   ```
   let url = ListingTableUrl::parse("file:///foo/bar%2Ffoo").unwrap();
   assert_eq!(url.prefix.as_ref(), "foo/bar/foo");
   
   let url = ListingTableUrl::parse("file:///foo/a%252Fb.txt").unwrap();
   assert_eq!(url.prefix.as_ref(), "foo/a%252Fb.txt");
   
   let dir = tempdir().unwrap();
   let path = dir.path().join("bar%2Ffoo");
   std::fs::File::create(&path).unwrap();
   
   let url = ListingTableUrl::parse(path.to_str().unwrap()).unwrap();
   assert!(url.prefix.as_ref().ends_with("bar%252Ffoo"), "{}", url.prefix);
   ```
   
   
   
   ### To Reproduce
   
   _No response_
   
   ### Expected behavior
   
   The "correct" behaviour is that a file URL should be URL-encoded. That is, 
according to the URL specification, the correct way to reference this path would 
be `file:///foo/a%252Fb.txt`. Similarly, the non-URL version should be 
`foo/a%2Fb.txt`.
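   
   To make the encoding relationship concrete, here is a minimal std-only
   sketch of percent-encoding and decoding a single path segment. The helper
   names `encode_segment` and `decode_segment` are hypothetical and are not
   part of DataFusion or any crate; the sketch only illustrates why a literal
   `%` becomes `%25` under URL encoding.
   
   ```rust
   /// Percent-encode one path segment (hypothetical helper, illustration only).
   /// RFC 3986 unreserved characters are kept; every other byte, including
   /// '%' and '/', is escaped as %XX.
   fn encode_segment(segment: &str) -> String {
       let mut out = String::with_capacity(segment.len());
       for &b in segment.as_bytes() {
           match b {
               b'A'..=b'Z' | b'a'..=b'z' | b'0'..=b'9' | b'-' | b'.' | b'_' | b'~' => {
                   out.push(b as char)
               }
               _ => out.push_str(&format!("%{:02X}", b)),
           }
       }
       out
   }
   
   /// Decode %XX escapes in one segment (hypothetical helper, illustration only).
   /// Returns None for a truncated or non-hex escape, or for invalid UTF-8.
   fn decode_segment(segment: &str) -> Option<String> {
       let bytes = segment.as_bytes();
       let mut out = Vec::with_capacity(bytes.len());
       let mut i = 0;
       while i < bytes.len() {
           if bytes[i] == b'%' {
               // The two characters after '%' must both be hex digits.
               let hex = segment.get(i + 1..i + 3)?;
               if !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
                   return None;
               }
               out.push(u8::from_str_radix(hex, 16).ok()?);
               i += 3;
           } else {
               out.push(bytes[i]);
               i += 1;
           }
       }
       String::from_utf8(out).ok()
   }
   ```
   
   Under this sketch, `encode_segment("a%2Fb.txt")` yields `a%252Fb.txt`, and
   decoding that string gives back `a%2Fb.txt`, whereas decoding `bar%2Ffoo`
   directly yields `bar/foo`.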
   
   That being said various tools instead interpret the URL path verbatim:
   
   ```
   $ touch 'a%2Fb.txt'
   
   $ aws --endpoint-url=http://localhost:4566 s3 cp 'a%2Fb.txt' s3://tustvold/
   
   $ aws --endpoint-url=http://localhost:4566 s3 ls s3://tustvold/
   2023-10-31 15:40:13          0 a%2Fb.txt
   
   $ aws --endpoint-url=http://localhost:4566 s3 cp 's3://tustvold/a%2Fb.txt' 
foo.txt
   aws --endpoint-url=http://localhost:4566 s3 cp 's3://tustvold/a%2Fb.txt' 
foo.txt
   
   $ gsutil cp a\%2Fb.txt gs://tustvold
   
   $ gsutil cp gs://tustvold/a\%2Fb.txt test
   ```
   
   I'm not entirely sure how to classify DataFusion's current behaviour other 
than confusing. I think we should probably strive to replicate the behaviour of 
tools like the aws-cli and gsutil, which interpret the path verbatim.
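   
   As a rough sketch of what "verbatim" could mean here, the following
   std-only function splits a URL-like string into scheme, authority and path
   without performing any percent-decoding. The name `split_verbatim` is
   hypothetical; this is neither DataFusion's implementation nor the actual
   logic of the aws-cli or gsutil, only an illustration of the idea.
   
   ```rust
   /// Split `scheme://authority/path` without decoding anything
   /// (hypothetical helper, illustration only).
   /// Returns None when the input contains no "://" separator.
   fn split_verbatim(url: &str) -> Option<(&str, &str, &str)> {
       let (scheme, rest) = url.split_once("://")?;
       // Everything after the first '/' is kept exactly as written,
       // so "%2F" stays "%2F" rather than becoming '/'.
       let (authority, path) = rest.split_once('/').unwrap_or((rest, ""));
       Some((scheme, authority, path))
   }
   ```
   
   With this interpretation, `s3://tustvold/a%2Fb.txt` maps to the bucket
   `tustvold` and the key `a%2Fb.txt`, and `file:///foo/bar%2Ffoo` maps to the
   path `foo/bar%2Ffoo`.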
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
