kaka11chen opened a new pull request, #33858:
URL: https://github.com/apache/doris/pull/33858

   ## Proposed changes
   
   ### Issue:
   Many domestic cloud vendors are compatible with the S3 protocol. However, 
early versions of the S3 client generate only path-style HTTP requests 
(https://github.com/aws/aws-sdk-java-v2/pull/763) for endpoints 
that do not start with `s3`, while some cloud vendors support only 
virtual-host-style HTTP requests.
   
   Therefore, Doris used `forceVirtualHosted` in `S3URI` to split the location 
into virtual-hosted parts while still sending the request through path style.
   For example, the S3 URI `s3://my-bucket/data/file.txt` is eventually parsed into:
   - virtualBucket: my-bucket
   - Bucket: data (the bucket must be set, otherwise the S3 client reports an 
error; this step is particularly tricky because of the limitations of the 
S3 client)
   - Key: file.txt
   
   Path-style mode is then used to generate an HTTP request that looks like a 
virtual-host request: the endpoint is set to virtualBucket + original endpoint, 
and the bucket and key are set as above.
   **The bucket and key here no longer match the original S3 concepts; the 
AWS client just happens to generate a virtual-host-like HTTP request through 
path-style mode.**
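
   The legacy split described above can be sketched roughly as follows. This is 
only an illustration, not the actual Doris `S3URI` code: the class name, the 
method names, the endpoint `oss.example.com`, and the `.` used to join 
virtualBucket and endpoint are all assumptions made for the example.

```java
import java.net.URI;

// Hypothetical sketch of the legacy "virtual bucket" split; not the real Doris code.
final class LegacyVirtualBucketSketch {
    final String virtualBucket; // the real bucket, later prepended to the endpoint
    final String bucket;        // first path segment, handed to the client as the "bucket"
    final String key;           // remainder of the path

    LegacyVirtualBucketSketch(String location) {
        URI uri = URI.create(location);
        String rawPath = uri.getRawPath();
        if (uri.getAuthority() == null || rawPath == null || rawPath.length() < 2) {
            throw new IllegalArgumentException("expected s3://bucket/dir/key: " + location);
        }
        // Drop the leading '/' and split once: "data/file.txt" -> ["data", "file.txt"].
        String[] parts = rawPath.substring(1).split("/", 2);
        if (parts.length < 2 || parts[0].isEmpty() || parts[1].isEmpty()) {
            throw new IllegalArgumentException("need at least two path segments: " + location);
        }
        this.virtualBucket = uri.getAuthority();
        this.bucket = parts[0];
        this.key = parts[1];
    }

    // Endpoint given to the path-style client (the "." separator is an assumption).
    String effectiveEndpoint(String endpointHost) {
        return virtualBucket + "." + endpointHost;
    }

    // URL a path-style client would build from (endpoint, bucket, key).
    String requestUrl(String endpointHost) {
        return "https://" + effectiveEndpoint(endpointHost) + "/" + bucket + "/" + key;
    }

    public static void main(String[] args) {
        LegacyVirtualBucketSketch s = new LegacyVirtualBucketSketch("s3://my-bucket/data/file.txt");
        System.out.println(s.requestUrl("oss.example.com"));
    }
}
```

   The resulting URL has the same shape as a virtual-host request even though 
the client believes the bucket is `data`, which is the mismatch the bold 
paragraph points out.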
   
   However, #30799 upgraded the AWS SDK from 2.17.257 to 2.20.131. The 
current AWS S3 client already generates virtual-host-style HTTP requests for 
third-party endpoints by default, so #31111 had to set the path-style option 
to let the S3 client keep working with Doris' virtual bucket mechanism.
   
   **In short, the virtual bucket mechanism is confusing and tricky, and the 
new version of the S3 client no longer needs it.**
   
   ### Resolution
   
   Rewrite `S3URI` to remove the tricky virtual bucket mechanism and support 
different URI styles through flags.
   
   This class represents a fully qualified location in S3 for input/output 
operations, expressed as a URI.
    #### Common URI styles for AWS S3:
     - AWS Client Style(Hadoop S3 Style): 
`s3://my-bucket/path/to/file?versionId=abc123&partNumber=77&partNumber=88`
     - Virtual Host Style: 
`https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
     - Path Style: 
`https://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
    
     For the common styles above, <code>isPathStyle</code> controls whether a 
location is parsed as path style or virtual host style.
     Virtual host style is currently the mainstream, recommended approach, so 
<code>isPathStyle</code> defaults to false.
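
     As a rough illustration of how one flag can cover the three common styles, 
here is a minimal sketch. It is not the `S3URI` implementation from this PR: 
the class `S3LocationSketch` and its `parse` method are hypothetical, and the 
virtual host branch assumes the bucket name itself contains no dots.

```java
import java.net.URI;

// Hypothetical sketch: derive bucket and key for the three common URI styles.
final class S3LocationSketch {
    final String bucket;
    final String key;

    private S3LocationSketch(String bucket, String key) {
        this.bucket = bucket;
        this.key = key;
    }

    static S3LocationSketch parse(String location, boolean isPathStyle) {
        URI uri = URI.create(location);
        String authority = uri.getAuthority();
        if (authority == null) {
            throw new IllegalArgumentException("missing authority: " + location);
        }
        String rawPath = uri.getRawPath() == null ? "" : uri.getRawPath();
        String path = rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;

        // AWS client (Hadoop S3) style: s3://bucket/key
        if ("s3".equalsIgnoreCase(uri.getScheme())) {
            return new S3LocationSketch(authority, path);
        }
        // Path style: https://endpoint/bucket/key
        if (isPathStyle) {
            int slash = path.indexOf('/');
            if (slash <= 0) {
                throw new IllegalArgumentException("path style needs /bucket/key: " + location);
            }
            return new S3LocationSketch(path.substring(0, slash), path.substring(slash + 1));
        }
        // Virtual host style: https://bucket.endpoint/key
        // (assumes the bucket name has no dots, so it ends at the first '.').
        int dot = authority.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("virtual host style needs bucket.endpoint: " + location);
        }
        return new S3LocationSketch(authority.substring(0, dot), path);
    }

    public static void main(String[] args) {
        S3LocationSketch loc = parse(
                "https://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123", false);
        System.out.println(loc.bucket + " " + loc.key);
    }
}
```

     In this sketch the query string is ignored when computing the key; only 
the raw path contributes to it.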
    
     #### Other Styles:
     - Virtual Host AWS Client (Hadoop S3) Mixed Style: 
`s3://my-bucket.s3.us-west-1.amazonaws.com/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
     - Path AWS Client (Hadoop S3) Mixed Style: 
`s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt?versionId=abc123&partNumber=77&partNumber=88`
    
     These two styles are selected by combining <code>isPathStyle</code> and 
<code>forceParsingByStandardUri</code>:
     - Virtual Host AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = false 
&& forceParsingByStandardUri = true</code>
     - Path AWS Client (Hadoop S3) Mixed Style: <code>isPathStyle = true && 
forceParsingByStandardUri = true</code>
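
     The flag combinations can be illustrated with a small sketch. It assumes 
that <code>forceParsingByStandardUri</code> means "parse an `s3://` location 
the same way as an `https://` one"; the class and method names are hypothetical 
and the code is not taken from this PR.

```java
import java.net.URI;

// Hypothetical sketch of how the two flags could select a parsing mode.
final class MixedStyleSketch {
    // Returns {bucket, key}.
    static String[] bucketAndKey(String location, boolean isPathStyle,
                                 boolean forceParsingByStandardUri) {
        URI uri = URI.create(location);
        String authority = uri.getAuthority();
        String rawPath = uri.getRawPath();
        if (authority == null || rawPath == null || rawPath.length() < 2) {
            throw new IllegalArgumentException("cannot parse: " + location);
        }
        String path = rawPath.substring(1); // drop the leading '/'

        // Plain AWS client style: the authority is the bucket.
        boolean s3Scheme = "s3".equalsIgnoreCase(uri.getScheme());
        if (s3Scheme && !forceParsingByStandardUri) {
            return new String[] {authority, path};
        }
        // Standard URI parsing, path style: the first path segment is the bucket.
        if (isPathStyle) {
            String[] parts = path.split("/", 2);
            if (parts.length < 2 || parts[0].isEmpty()) {
                throw new IllegalArgumentException("path style needs /bucket/key: " + location);
            }
            return parts;
        }
        // Standard URI parsing, virtual host style: the bucket is the first host label
        // (assumes the bucket name contains no dots).
        int dot = authority.indexOf('.');
        if (dot <= 0) {
            throw new IllegalArgumentException("virtual host style needs bucket.endpoint: " + location);
        }
        return new String[] {authority.substring(0, dot), path};
    }

    public static void main(String[] args) {
        String[] r = bucketAndKey(
                "s3://s3.us-west-1.amazonaws.com/my-bucket/resources/doc.txt", true, true);
        System.out.println(r[0] + " " + r[1]);
    }
}
```

     Without <code>forceParsingByStandardUri</code>, the same mixed-style 
location would be read as plain AWS client style, so the whole host name would 
be taken as the bucket.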
    
     When the incoming location is URL encoded, it is not decoded: 
<code>getKey()</code> and <code>getQueryParams()</code> return the encoded 
strings.
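
     To show what "returning the encoded string" can look like, here is a 
minimal sketch built on `java.net.URI`'s raw accessors. The helper names 
`rawKey` and `rawQueryParams` and the `Map<String, List<String>>` return type 
are assumptions for the example, not the signatures used in this PR.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: keep the key and query parameters in their encoded form.
final class EncodedPartsSketch {
    // Key taken from the raw (still percent-encoded) path.
    static String rawKey(String location) {
        String rawPath = URI.create(location).getRawPath();
        if (rawPath == null) {
            throw new IllegalArgumentException("not a hierarchical URI: " + location);
        }
        return rawPath.startsWith("/") ? rawPath.substring(1) : rawPath;
    }

    // Query parameters from the raw query; repeated names keep all values in order.
    static Map<String, List<String>> rawQueryParams(String location) {
        Map<String, List<String>> params = new LinkedHashMap<>();
        String rawQuery = URI.create(location).getRawQuery();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.computeIfAbsent(name, k -> new ArrayList<>()).add(value);
        }
        return params;
    }

    public static void main(String[] args) {
        String location = "s3://my-bucket/path/to/my%20file.txt?versionId=abc%2F123&partNumber=77&partNumber=88";
        System.out.println(rawKey(location));
        System.out.println(rawQueryParams(location));
    }
}
```

     Using `getRawPath()` and `getRawQuery()` rather than `getPath()` and 
`getQuery()` is what keeps sequences such as `%20` intact in this sketch.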
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
