paul-rogers commented on a change in pull request #2270:
URL: https://github.com/apache/drill/pull/2270#discussion_r671921107
##########
File path:
contrib/storage-http/src/main/java/org/apache/drill/exec/store/http/util/SimpleHttp.java
##########
@@ -322,6 +336,99 @@ private void setupCache(Builder builder) {
return formBodyBuilder;
}
+ /**
+ * Returns the URL-decoded URL
+ *
+ * @return Returns the URL-decoded URL
+ */
+ public static String decodedURL(HttpUrl url) {
+ try {
+ return URLDecoder.decode(url.toString(), "UTF-8");
+ } catch (UnsupportedEncodingException e) {
+ return url.toString();
+ }
+ }
+
+ /**
+ * Returns true if the url has url parameters, as indicated by the presence of
+ * {param} in a url.
+ *
+ * @return True if there are URL params, false if not
+ */
+ public static boolean hasURLParameters(HttpUrl url) {
+ String decodedUrl = SimpleHttp.decodedURL(url);
+ Matcher matcher = URL_PARAM_REGEX.matcher(decodedUrl);
+ return matcher.find();
+ }
+
+ /**
+ * APIs sometimes are structured with parameters in the URL itself. For instance, to request a list of
+ * an organization's repositories in github, the URL is: https://api.github.com/orgs/{org}/repos, where
+ * you can replace the org with the actual organization name.
+ *
+ * @return A list of URL parameters enclosed by curly braces.
+ */
+ public static List<String> getURLParameters(HttpUrl url) {
+ List<String> parameters = new ArrayList<>();
+ String decodedURL;
+ try {
+ decodedURL = URLDecoder.decode(url.toString(), "UTF-8");
+ } catch (UnsupportedEncodingException e) {
+ return null;
+ }
+ Matcher matcher = URL_PARAM_REGEX.matcher(decodedURL);
+ String param;
+ while (matcher.find()) {
+ param = matcher.group();
+ param = param.replace("{", "");
+ param = param.replace("}", "");
+ parameters.add(param);
+ }
+ return parameters;
+ }
+
+ /**
+ * Used for APIs which have parameters in the URL. This function maps the filters pushed down
+ * from the query into the URL. For example the API: github.com/orgs/{org}/repos requires a user to
+ * specify an organization and replace {org} with an actual organization. The filter is passed down from
+ * the query.
+ *
+ * Note that if a URL contains URL parameters and one is not provided in the filters, Drill will throw
+ * a UserException.
+ *
+ * @param url The HttpUrl containing URL Parameters
+ * @param filters: A HashMap of filters
+ * @return A string of the URL with the URL parameters replaced by filter values
+ */
+ public static String mapURLParameters(HttpUrl url, Map<String, String> filters) {
+ if (! hasURLParameters(url)) {
+ return url.toString();
+ }
+
+ List<String> params = SimpleHttp.getURLParameters(url);
+ String tempUrl = SimpleHttp.decodedURL(url);
+ for (String param : params) {
+ if (filters == null) {
+ throw UserException
+ .parseError()
+ .message("API Query with URL Parameters must be populated. Parameter " + param + " must be included in WHERE clause.")
+ .build(logger);
+ }
+
+ String value = filters.get(param);
+ // If the param is not populated, throw an exception
+ if (Strings.isNullOrEmpty(value)) {
+ throw UserException
+ .parseError()
+ .message("API Query with URL Parameters must be populated. Parameter " + param + " must be included in WHERE clause.")
+ .build(logger);
+ } else {
+ tempUrl = tempUrl.replace("/{" + param + "}", "/" + value);
Review comment:
(Edited to simplify the examples; the alert reader of the previous
version would have noted that the information was redundant.)
Four comments:
1. The above only replaces the first instance. For
`.../{ORG}/projects/{ORG}` there will be two entries, so this might be OK.
2. The `param` is the string from the URL, so cases match. But, if we mapped
names to lower case to match the case-insensitive column names, we have to
remember to use the original case here.
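3. The substitution is unstable. If I have the pathological
`.../{foo}/{bar}` and in my query I have `foo='{bar}'`, then the result will be
wrong: the later pass for `bar` also rewrites the `{bar}` text that was just
inserted as the value of `foo`.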
4. It turns out that the regex matcher can tell us the offset of each matched
pattern (`Matcher.start()` and `Matcher.end()` after `find()`). It would be
better (and would solve the above issues) if we converted the input URL into a
list of pattern values in which each pattern has the start offset, length, name
and optional default. The code which does substitutions can, from the start and
length values, figure out the parts of the original URL to keep. Examples,
noting that we do nothing to police where the parameters are placed in the URL:
Parameter is the entire URL:
```text
{url} -->
[{start=0, length=5, name="url"}]
```
Parameter is at the start of the URL:
```text
{protocol}://foo.com/latest -->
[{start=0, length=10, name="protocol"}]
```
Parameter is at the end of the URL:
```text
https://foo.com/api/{op} -->
[{start=xx, length=xx, name="op"}]
```
Parameter is at the end of the URL, and is part of a query string:
```text
https://foo.com/api/?op={op} -->
[{start=xx, length=xx, name="op"}]
```
Parameter appears twice in the URL:
```text
https://foo.com/api/{org}/projects/{org}/details -->
[{start=xx, length=xx, name="org"},
{start=xx, length=xx, name="org"}]
```
In each of the above, it is easy to figure out the part of the original URL
to preserve (taking `end` = `start` + `length`): it is the part from 0 to the
first `start`, the text between successive `end`/`start` pairs, and from the
last `end` to the end of the URL.
Would be great to have unit tests for each of these cases as well.
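To make the offset-based suggestion concrete, here is a minimal sketch rather than a proposed final implementation. The class `UrlTemplate`, its nested `Param` type, and the `\{(\w+)\}` pattern are all hypothetical; the PR's `URL_PARAM_REGEX` is not shown in this diff, so the pattern is an assumption. The sketch operates on an already-decoded URL string, throws a plain `IllegalArgumentException` where Drill would throw a `UserException`, and leaves out the optional default and any URL-encoding of the values.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of Drill): offset-based URL parameter substitution. */
final class UrlTemplate {

  /** One {name} placeholder and its position in the template string. */
  static final class Param {
    final int start;    // offset of the opening '{'
    final int length;   // length of the placeholder, braces included
    final String name;  // text between the braces

    Param(int start, int length, String name) {
      this.start = start;
      this.length = length;
      this.name = name;
    }

    int end() {
      return start + length;
    }
  }

  // Assumed pattern; the real URL_PARAM_REGEX may differ.
  private static final Pattern PARAM = Pattern.compile("\\{(\\w+)\\}");

  /** Scans the template once and records every placeholder with its offsets. */
  static List<Param> parse(String template) {
    List<Param> params = new ArrayList<>();
    Matcher m = PARAM.matcher(template);
    while (m.find()) {
      params.add(new Param(m.start(), m.end() - m.start(), m.group(1)));
    }
    return params;
  }

  /**
   * Copies the text between placeholders unchanged and inserts the value for
   * each placeholder. Inserted values are never rescanned.
   */
  static String substitute(String template, Map<String, String> values) {
    StringBuilder out = new StringBuilder();
    int last = 0;
    for (Param p : parse(template)) {
      String value = values == null ? null : values.get(p.name);
      if (value == null || value.isEmpty()) {
        throw new IllegalArgumentException(
            "Parameter " + p.name + " must be included in WHERE clause.");
      }
      out.append(template, last, p.start).append(value);
      last = p.end();
    }
    return out.append(template, last, template.length()).toString();
  }
}
```

Because the template is scanned once and the output is built from the original offsets, a value such as `foo='{bar}'` is copied through literally: with `bar='x'`, `.../{foo}/{bar}` becomes `.../{bar}/x`. Repeated parameters and parameters outside a path segment (for example `?op={op}`) take the same code path, and the lookup uses the name exactly as it appears in the URL.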