Re: [PR] Azure multi read options (druid)

via GitHub Tue, 23 Jan 2024 17:07:12 -0800


317brian commented on code in PR #15630:
URL: https://github.com/apache/druid/pull/15630#discussion_r1464158396



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|

Review Comment:
   ```suggestion
   |uris|JSON array of URIs where the Azure objects to be ingested are located. Use this format: `azureStorage://STORAGE_ACCOUNT/CONTAINER/PATH_TO_FILE`|None|One of the following must be set: `uris`, `prefixes`, or `objects`.|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|

Review Comment:
   ```suggestion
   |objects|JSON array of Azure objects to ingest.|None|One of the following must be set: `uris`, `prefixes`, or `objects`.|
   ```
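
For context on how the `objects` form relates to the `uris` form, here is a minimal Java sketch. It assumes, based only on the table rows quoted in this review, that `bucket` holds the storage account and `path` holds the container plus the path inside it, so that the pair maps onto `azureStorage://STORAGE_ACCOUNT/CONTAINER/PATH_TO_FILE`. The class and method names are hypothetical and are not part of Druid.

```java
import java.net.URI;

class AzureObjectUri {
    // Hypothetical helper: joins an `objects` entry into the URI form used by `uris`.
    // Assumes `bucket` is the storage account and `path` is "<container>/<path-to-file>".
    static URI toUri(String bucket, String path) {
        // URI.create throws IllegalArgumentException for characters that need
        // escaping (for example spaces); this sketch does not encode them.
        return URI.create("azureStorage://" + bucket + "/" + path);
    }

    public static void main(String[] args) {
        System.out.println(toUri("storageAccount", "container/prefix1/file1.json"));
    }
}
```

Under that assumption, the first entry in the third sample spec corresponds to a `uris` entry with the same storage account, container, and path.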



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|

Review Comment:
   ```suggestion
   |prefixes|JSON array of URI prefixes for the locations of Azure objects to ingest. Use this format: `azureStorage://STORAGE_ACCOUNT/CONTAINER/PREFIX`. Empty objects starting with any of the given prefixes are skipped.|None|One of the following must be set: `uris`, `prefixes`, or `objects`.|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI 
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br 
/><br />The glob must match the entire object part, not just the filename. For 
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`, 
because the object part is `bar/file.json`, and the`*` does not match the 
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br 
/>For more information, refer to the documentation for 
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows. 
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`), 
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration. 
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes` 
is specified.
+
+The `objects` property is:
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake storage 
account|None|yes|
+|path|The container and path where data is located.|None|yes|
+
+
+The `properties` property is:
+Either set sharedAccessStorageToken OR key OR 
appRegistrationClientId/appRegistrationClientSecret/tenantId OR set nothing.
+
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|sharedAccessStorageToken|The plain text string of this Azure Blob Storage 
Shared Access Token|None|No|
+|key|The root key of Azure Blob Storage Account|None|no|
+|appRegistrationClientId|The client ID of the Azure App registration to 
authenticate as|None|No|
+|appRegistrationClientSecret|The client secret of the Azure App registration 
to authenticate as|None|Yes if `appRegistrationClientId` is provided|
+|tenantId|The tenant ID of the Azure App registration to authenticate 
as|None|Yes if `appRegistrationClientId` is provided|
+
+<details closed>
+  <summary>The v1 'azure' input source</summary>
+The old `azure` input source did not support specifying which storage account 
to ingest from so it has been deprecated.

Review Comment:
   ```suggestion
     <summary>Show the deprecated 'azure' input source</summary>
   
   Note that the deprecated `azure` input source doesn't support specifying which storage account to ingest from. We recommend using the `azureStorage` input source instead.
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI 
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br 
/><br />The glob must match the entire object part, not just the filename. For 
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`, 
because the object part is `bar/file.json`, and the`*` does not match the 
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br 
/>For more information, refer to the documentation for 
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|

Review Comment:
   ```suggestion
   |objectGlob|A glob for the object part of the Azure URI. In the URI `azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br /><br />The glob must match the entire object part, not just the filename. For example, the glob `*.json` does not match `azureStorage://foo/bar/file.json` because the object part is `bar/file.json`, and the `*` does not match the slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br />For more information, refer to the documentation for [`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
   ```
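
The glob behavior described in this row can be checked directly against `FileSystem#getPathMatcher`, which the row links to. The sketch below is only an illustration of that JDK API: the class and method names are hypothetical, and it assumes the default file system and that the glob is applied to the object part exactly as the row describes. It is not taken from the Druid code.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

class ObjectGlobCheck {
    // Returns true when the glob matches the entire object part of the URI
    // (for example "bar/file.json"), not just the file name.
    static boolean matches(String glob, String objectPart) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(objectPart));
    }

    public static void main(String[] args) {
        // "*" does not cross a directory separator, "**" does.
        System.out.println(matches("*.json", "bar/file.json"));
        System.out.println(matches("**.json", "bar/file.json"));
        System.out.println(matches("*.json", "file.json"));
    }
}
```

The first call shows why `*.json` is not enough for objects under a prefix, and the second shows the `**.json` form that the row recommends.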



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.

Review Comment:
   ```suggestion
   :::info
   The old `azure` schema is deprecated. Update your specs to use the `azureStorage` schema described below instead.
   :::
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI 
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br 
/><br />The glob must match the entire object part, not just the filename. For 
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`, 
because the object part is `bar/file.json`, and the`*` does not match the 
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br 
/>For more information, refer to the documentation for 
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows. 
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`), 
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration. 
See below for more information.|None|No (defaults will be used if not given)

Review Comment:
   ```suggestion
   |properties|Properties object for overriding the default Azure configuration. See below for more information.|None|No (defaults will be used if not given)|
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI 
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br 
/><br />The glob must match the entire object part, not just the filename. For 
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`, 
because the object part is `bar/file.json`, and the`*` does not match the 
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br 
/>For more information, refer to the documentation for 
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows. 
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`), 
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration. 
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes` 
is specified.
+
+The `objects` property is:

Review Comment:
   ```suggestion
   The `objects` property can be one of the following:
   ```



##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
 The Azure input source reads objects directly from Azure Blob store or Azure 
Data Lake sources. You can
 specify objects as a list of file URI strings or prefixes. You can split the 
Azure input source for use with [Parallel task](./native-batch.md) indexing and 
each worker task reads one chunk of the split data.
 
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage` 
schema instead.
+
+Sample specs:
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "uris": ["azureStorage://storageAccount/container/prefix1/file.json", 
"azureStorage://storageAccount/container/prefix2/file2.json"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.parquet",
+        "prefixes": ["azureStorage://storageAccount/container/prefix1/", 
"azureStorage://storageAccount/container/prefix2/"]
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+
+```json
+...
+    "ioConfig": {
+      "type": "index_parallel",
+      "inputSource": {
+        "type": "azureStorage",
+        "objectGlob": "**.json",
+        "objects": [
+          { "bucket": "storageAccount", "path": 
"container/prefix1/file1.json"},
+          { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+        ],
+        "properties": {
+          "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+        }
+      },
+      "inputFormat": {
+        "type": "json"
+      },
+      ...
+    },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located, 
in the form 
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or 
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to 
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`. 
Empty objects starting with one of the given prefixes are skipped.|None|`uris` 
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or 
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI 
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br 
/><br />The glob must match the entire object part, not just the filename. For 
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`, 
because the object part is `bar/file.json`, and the`*` does not match the 
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br 
/>For more information, refer to the documentation for 
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows. 
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`), 
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration. 
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes` 
is specified.
+
+The `objects` property is:
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake storage 
account|None|yes|
+|path|The container and path where data is located.|None|yes|
+
+
+The `properties` property is:
+Either set sharedAccessStorageToken OR key OR 
appRegistrationClientId/appRegistrationClientSecret/tenantId OR set nothing.

Review Comment:
   ```suggestion
   The `properties` property can be one of the following:
   
   - `sharedAccessStorageToken`
   - `key` 
   - `appRegistrationClientId`, `appRegistrationClientSecret`, and `tenantId` 
   - empty
   ```
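
To make the "one of the following" rule concrete, here is a minimal Java sketch of how a `properties` object could be checked. It is one reading of the text quoted in this review, not Druid's actual validation: it assumes that at most one authentication method may be set, and that `appRegistrationClientId`, `appRegistrationClientSecret`, and `tenantId` must appear together. The class and method names are hypothetical.

```java
import java.util.Map;

class AzurePropertiesCheck {
    // Hypothetical check: accepts an empty object, a SAS token, a key,
    // or the complete app registration triple, and nothing else.
    static boolean isValid(Map<String, String> props) {
        boolean sas = props.containsKey("sharedAccessStorageToken");
        boolean key = props.containsKey("key");
        boolean clientId = props.containsKey("appRegistrationClientId");
        boolean secret = props.containsKey("appRegistrationClientSecret");
        boolean tenant = props.containsKey("tenantId");

        boolean anyApp = clientId || secret || tenant;
        boolean fullApp = clientId && secret && tenant;
        if (anyApp && !fullApp) {
            return false; // a partial app registration is not usable
        }
        int methods = (sas ? 1 : 0) + (key ? 1 : 0) + (fullApp ? 1 : 0);
        return methods <= 1; // zero methods means "use the defaults"
    }

    public static void main(String[] args) {
        System.out.println(isValid(Map.of("key", "example-key")));
        System.out.println(isValid(Map.of("appRegistrationClientId", "example-id")));
    }
}
```

Whether Druid rejects a spec that sets more than one method is not stated in the quoted docs, so treat the `methods <= 1` rule as an assumption.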



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
