317brian commented on code in PR #15630:
URL: https://github.com/apache/druid/pull/15630#discussion_r1464158396
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
Review Comment:
```suggestion
|uris|JSON array of URIs where the Azure objects to be ingested are located.
Use this format:
`azureStorage://STORAGE_ACCOUNT/CONTAINER/PATH_TO_FILE`|None|One of the
following must be set: `uris`, `prefixes`, or `objects`.|
```
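For reference, the URI format in this suggestion can be illustrated with a minimal Python sketch that splits an `azureStorage://` URI into its storage account, container, and path. The helper name `split_azure_storage_uri` is hypothetical and not part of Druid; it only mirrors the format described in the docs and says nothing about how Druid parses URIs internally.

```python
from urllib.parse import urlsplit


def split_azure_storage_uri(uri: str) -> dict:
    """Hypothetical helper: split azureStorage://STORAGE_ACCOUNT/CONTAINER/PATH_TO_FILE."""
    parts = urlsplit(uri)
    # urlsplit lowercases the scheme, so compare against the lowercase form.
    if parts.scheme != "azurestorage":
        raise ValueError(f"expected an azureStorage:// URI, got: {uri}")
    # netloc keeps the original case of the storage account name.
    storage_account = parts.netloc
    # The first path segment is the container; the rest is the object path.
    container, _, path = parts.path.lstrip("/").partition("/")
    if not storage_account or not container:
        raise ValueError(f"URI must include a storage account and a container: {uri}")
    return {"storageAccount": storage_account, "container": container, "path": path}
```

With the sample URI from the spec, `split_azure_storage_uri("azureStorage://storageAccount/container/prefix1/file.json")` yields the storage account `storageAccount`, the container `container`, and the path `prefix1/file.json`.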
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
Review Comment:
```suggestion
|objects|JSON array of Azure objects to ingest.|None|One of the following
must be set: `uris`, `prefixes`, or `objects`.|
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
Review Comment:
```suggestion
|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest. Use this format: `azureStorage://STORAGE_ACCOUNT/CONTAINER/PREFIX`. Empty
objects starting with any of the given prefixes are skipped.|None|One of the
following must be set: `uris`, `prefixes`, or `objects`.|
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`,
because the object part is `bar/file.json`, and the`*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows.
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`),
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration.
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes`
is specified.
+
+The `objects` property is:
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake storage
account|None|yes|
+|path|The container and path where data is located.|None|yes|
+
+
+The `properties` property is:
+Either set sharedAccessStorageToken OR key OR
appRegistrationClientId/appRegistrationClientSecret/tenantId OR set nothing.
+
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|sharedAccessStorageToken|The plain text string of this Azure Blob Storage
Shared Access Token|None|No|
+|key|The root key of Azure Blob Storage Account|None|no|
+|appRegistrationClientId|The client ID of the Azure App registration to
authenticate as|None|No|
+|appRegistrationClientSecret|The client secret of the Azure App registration
to authenticate as|None|Yes if `appRegistrationClientId` is provided|
+|tenantId|The tenant ID of the Azure App registration to authenticate
as|None|Yes if `appRegistrationClientId` is provided|
+
+<details closed>
+ <summary>The v1 'azure' input source</summary>
+The old `azure` input source did not support specifying which storage account
to ingest from so it has been deprecated.
Review Comment:
```suggestion
<summary>Show the deprecated 'azure' input source</summary>
Note that the deprecated `azure` input source doesn't support specifying
which storage account to ingest from. We recommend using the `azureStorage`
input source instead.
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`,
because the object part is `bar/file.json`, and the`*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
Review Comment:
```suggestion
|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`
because the object part is `bar/file.json`, and the `*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
```
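To make the `*` versus `**` distinction in this row concrete, here is a minimal Python sketch that approximates only those two wildcards. It is an illustration under stated assumptions, not Druid's implementation: Druid relies on Java's `FileSystem#getPathMatcher`, which supports more glob syntax than this. The function names `glob_to_regex` and `object_matches` are hypothetical.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a glob using only `*` and `**` into a compiled regex (sketch)."""
    out = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            out.append(".*")      # `**` may cross slashes
            i += 2
        elif glob[i] == "*":
            out.append("[^/]*")   # `*` stops at a slash
            i += 1
        else:
            out.append(re.escape(glob[i]))  # everything else is literal
            i += 1
    return re.compile("".join(out))


def object_matches(glob: str, object_path: str) -> bool:
    """The glob must match the entire object part, not just the filename."""
    return glob_to_regex(glob).fullmatch(object_path) is not None
```

Under these assumptions, `object_matches("*.json", "bar/file.json")` is false because `*` cannot match the slash, while `object_matches("**.json", "bar/file.json")` is true.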
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
Review Comment:
```suggestion
:::info
The old `azure` schema is deprecated. Update your specs to use the
`azureStorage` schema described below instead.
:::
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`,
because the object part is `bar/file.json`, and the`*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows.
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`),
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration.
See below for more information.|None|No (defaults will be used if not given)
Review Comment:
```suggestion
|properties|Properties object for overriding the default Azure
configuration. See below for more information.|None|No (defaults will be used
if not given)|
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`,
because the object part is `bar/file.json`, and the`*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows.
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`),
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration.
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes`
is specified.
+
+The `objects` property is:
Review Comment:
```suggestion
Each entry in the `objects` array has the following properties:
```
##########
docs/ingestion/input-sources.md:
##########
@@ -309,6 +309,105 @@ Google Cloud Storage object:
The Azure input source reads objects directly from Azure Blob store or Azure
Data Lake sources. You can
specify objects as a list of file URI strings or prefixes. You can split the
Azure input source for use with [Parallel task](./native-batch.md) indexing and
each worker task reads one chunk of the split data.
+
+The `azure` schema is on a path towards deprecation, use the `azureStorage`
schema instead.
+
+Sample specs:
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "uris": ["azureStorage://storageAccount/container/prefix1/file.json",
"azureStorage://storageAccount/container/prefix2/file2.json"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.parquet",
+ "prefixes": ["azureStorage://storageAccount/container/prefix1/",
"azureStorage://storageAccount/container/prefix2/"]
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+
+```json
+...
+ "ioConfig": {
+ "type": "index_parallel",
+ "inputSource": {
+ "type": "azureStorage",
+ "objectGlob": "**.json",
+ "objects": [
+ { "bucket": "storageAccount", "path":
"container/prefix1/file1.json"},
+ { "bucket": "storageAccount", "path": "container/prefix2/file2.json"}
+ ],
+ "properties": {
+ "sharedAccessStorageToken": "?sv=...<storage token secret>...",
+ }
+ },
+ "inputFormat": {
+ "type": "json"
+ },
+ ...
+ },
+...
+```
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|type|Set the value to `azureStorage`.|None|yes|
+|uris|JSON array of URIs where the Azure objects to be ingested are located,
in the form
`azureStorage://<storageAccount>/<container>/<path-to-file>`|None|`uris` or
`prefixes` or `objects` must be set|
+|prefixes|JSON array of URI prefixes for the locations of Azure objects to
ingest, in the form `azureStorage://<storageAccount>/<container>/<prefix>`.
Empty objects starting with one of the given prefixes are skipped.|None|`uris`
or `prefixes` or `objects` must be set|
+|objects|JSON array of Azure objects to ingest.|None|`uris` or `prefixes` or
`objects` must be set|
+|objectGlob|A glob for the object part of the Azure URI. In the URI
`azureStorage://foo/bar/file.json`, the glob is applied to `bar/file.json`.<br
/><br />The glob must match the entire object part, not just the filename. For
example, the glob `*.json` does not match `azureStorage://foo/bar/file.json`,
because the object part is `bar/file.json`, and the`*` does not match the
slash. To match all objects ending in `.json`, use `**.json` instead.<br /><br
/>For more information, refer to the documentation for
[`FileSystem#getPathMatcher`](https://docs.oracle.com/javase/8/docs/api/java/nio/file/FileSystem.html#getPathMatcher-java.lang.String-).|None|no|
+|systemFields|JSON array of system fields to return as part of input rows.
Possible values: `__file_uri` (Azure blob URI starting with `azureStorage://`),
`__file_bucket` (Azure bucket), and `__file_path` (Azure object path).|None|no|
+|properties|Properties Object for overriding the default Azure configuration.
See below for more information.|None|No (defaults will be used if not given)
+
+Note that the Azure input source skips all empty objects only when `prefixes`
is specified.
+
+The `objects` property is:
+
+|Property|Description|Default|Required|
+|--------|-----------|-------|---------|
+|bucket|Name of the Azure Blob Storage or Azure Data Lake storage
account|None|yes|
+|path|The container and path where data is located.|None|yes|
+
+
+The `properties` property is:
+Either set sharedAccessStorageToken OR key OR
appRegistrationClientId/appRegistrationClientSecret/tenantId OR set nothing.
Review Comment:
```suggestion
The `properties` object can contain one of the following:
- `sharedAccessStorageToken`
- `key`
- `appRegistrationClientId`, `appRegistrationClientSecret`, and `tenantId`
- nothing (leave it empty to use the default Azure configuration)
```
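The options in this list could be checked with a small Python sketch like the one below. It assumes the options are mutually exclusive, as the "either ... OR ..." wording in the diff suggests, and that `appRegistrationClientSecret` and `tenantId` are required whenever `appRegistrationClientId` is set, as the table states. The function name `auth_mode` and the returned labels are hypothetical; Druid's actual validation may differ.

```python
APP_REGISTRATION_KEYS = (
    "appRegistrationClientId",
    "appRegistrationClientSecret",
    "tenantId",
)


def auth_mode(properties: dict) -> str:
    """Return which credential option a `properties` object uses (sketch)."""
    app_keys_present = [k for k in APP_REGISTRATION_KEYS if k in properties]

    modes = []
    if "sharedAccessStorageToken" in properties:
        modes.append("sharedAccessStorageToken")
    if "key" in properties:
        modes.append("key")
    if app_keys_present:
        modes.append("appRegistration")

    # Assumption: only one credential option may be set at a time.
    if len(modes) > 1:
        raise ValueError(f"set only one credential option, found: {modes}")
    # The three app registration fields must be supplied together.
    if app_keys_present and len(app_keys_present) != len(APP_REGISTRATION_KEYS):
        missing = [k for k in APP_REGISTRATION_KEYS if k not in properties]
        raise ValueError(f"missing app registration fields: {missing}")
    # An empty object falls back to the default Azure configuration.
    return modes[0] if modes else "default"
```

For example, an empty `properties` object returns `"default"`, while passing only `appRegistrationClientId` raises a `ValueError` that names the two missing fields.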