This is an automated email from the ASF dual-hosted git repository.
cgivre pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/drill.git
The following commit(s) were added to refs/heads/master by this push:
new 7902357251 DRILL-8356: Add File Name to GoogleSheets Plugin (#2704)
7902357251 is described below
commit 7902357251ea34a0461beb5304563dde7dd9992a
Author: Charles S. Givre <[email protected]>
AuthorDate: Thu Nov 10 09:01:42 2022 -0500
DRILL-8356: Add File Name to GoogleSheets Plugin (#2704)
---
contrib/storage-googlesheets/README.md | 71 +++++++++++-----------
.../googlesheets/GoogleSheetsBatchReader.java | 23 ++++++-
.../store/googlesheets/GoogleSheetsScanSpec.java | 10 ++-
.../schema/GoogleSheetsDrillSchema.java | 6 +-
.../schema/GoogleSheetsRootSchema.java | 3 +-
.../googlesheets/TestGoogleSheetsQueries.java | 6 +-
6 files changed, 77 insertions(+), 42 deletions(-)
diff --git a/contrib/storage-googlesheets/README.md
b/contrib/storage-googlesheets/README.md
index f3128edb5f..5f26582c99 100644
--- a/contrib/storage-googlesheets/README.md
+++ b/contrib/storage-googlesheets/README.md
@@ -1,8 +1,8 @@
# Google Sheets Connector for Apache Drill
-This connector enables you to query and write to Google Sheets.
+This connector enables you to query and write to Google Sheets.
### Usage Notes:
-This feature should be considered experimental as Google's API for Sheets is
quite complex and amazingly
+This feature should be considered experimental as Google's API for Sheets is
quite complex and amazingly
poorly documented.
## Setup Step 1: Obtain Credential Information from Google
@@ -10,19 +10,19 @@ Ok... this is a pain. Google Sheets uses OAuth2.0 (may it
be quickly deprecated
* Your `clientID`: This is an identifier which uniquely identifies your
application to Google
* Your `client_secret`: You can think of this as your password for your
application to access Google Sheets
-* Your redirect URL: This is the URL to which Google will send the various
access tokens and which you will need later. For a local installation of
Drill, it will be:
+* Your redirect URL: This is the URL to which Google will send the various
access tokens and which you will need later. For a local installation of
Drill, it will be:
`http://localhost:8047/credentials/<plugin name>/update_oauth2_authtoken`.
-1. To obtain the `clientID` and `clientSecret` you will need to obtain the
Google keys, open the [Google Sheets
API](https://console.cloud.google.com/apis/library/sheets.googleapis.com) and
click on the `Enable` button.
+1. To obtain the `clientID` and `clientSecret` you will need to obtain the
Google keys, open the [Google Sheets
API](https://console.cloud.google.com/apis/library/sheets.googleapis.com) and
click on the `Enable` button.
2. Once you've enabled the API, you will be taken to the API Manager. Either
select an existing project or create a new one.
3. Next, navigate to the `Credentials` in the left panel.
-4. Click on `+Create Credentials` at the top of the page. Select `OAuth
client ID` and select `Web Application` or `Desktop` as the type. Follow the
instructions and download
+4. Click on `+Create Credentials` at the top of the page. Select `OAuth
client ID` and select `Web Application` or `Desktop` as the type. Follow the
instructions and download
the JSON file that Google provides.
Drill does not use the JSON file, but you will be cutting and pasting values
from the JSON file into the Drill configuration.
## Setup Step 2: Configure Drill
-Create a storage plugin following the normal procedure for doing so. You can
use the example below as a template. Cut and paste the `clientID` and
`client_secret` from the
+Create a storage plugin following the normal procedure for doing so. You can
use the example below as a template. Cut and paste the `clientID` and
`client_secret` from the
JSON file into your Drill configuration as shown below. Once you've done
that, save the configuration.
```json
@@ -52,21 +52,21 @@ JSON file into your Drill configuration as shown below.
Once you've done that,
}
```
-With the exception of the clientID, client_secret and redirects, you should
not have to modify any of the other parameters in the configuration.
+With the exception of the clientID, client_secret and redirects, you should
not have to modify any of the other parameters in the configuration.
### Other Configuration Parameters
There are two configuration parameters which you may want to adjust:
-* `allTextMode`: This parameter when `true` disables Drill's data type
inferencing for your files. If your data has inconsistent data types, set this
to `true`. Default is
- `true`.
+* `allTextMode`: This parameter when `true` disables Drill's data type
inferencing for your files. If your data has inconsistent data types, set this
to `true`. Default is
+ `true`.
* `extractHeaders`: When `true`, Drill will treat the first row of your data
as headers. When `false` Drill will assign column names like `field_n` for
each column.
### Authenticating with Google
Once you have configured Drill to query Google Sheets, there is one final step
before you can access data. You must authenticate the application (Drill) with
Google Sheets. After you have saved your Google Sheets configuration, navigate
back to the configuration screen for your plugin and click on `Authorize`. A
new window should appear which will prompt you to authenticate with Google
services. Once you have done that, you should be able to query Google Sheets!
See, that wasn't so hard!
### Authentication Modes:
-The Google Sheets plugin supports the `SHARED_USER` and `USER_TRANSLATION`
authentication modes. `SHARED_USER` is as the name implies, one user for
everyone. `USER_TRANSLATION`
-uses different credentials for each individual user. In this case, the
credentials are the OAuth2.0 access tokens.
+The Google Sheets plugin supports the `SHARED_USER` and `USER_TRANSLATION`
authentication modes. `SHARED_USER` is as the name implies, one user for
everyone. `USER_TRANSLATION`
+uses different credentials for each individual user. In this case, the
credentials are the OAuth2.0 access tokens.
At the time of writing, we have not yet documented `USER_TRANSLATION` fully,
however we will update this readme once that is complete.
@@ -77,12 +77,12 @@ Once you have configured Drill to connect to Google Sheets,
querying is very str
The URL below is a public spreadsheet hosted on Google Sheets:
[https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/](https://docs.google.com/spreadsheets/d/1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms/)
-In this URL, the portion `1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms` is the
spreadsheetID. Thus,
+In this URL, the portion `1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms` is the
spreadsheetID. Thus,
if you wanted to query this sheet in Drill, after configuring Drill, you could
do so with the following
query:
```sql
-SELECT *
+SELECT *
FROM googlesheets.`1BxiMVs0XRA5nFMdKvBdBZjgmUUqptlbs74OgvE2upms`.`Class Data`
```
@@ -96,7 +96,7 @@ Note that you must specify the tab name to successfully query
Google Sheets.
If you don't know the names of the available tabs in your GoogleSheets
document, you can query the sheets by index using the `tab[n]` format.
Indexing starts at zero and every Sheets document must have at least one sheet.
Note that this must be enclosed in backticks.
```sql
-SELECT *
+SELECT *
FROM googlesheets.<sheet id>.`tab[0]`
```
@@ -105,24 +105,27 @@ FROM googlesheets.<sheet id>.`tab[0]`
You can obtain a list of available sheets by querying the `INFORMATION_SCHEMA`
as shown below. Assuming that you have a connection to Google Sheets called
`googlesheets`:
```sql
-SELECT *
-FROM `INFORMATION_SCHEMA`.`SCHEMATA`
+SELECT *
+FROM `INFORMATION_SCHEMA`.`SCHEMATA`
WHERE SCHEMA_NAME LIKE 'googlesheets%'
```
Due to rate limits from Google, the tabs are not reported to the
`INFORMATION_SCHEMA`. However, it is possible to obtain a list of all
available tabs with the following query:
```sql
-SELECT _sheets
-FROM googlesheets.`<token>`.`<sheet>`
+SELECT _sheets
+FROM googlesheets.`<token>`.`<sheet>`
LIMIT 1
```
+You can also access the file name with the `_title` field. Note that the file
name is NOT
+unique and should not be used for querying data.
+
### Using Aliases
-Since the sheet IDs from Google are not human readable, one way to make your
life easier is to use Drill's aliasing features to provide a better name for
the actual sheet name.
+Since the sheet IDs from Google are not human readable, one way to make your
life easier is to use Drill's aliasing features to provide a better name for
the actual sheet name.
### Data Types
-Drill's Google Sheets reader will attempt to infer the data types of the
incoming data. As with other connectors, this is an imperfect process since
Google Sheets does not
+Drill's Google Sheets reader will attempt to infer the data types of the
incoming data. As with other connectors, this is an imperfect process since
Google Sheets does not
supply a schema or other information to allow Drill to identify the data types
of a column. At present, here is how Drill will map your data:
* Numbers: All numeric columns will be mapped to `DOUBLE` data types
* Boolean: Columns containing `true/false` will be mapped to the `BOOLEAN`
type
@@ -135,27 +138,27 @@ If the data type inference is not working for you, you
can set the `allTextMode`
As with other plugins, you can provide a schema inline as shown in the example
query below.
```sql
-SELECT *
-FROM table(`googlesheets`.`<your google sheet>`.`MixedSheet`
- (schema => 'inline=(`Col1` VARCHAR, `Col2` INTEGER, `Col3` VARCHAR)'))
+SELECT *
+FROM table(`googlesheets`.`<your google sheet>`.`MixedSheet`
+ (schema => 'inline=(`Col1` VARCHAR, `Col2` INTEGER, `Col3` VARCHAR)'))
LIMIT 5
```
### Column Headers
-When Drill reads Google Sheets, it is assumed that the first row contains
column headers.
-If this is incorrect you can set the `extractHeaders` parameter to `false`and
Drill will name each field `field_n` where `n` is the column index.
+When Drill reads Google Sheets, it is assumed that the first row contains
column headers.
+If this is incorrect you can set the `extractHeaders` parameter to `false`and
Drill will name each field `field_n` where `n` is the column index.
# Writing Data To Google Sheets
-When Drill is connected to Google Sheets, you can also write data to Google
Sheets. The basic procedure is
+When Drill is connected to Google Sheets, you can also write data to Google
Sheets. The basic procedure is
the same as with any other data source. Simply write a `CREATE TABLE AS`
(CTAS) query and your data will be
written to Google Sheets.
-One challenge is that once you have created the new sheet, you will have to
manually retrieve the spreadsheet ID
+One challenge is that once you have created the new sheet, you will have to
manually retrieve the spreadsheet ID
from Google Sheets in order to query your new data.
### Dropping Tables
-At the time of implementation, it is only possible to delete tables from
within a Google Sheets document. You may encounter errors if you try to delete
tables from documents
+At the time of implementation, it is only possible to delete tables from
within a Google Sheets document. You may encounter errors if you try to delete
tables from documents
that only have one table in them. The format for deleting a table is:
```sql
@@ -166,17 +169,17 @@ DROP TABLE googlesheets.<sheet id>.<tab name>
### Auto-Aliasing
As of Drill 1.20, Drill allows you to create user and public aliases for
tables and storage plugins. Since Google Sheets
-requires you to use a non-human readable ID to identify the Sheet. One
possible idea to make the Drill connection to Google Sheets
+requires you to use a non-human readable ID to identify the Sheet. One
possible idea to make the Drill connection to Google Sheets
much more usable would be to automatically create an alias (either public)
automatically mapping the unreadable sheetID to the document title.
This could be accomplished after the first query or after a CTAS query.
### Google Drive Integration
-Integrating with Google Drive may allow additional functionality such as
getting the actual document name, deleting documents and a few other basic
functions. However, the
-Google Drive permissions require additional validation from Google.
+Integrating with Google Drive may allow additional functionality such as
getting the actual document name, deleting documents and a few other basic
functions. However, the
+Google Drive permissions require additional validation from Google.
### Additional Pushdowns
-The current implementation supports pushdowns for projection and limit.
+The current implementation supports pushdowns for projection and limit.
The Google Sheets API is quite complex and incredibly poorly documented. In
this author's opinion, it is quite possibly one of the worst APIs he has ever
seen.
-In any event, it may be possible to add filter, sort and perhaps other
pushdowns.
-The current implementation keeps the logic to push filters down to the batch
reader, but does not act on these filters.
+In any event, it may be possible to add filter, sort and perhaps other
pushdowns.
+The current implementation keeps the logic to push filters down to the batch
reader, but does not act on these filters.
If someone figures out how to add the filter pushdowns and wishes to do so,
the query planning logic is all there.
diff --git
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsBatchReader.java
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsBatchReader.java
index 48020f355d..db518fdb6a 100644
---
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsBatchReader.java
+++
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsBatchReader.java
@@ -64,8 +64,9 @@ public class GoogleSheetsBatchReader implements
ManagedReader<SchemaNegotiator>
// 1000 rows would throw invalid request errors.
private static final int BATCH_SIZE = 1000;
private static final String SHEET_COLUMN_NAME = "_sheets";
+ private static final String TITLE_COLUMN_NAME = "_title";
- private static final List<String> IMPLICIT_FIELDS =
Arrays.asList(SHEET_COLUMN_NAME);
+ private static final List<String> IMPLICIT_FIELDS =
Arrays.asList(SHEET_COLUMN_NAME, TITLE_COLUMN_NAME);
private final GoogleSheetsStoragePluginConfig config;
private final GoogleSheetsSubScan subScan;
@@ -77,7 +78,7 @@ public class GoogleSheetsBatchReader implements
ManagedReader<SchemaNegotiator>
private final List<String> sheetNames;
private CustomErrorContext errorContext;
private ScalarWriter sheetNameWriter;
-
+ private ScalarWriter titleNameWriter;
private TupleMetadata schema;
private Map<String, GoogleSheetsColumn> columnMap;
private RowSetLoader rowWriter;
@@ -181,6 +182,12 @@ public class GoogleSheetsBatchReader implements
ManagedReader<SchemaNegotiator>
sheetImplicitColumn.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD,
true);
schema.addColumn(sheetImplicitColumn);
+ ColumnMetadata titleImplicitColumn =
MetadataUtils.newScalar(TITLE_COLUMN_NAME,
+ MinorType.VARCHAR, DataMode.OPTIONAL);
+
titleImplicitColumn.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD,
true);
+ schema.addColumn(titleImplicitColumn);
+
+
negotiator.tableSchema(schema, true);
ResultSetLoader resultLoader = negotiator.build();
// Create ScalarWriters
@@ -308,6 +315,18 @@ public class GoogleSheetsBatchReader implements
ManagedReader<SchemaNegotiator>
for (String sheetName : sheetNames) {
sheetNameWriter.setString(sheetName);
}
+
+ if (titleNameWriter == null) {
+ int titleColumnIndex = rowWriter.tupleSchema().index(TITLE_COLUMN_NAME);
+ if (titleColumnIndex == -1) {
+ ColumnMetadata titleColSchema =
MetadataUtils.newScalar(TITLE_COLUMN_NAME,
+ MinorType.VARCHAR,
+ DataMode.OPTIONAL);
+
titleColSchema.setBooleanProperty(ColumnMetadata.EXCLUDE_FROM_WILDCARD, true);
+ }
+ titleNameWriter = rowWriter.column(TITLE_COLUMN_NAME).scalar();
+ }
+ titleNameWriter.setString(subScan.getScanSpec().getFileName());
}
/**
diff --git
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsScanSpec.java
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsScanSpec.java
index adcfdc3596..4a73be2866 100644
---
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsScanSpec.java
+++
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/GoogleSheetsScanSpec.java
@@ -34,17 +34,20 @@ public class GoogleSheetsScanSpec implements
DrillTableSelection {
private final String tableName;
private final int tabIndex;
private final String pluginName;
+ private final String fileName;
public GoogleSheetsScanSpec(@JsonProperty("sheetID") String sheetID,
@JsonProperty("config")
GoogleSheetsStoragePluginConfig config,
@JsonProperty("tableName") String tableName,
@JsonProperty("pluginName") String pluginName,
- @JsonProperty("tabIndex") int tabIndex) {
+ @JsonProperty("tabIndex") int tabIndex,
+ @JsonProperty("fileName") String fileName) {
this.sheetID = sheetID;
this.config = config;
this.pluginName = pluginName;
this.tableName = tableName;
this.tabIndex = tabIndex;
+ this.fileName = fileName;
}
@JsonProperty("sheetID")
@@ -72,6 +75,11 @@ public class GoogleSheetsScanSpec implements
DrillTableSelection {
return tabIndex;
}
+ @JsonProperty("fileName")
+ public String getFileName() {
+ return fileName;
+ }
+
@Override
public boolean equals(Object obj) {
if (this == obj) {
diff --git
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsDrillSchema.java
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsDrillSchema.java
index 07814dd6f2..911f74bbf8 100644
---
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsDrillSchema.java
+++
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsDrillSchema.java
@@ -62,11 +62,12 @@ public class GoogleSheetsDrillSchema extends AbstractSchema
{
private final SchemaConfig schemaConfig;
private final GoogleSheetsRootSchema parent;
private final String fileToken;
+ private final String fileName;
public GoogleSheetsDrillSchema(AbstractSchema parent, String fileToken,
GoogleSheetsStoragePlugin plugin,
SchemaConfig schemaConfig,
- Sheets sheetsService) {
+ Sheets sheetsService, String fileName) {
super(parent.getSchemaPath(), fileToken);
this.plugin = plugin;
this.schemaConfig = schemaConfig;
@@ -74,6 +75,7 @@ public class GoogleSheetsDrillSchema extends AbstractSchema {
this.parent = (GoogleSheetsRootSchema) parent;
this.sheetsService = sheetsService;
this.tableList = new ArrayList<>();
+ this.fileName = fileName;
}
@Override
@@ -159,7 +161,7 @@ public class GoogleSheetsDrillSchema extends AbstractSchema
{
(GoogleSheetsStoragePluginConfig) plugin.getConfig(),
sheet.getProperties().getTitle(),
plugin.getName(),
- tabList.indexOf(sheet)
+ tabList.indexOf(sheet), fileName
)
)
);
diff --git
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsRootSchema.java
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsRootSchema.java
index 01b5cfa45d..0c00a7ed5e 100644
---
a/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsRootSchema.java
+++
b/contrib/storage-googlesheets/src/main/java/org/apache/drill/exec/store/googlesheets/schema/GoogleSheetsRootSchema.java
@@ -74,7 +74,8 @@ public class GoogleSheetsRootSchema extends AbstractSchema {
Sheets service = plugin.getSheetsService(schemaConfig.getUserName());
// At this point we know we have a valid sheet because we obtained the
Sheet list, so we need to
// add the schema to the schemas list and return it.
- schema = new GoogleSheetsDrillSchema(this, name, plugin, schemaConfig,
service);
+ schema = new GoogleSheetsDrillSchema(this, name, plugin, schemaConfig,
service,
+ tokenMap.get(name));
schemas.put(name, schema);
}
return schema;
diff --git
a/contrib/storage-googlesheets/src/test/java/org/apache/drill/exec/store/googlesheets/TestGoogleSheetsQueries.java
b/contrib/storage-googlesheets/src/test/java/org/apache/drill/exec/store/googlesheets/TestGoogleSheetsQueries.java
index 24eab41040..2410311910 100644
---
a/contrib/storage-googlesheets/src/test/java/org/apache/drill/exec/store/googlesheets/TestGoogleSheetsQueries.java
+++
b/contrib/storage-googlesheets/src/test/java/org/apache/drill/exec/store/googlesheets/TestGoogleSheetsQueries.java
@@ -203,15 +203,17 @@ public class TestGoogleSheetsQueries extends ClusterTest {
fail(e.getMessage());
}
- String sql = String.format("SELECT _sheets FROM
googlesheets.`%s`.`MixedSheet` LIMIT 1", sheetID);
+ String sql = String.format("SELECT _sheets, _title FROM
googlesheets.`%s`.`MixedSheet` LIMIT 1",
+ sheetID);
RowSet results = queryBuilder().sql(sql).rowSet();
TupleMetadata expectedSchema = new SchemaBuilder()
.addArray("_sheets", MinorType.VARCHAR)
+ .addNullable("_title", MinorType.VARCHAR)
.buildSchema();
RowSet expected = client.rowSetBuilder(expectedSchema)
- .addRow((Object) strArray("TestSheet1", "MixedSheet"))
+ .addRow(strArray("TestSheet1", "MixedSheet"), "Drill Test Data")
.build();
new RowSetComparison(expected).verifyAndClearAll(results);