HeartSaVioR commented on a change in pull request #31384:
URL: https://github.com/apache/spark/pull/31384#discussion_r569819667



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark
+and the usage or custom provider implementation is not obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) are making JDBC connections initiated by JDBC sources.
+When a Spark source initiates JDBC connection it looks for a CP which supports the included driver,
+the user just need to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where connection is initiated. The way how CP lookup happens described in later chapter.
+
+Spark initially provided non-authenticated or user/password authenticated connections.
+This is quite insecure and some of the users expected stronger authentication possibilities.
+This need fulfilled in 2 ways:
+ * Embedded CPs added which support kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab)
+ * `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API added which allows developers
+   to implement any kind of database/use-case specific authentication method.
+
+## How JDBC connection providers loaded?
+
+CPs are loaded with service loader independently. So, if one CP is failed to load it has no
+effect on all other CPs.
+
+## How to disable JDBC connection providers?
+
+There are cases where the embedded CP doesn't provide the exact feature which needed
+so they can be turned off and can be replaced with custom implementation. All CPs must provide a `name`
+which must be unique. One can set the following configuration entry in `SparkConf` to turn off CPs:
+`spark.sql.sources.disabledJdbcConnProviderList=name1,name2`.
+
+## How a JDBC connection provider found when new connection initiated?
+
+CPs has a mandatory API which must be implemented:
+
+`def canHandle(driver: Driver, options: Map[String, String]): Boolean`
+
+If this function returns `true` then `Spark` considers the CP can handle the connection setup.
+Embedded CPs returning `true` in the following cases:
+* If the connection is not secure (no `keytab` or `principal` provided) then the `basic` named CP responds.
+* If the connection is secure (`keytab` and `principal` provided) then the database specific CP responds.
+  Database specific providers are checking the JDBC driver class name and the decision is made based
+  on that. For example `PostgresConnectionProvider` responds only when the driver class name is `org.postgresql.Driver`.
+
+Important to mention that exactly one CP can return `true` from `canHandle` for a particular connection
+request because otherwise `Spark` can't decide which CP need to be used to make the connection.
+Such cases exception is thrown and the data processing stops.
+
+## How to implement a custom JDBC connection provider?
+
+I've added an example CP to the examples project (which does nothing).

Review comment:
       Let's avoid using `I` in the doc, just as we avoid using `@author` in code. Probably simpler to say `"Spark provides an example CP in the examples project (which does nothing)."`.
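   For that example, a minimal do-nothing provider could look like the sketch below. This assumes the Spark 3.1 `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API; the class name and the `connectionProvider` selection option are made up for illustration:

```scala
import java.sql.{Connection, Driver}
import java.util.Properties

import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// Hypothetical provider for illustration; class name and selection option are made up.
class ExampleJdbcConnectionProvider extends JdbcConnectionProvider {
  // Unique name; spark.sql.sources.disabledJdbcConnProviderList matches on this.
  override val name: String = "example"

  // Respond only when the caller explicitly asks for this provider, so that
  // exactly one provider returns true for a given connection request.
  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("connectionProvider").contains(name)

  // A real provider would perform its authentication here before connecting.
  override def getConnection(driver: Driver, options: Map[String, String]): Connection =
    driver.connect(options("url"), new Properties())

  // Whether getConnection mutates the JVM-wide security context (e.g. JAAS
  // configuration); Spark uses this to decide if such calls need synchronization.
  override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
    false
}
```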

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark
+and the usage or custom provider implementation is not obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) are making JDBC connections initiated by JDBC sources.
+When a Spark source initiates JDBC connection it looks for a CP which supports the included driver,
+the user just need to provide the `keytab` location and the `principal`. The `keytab` file must exist

Review comment:
       Looks like the content below doesn't match "why use them?" - it should go into another chapter, like "How to configure security on JDBC connection providers?".
   
   If you'd like to say that the reason to use a JDBC CP is that it makes kerberos authentication easy to configure, it seems better to simply say so and separate the details into another chapter.
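   From the user's point of view the security setup is just two extra data source options, so the "why" can stay short and the details can move. A sketch of the user-side configuration, assuming `spark` is the active `SparkSession` and all option values are placeholders:

```scala
// Kerberos-authenticated JDBC read; url, table, keytab path and principal are placeholders.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db.example.com:5432/mydb")
  .option("dbtable", "myschema.mytable")
  // The keytab file must exist on every node that initiates the connection.
  .option("keytab", "/etc/security/keytabs/dbuser.keytab")
  .option("principal", "dbuser@EXAMPLE.COM")
  .load()
```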

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark
+and the usage or custom provider implementation is not obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) are making JDBC connections initiated by JDBC sources.
+When a Spark source initiates JDBC connection it looks for a CP which supports the included driver,
+the user just need to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where connection is initiated. The way how CP lookup happens described in later chapter.
+
+Spark initially provided non-authenticated or user/password authenticated connections.
+This is quite insecure and some of the users expected stronger authentication possibilities.
+This need fulfilled in 2 ways:

Review comment:
       Spark provides two ways to deal with stronger authentication:

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark
+and the usage or custom provider implementation is not obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) are making JDBC connections initiated by JDBC sources.
+When a Spark source initiates JDBC connection it looks for a CP which supports the included driver,
+the user just need to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where connection is initiated. The way how CP lookup happens described in later chapter.
+
+Spark initially provided non-authenticated or user/password authenticated connections.
+This is quite insecure and some of the users expected stronger authentication possibilities.
+This need fulfilled in 2 ways:
+ * Embedded CPs added which support kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab)

Review comment:
       Let's use either "built-in" or "embedded" and try to be consistent with the sql jdbc doc.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark
+and the usage or custom provider implementation is not obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) are making JDBC connections initiated by JDBC sources.
+When a Spark source initiates JDBC connection it looks for a CP which supports the included driver,
+the user just need to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where connection is initiated. The way how CP lookup happens described in later chapter.
+
+Spark initially provided non-authenticated or user/password authenticated connections.
+This is quite insecure and some of the users expected stronger authentication possibilities.
+This need fulfilled in 2 ways:
+ * Embedded CPs added which support kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab)
+ * `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API added which allows developers
+   to implement any kind of database/use-case specific authentication method.
+
+## How JDBC connection providers loaded?
+
+CPs are loaded with service loader independently. So, if one CP is failed to load it has no
+effect on all other CPs.
+
+## How to disable JDBC connection providers?
+
+There are cases where the embedded CP doesn't provide the exact feature which needed
+so they can be turned off and can be replaced with custom implementation. All CPs must provide a `name`
+which must be unique. One can set the following configuration entry in `SparkConf` to turn off CPs:
+`spark.sql.sources.disabledJdbcConnProviderList=name1,name2`.
+
+## How a JDBC connection provider found when new connection initiated?
+
+CPs has a mandatory API which must be implemented:
+
+`def canHandle(driver: Driver, options: Map[String, String]): Boolean`
+
+If this function returns `true` then `Spark` considers the CP can handle the connection setup.
+Embedded CPs returning `true` in the following cases:
+* If the connection is not secure (no `keytab` or `principal` provided) then the `basic` named CP responds.
+* If the connection is secure (`keytab` and `principal` provided) then the database specific CP responds.
+  Database specific providers are checking the JDBC driver class name and the decision is made based
+  on that. For example `PostgresConnectionProvider` responds only when the driver class name is `org.postgresql.Driver`.
+
+Important to mention that exactly one CP can return `true` from `canHandle` for a particular connection

Review comment:
       can -> should or must (as it sounds like a requirement)
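   It may also be worth showing what a user does when a custom provider targets a driver that a built-in provider already claims: to keep the "exactly one" requirement, the clashing built-in provider has to be disabled by its unique name. A sketch (only `basic` is a name confirmed by this doc; the app name is made up):

```scala
import org.apache.spark.sql.SparkSession

// Disable a built-in provider by its unique name so that only the custom
// provider's canHandle returns true for the connection request.
val spark = SparkSession.builder()
  .appName("custom-jdbc-provider")
  .config("spark.sql.sources.disabledJdbcConnProviderList", "basic")
  .getOrCreate()
```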

##########
File path: docs/sql-data-sources-jdbc.md
##########
@@ -213,6 +213,20 @@ the following case-insensitive options:
   </tr>
 </table>
 
+Note that kerberos authentication with keytab is not always supported by the JDBC driver.<br>
+Before using <code>keytab</code> and <code>principal</code> configuration options, please make sure the following requirements are met:
+* The included JDBC driver version supports kerberos authentication with keytab.
+* There is a built-in connection provider which supports the used database.
+
+There is a built-in connection provider for the following databases:

Review comment:
       There're built-in connection providers
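   One more thing that could be mentioned alongside the built-in provider list: custom providers are discovered through the same Java `ServiceLoader` mechanism as the built-in ones, so a custom provider jar needs a provider-configuration file named after the interface. The class name below is hypothetical:

```text
# src/main/resources/META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider
com.example.ExampleJdbcConnectionProvider
```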




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


