gaborgsomogyi commented on a change in pull request #31384:
URL: https://github.com/apache/spark/pull/31384#discussion_r570119198



##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark,
+since neither their usage nor custom provider implementation is obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources.
+When a Spark source initiates a JDBC connection, it looks for a CP which supports the included driver;
+the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where a connection is initiated. How CP lookup happens is described in a later chapter.
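+
+As a minimal sketch, a Kerberos-authenticated JDBC read might look like this (assuming an existing
+`SparkSession` named `spark`; the URL, table, keytab path and principal below are placeholder values):
+
+```scala
+// Hypothetical connection details, for illustration only.
+val df = spark.read
+  .format("jdbc")
+  .option("url", "jdbc:postgresql://dbserver:5432/mydb")
+  .option("dbtable", "mytable")
+  .option("keytab", "/etc/security/keytabs/dbuser.keytab")  // must exist on every node
+  .option("principal", "dbuser@EXAMPLE.COM")
+  .load()
+```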
+
+Spark initially provided only non-authenticated or user/password authenticated connections.
+This is quite insecure, and some users expected stronger authentication possibilities.
+This need was fulfilled in two ways:
+ * Embedded CPs were added which support Kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab).
+ * The `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API was added, which allows developers
+   to implement any kind of database/use-case specific authentication method.
+
+## How are JDBC connection providers loaded?
+
+CPs are loaded independently with the service loader, so if one CP fails to load it has no
+effect on the other CPs.
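+
+Since loading happens through Java's `ServiceLoader`, a custom CP is made discoverable by listing its
+fully qualified class name in a service descriptor on the classpath. A sketch (the implementation class
+name below is hypothetical):
+
+```
+# contents of META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider
+com.example.MyConnectionProvider
+```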
+
+## How to disable JDBC connection providers?
+
+There are cases where an embedded CP doesn't provide the exact feature needed,
+so it can be turned off and replaced with a custom implementation. All CPs must provide a `name`
+which must be unique. One can set the following configuration entry in `SparkConf` to turn off CPs:
+`spark.sql.sources.disabledJdbcConnProviderList=name1,name2`.
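+
+For example, turning off two providers could look like the following sketch (the provider names are
+illustrative; use the `name` values of the CPs you actually want to disable):
+
+```scala
+import org.apache.spark.sql.SparkSession
+
+val spark = SparkSession.builder()
+  // Comma-separated list of CP names to turn off.
+  .config("spark.sql.sources.disabledJdbcConnProviderList", "basic,postgres")
+  .getOrCreate()
+```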
+
+## How is a JDBC connection provider found when a new connection is initiated?
+
+CPs have a mandatory API which must be implemented:
+
+`def canHandle(driver: Driver, options: Map[String, String]): Boolean`
+
+If this function returns `true`, then Spark considers that the CP can handle the connection setup.
+Embedded CPs return `true` in the following cases:
+* If the connection is not secure (no `keytab` or `principal` provided), then the CP named `basic` responds.
+* If the connection is secure (`keytab` and `principal` provided), then the database-specific CP responds.
+  Database-specific providers check the JDBC driver class name and make the decision based on that.
+  For example, `PostgresConnectionProvider` responds only when the driver class name is `org.postgresql.Driver`.
+
+It is important to mention that exactly one CP may return `true` from `canHandle` for a particular connection
+request, because otherwise Spark can't decide which CP should be used to make the connection.
+In such cases an exception is thrown and the data processing stops.
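+
+To make this concrete, here is a sketch of a custom CP (the class, its `name` and the driver check are
+hypothetical; a real implementation would also perform the actual authentication in `getConnection`):
+
+```scala
+import java.sql.{Connection, Driver}
+import java.util.Properties
+
+import org.apache.spark.sql.jdbc.JdbcConnectionProvider
+
+class MyConnectionProvider extends JdbcConnectionProvider {
+  // Unique name; also usable in spark.sql.sources.disabledJdbcConnProviderList.
+  override val name: String = "my-custom"
+
+  // Respond only to the driver this provider knows how to authenticate.
+  override def canHandle(driver: Driver, options: Map[String, String]): Boolean =
+    driver.getClass.getName == "com.example.jdbc.ExampleDriver"
+
+  override def getConnection(driver: Driver, options: Map[String, String]): Connection = {
+    // Any custom authentication would happen here before connecting.
+    driver.connect(options("url"), new Properties())
+  }
+
+  // True when connection setup changes JVM-global security state (e.g. JAAS config).
+  override def modifiesSecurityContext(driver: Driver, options: Map[String, String]): Boolean =
+    false
+}
+```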
+
+## How to implement a custom JDBC connection provider?
+
+I've added an example CP to the examples project (which does nothing).

Review comment:
       I felt too much ownership and you're right, changed :)

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark,
+since neither their usage nor custom provider implementation is obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources.
+When a Spark source initiates a JDBC connection, it looks for a CP which supports the included driver;
+the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist

Review comment:
       Restructured this part.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark,
+since neither their usage nor custom provider implementation is obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources.
+When a Spark source initiates a JDBC connection, it looks for a CP which supports the included driver;
+the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where a connection is initiated. How CP lookup happens is described in a later chapter.
+
+Spark initially provided only non-authenticated or user/password authenticated connections.
+This is quite insecure, and some users expected stronger authentication possibilities.
+This need was fulfilled in two ways:

Review comment:
       Changed.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark,
+since neither their usage nor custom provider implementation is obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources.
+When a Spark source initiates a JDBC connection, it looks for a CP which supports the included driver;
+the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where a connection is initiated. How CP lookup happens is described in a later chapter.
+
+Spark initially provided only non-authenticated or user/password authenticated connections.
+This is quite insecure, and some users expected stronger authentication possibilities.
+This need was fulfilled in two ways:
+ * Embedded CPs were added which support Kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab).

Review comment:
       Switched to built-in.

##########
File path: sql/core/src/main/scala/org/apache/spark/sql/jdbc/README.md
##########
@@ -0,0 +1,81 @@
+---
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+# JDBC Connection Provider Handling In Spark
+
+This document aims to explain and demystify JDBC connection providers as they are used by Spark,
+since neither their usage nor custom provider implementation is obvious.
+
+## What are JDBC connection providers and why use them?
+
+JDBC connection providers (CPs from now on) make the JDBC connections initiated by JDBC sources.
+When a Spark source initiates a JDBC connection, it looks for a CP which supports the included driver;
+the user just needs to provide the `keytab` location and the `principal`. The `keytab` file must exist
+on each node where a connection is initiated. How CP lookup happens is described in a later chapter.
+
+Spark initially provided only non-authenticated or user/password authenticated connections.
+This is quite insecure, and some users expected stronger authentication possibilities.
+This need was fulfilled in two ways:
+ * Embedded CPs were added which support Kerberos authentication using `keytab` and `principal` (but only if the JDBC driver supports keytab).
+ * The `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API was added, which allows developers
+   to implement any kind of database/use-case specific authentication method.
+
+## How are JDBC connection providers loaded?
+
+CPs are loaded independently with the service loader, so if one CP fails to load it has no
+effect on the other CPs.
+
+## How to disable JDBC connection providers?
+
+There are cases where an embedded CP doesn't provide the exact feature needed,
+so it can be turned off and replaced with a custom implementation. All CPs must provide a `name`
+which must be unique. One can set the following configuration entry in `SparkConf` to turn off CPs:
+`spark.sql.sources.disabledJdbcConnProviderList=name1,name2`.
+
+## How is a JDBC connection provider found when a new connection is initiated?
+
+CPs have a mandatory API which must be implemented:
+
+`def canHandle(driver: Driver, options: Map[String, String]): Boolean`
+
+If this function returns `true`, then Spark considers that the CP can handle the connection setup.
+Embedded CPs return `true` in the following cases:
+* If the connection is not secure (no `keytab` or `principal` provided), then the CP named `basic` responds.
+* If the connection is secure (`keytab` and `principal` provided), then the database-specific CP responds.
+  Database-specific providers check the JDBC driver class name and make the decision based on that.
+  For example, `PostgresConnectionProvider` responds only when the driver class name is `org.postgresql.Driver`.
+
+It is important to mention that exactly one CP may return `true` from `canHandle` for a particular connection

Review comment:
       Changed.

##########
File path: docs/sql-data-sources-jdbc.md
##########
@@ -213,6 +213,20 @@ the following case-insensitive options:
   </tr>
 </table>
 
+Note that Kerberos authentication with keytab is not always supported by the JDBC driver.<br>
+Before using <code>keytab</code> and <code>principal</code> configuration options, please make sure the following requirements are met:
+* The included JDBC driver version supports Kerberos authentication with keytab.
+* There is a built-in connection provider which supports the used database.
+
+There is a built-in connection provider for the following databases:

Review comment:
       Nice catch, fixed.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


