HyukjinKwon commented on a change in pull request #32745:
URL: https://github.com/apache/spark/pull/32745#discussion_r644497726
##########
File path: docs/sql-data-sources-text.md
##########
@@ -57,7 +57,7 @@ Data source options of text can be set via:
</tr>
Review comment:
Can we change `wholetext`'s default value to `<code>false</code>`?
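
(For context, not part of the suggestion itself: `wholetext` controls whether each input file is read as a single row instead of one row per line, and it defaults to `false`. A minimal sketch, with a made-up input path:)

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("wholetext-example").getOrCreate()

// Default behaviour (wholetext=false): one row per line of the input files.
val lines = spark.read.text("/tmp/notes")            // placeholder path

// wholetext=true: one row per file, with the whole file content in the `value` column.
val files = spark.read.option("wholetext", true).text("/tmp/notes")
```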
##########
File path: docs/sql-data-sources-jdbc.md
##########
@@ -111,115 +120,144 @@ logging into the data sources.
partition stride, not for filtering the rows in table. So all rows in
the table will be
partitioned and returned. This option applies only to reading.
</td>
+ <td>read</td>
</tr>
<tr>
- <td><code>numPartitions</code></td>
- <td>
- The maximum number of partitions that can be used for parallelism in
table reading and
- writing. This also determines the maximum number of concurrent JDBC
connections.
- If the number of partitions to write exceeds this limit, we decrease it
to this limit by
- calling <code>coalesce(numPartitions)</code> before writing.
- </td>
+ <td><code>numPartitions</code></td>
+ <td>(none)</td>
+ <td>
+ The maximum number of partitions that can be used for parallelism in
table reading and
+ writing. This also determines the maximum number of concurrent JDBC
connections.
+ If the number of partitions to write exceeds this limit, we decrease it
to this limit by
+ calling <code>coalesce(numPartitions)</code> before writing.
+ </td>
+ <td>read/write</td>
</tr>
<tr>
<td><code>queryTimeout</code></td>
+ <td><code>0</code></td>
<td>
The number of seconds the driver will wait for a Statement object to
execute to the given
number of seconds. Zero means there is no limit. In the write path, this
option depends on
how JDBC drivers implement the API <code>setQueryTimeout</code>, e.g.,
the h2 JDBC driver
checks the timeout of each query instead of an entire JDBC batch.
- It defaults to <code>0</code>.
</td>
+ <td>read/write</td>
</tr>
<tr>
<td><code>fetchsize</code></td>
+ <td><code>0</code></td>
<td>
- The JDBC fetch size, which determines how many rows to fetch per round
trip. This can help performance on JDBC drivers which default to low fetch size
(e.g. Oracle with 10 rows). This option applies only to reading.
+ The JDBC fetch size, which determines how many rows to fetch per round
trip. This can help performance on JDBC drivers which default to low fetch size
(e.g. Oracle with 10 rows).
</td>
+ <td>read</td>
</tr>
<tr>
- <td><code>batchsize</code></td>
- <td>
- The JDBC batch size, which determines how many rows to insert per round
trip. This can help performance on JDBC drivers. This option applies only to
writing. It defaults to <code>1000</code>.
- </td>
+ <td><code>batchsize</code></td>
+ <td><code>1000</code></td>
+ <td>
+ The JDBC batch size, which determines how many rows to insert per round
trip. This can help performance on JDBC drivers. This option applies only to
writing.
+ </td>
+ <td>write</td>
</tr>
<tr>
- <td><code>isolationLevel</code></td>
- <td>
- The transaction isolation level, which applies to current connection.
It can be one of <code>NONE</code>, <code>READ_COMMITTED</code>,
<code>READ_UNCOMMITTED</code>, <code>REPEATABLE_READ</code>, or
<code>SERIALIZABLE</code>, corresponding to standard transaction isolation
levels defined by JDBC's Connection object, with default of
<code>READ_UNCOMMITTED</code>. This option applies only to writing. Please
refer the documentation in <code>java.sql.Connection</code>.
- </td>
+ <td><code>isolationLevel</code></td>
+ <td><code>READ_UNCOMMITTED</code></td>
+ <td>
+ The transaction isolation level, which applies to the current connection. It
can be one of <code>NONE</code>, <code>READ_COMMITTED</code>,
<code>READ_UNCOMMITTED</code>, <code>REPEATABLE_READ</code>, or
<code>SERIALIZABLE</code>, corresponding to standard transaction isolation
levels defined by JDBC's Connection object, with default of
<code>READ_UNCOMMITTED</code>. Please refer to the documentation in
<code>java.sql.Connection</code>.
+ </td>
+ <td>write</td>
</tr>
<tr>
- <td><code>sessionInitStatement</code></td>
- <td>
- After each database session is opened to the remote DB and before
starting to read data, this option executes a custom SQL statement (or a PL/SQL
block). Use this to implement session initialization code. Example:
<code>option("sessionInitStatement", """BEGIN execute immediate 'alter session
set "_serial_direct_read"=true'; END;""")</code>
- </td>
+ <td><code>sessionInitStatement</code></td>
+ <td>(none)</td>
+ <td>
+ After each database session is opened to the remote DB and before
starting to read data, this option executes a custom SQL statement (or a PL/SQL
block). Use this to implement session initialization code. Example:
<code>option("sessionInitStatement", """BEGIN execute immediate 'alter session
set "_serial_direct_read"=true'; END;""")</code>
+ </td>
+ <td>read</td>
</tr>
<tr>
<td><code>truncate</code></td>
+ <td><code>false</code></td>
<td>
- This is a JDBC writer related option. When
<code>SaveMode.Overwrite</code> is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it. This can be
more efficient, and prevents the table metadata (e.g., indices) from being
removed. However, it will not work in some cases, such as when the new data has
a different schema. It defaults to <code>false</code>. This option applies only
to writing. In case of failures, users should turn off <code>truncate</code>
option to use <code>DROP TABLE</code> again. Also, due to the different
behavior of <code>TRUNCATE TABLE</code> among DBMS, it's not always safe to use
this. MySQLDialect, DB2Dialect, MsSqlServerDialect, DerbyDialect, and
OracleDialect supports this while PostgresDialect and default JDBCDirect
doesn't. For unknown and unsupported JDBCDirect, the user option
<code>truncate</code> is ignored.
+ This is a JDBC writer related option. When
<code>SaveMode.Overwrite</code> is enabled, this option causes Spark to
truncate an existing table instead of dropping and recreating it. This can be
more efficient, and prevents the table metadata (e.g., indices) from being
removed. However, it will not work in some cases, such as when the new data has
a different schema. In case of failures, users should turn off the
<code>truncate</code> option to use <code>DROP TABLE</code> again. Also, due to
the different behavior of <code>TRUNCATE TABLE</code> among DBMSs, it's not
always safe to use this. MySQLDialect, DB2Dialect, MsSqlServerDialect,
DerbyDialect, and OracleDialect support this, while PostgresDialect and the
default JDBCDialect don't. For unknown and unsupported JDBCDialects, the user
option <code>truncate</code> is ignored.
    </td>
+    <td>write</td>
</tr>
<tr>
<td><code>cascadeTruncate</code></td>
+ <td>the default cascading truncate behaviour of the JDBC database in
question, specified in the <code>isCascadeTruncate</code> in each
JDBCDialect</td>
<td>
- This is a JDBC writer related option. If enabled and supported by the
JDBC database (PostgreSQL and Oracle at the moment), this options allows
execution of a <code>TRUNCATE TABLE t CASCADE</code> (in the case of PostgreSQL
a <code>TRUNCATE TABLE ONLY t CASCADE</code> is executed to prevent
inadvertently truncating descendant tables). This will affect other tables, and
thus should be used with care. This option applies only to writing. It defaults
to the default cascading truncate behaviour of the JDBC database in question,
specified in the <code>isCascadeTruncate</code> in each JDBCDialect.
+ This is a JDBC writer related option. If enabled and supported by the
JDBC database (PostgreSQL and Oracle at the moment), this option allows
execution of a <code>TRUNCATE TABLE t CASCADE</code> (in the case of PostgreSQL
a <code>TRUNCATE TABLE ONLY t CASCADE</code> is executed to prevent
inadvertently truncating descendant tables). This will affect other tables, and
thus should be used with care.
</td>
+ <td>write</td>
</tr>
<tr>
<td><code>createTableOptions</code></td>
+ <td><code>""</code></td>
Review comment:
```suggestion
<td><code></code></td>
```
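
(As a quick illustration of how the read- and write-path options documented in the table above are typically passed; the connection URL, table names, and credentials below are placeholders, not taken from this PR:)

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("jdbc-options-example").getOrCreate()

// Read path: numPartitions, fetchsize and queryTimeout apply when loading.
val orders = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")  // placeholder URL
  .option("dbtable", "public.orders")                     // placeholder table
  .option("user", "spark")                                // placeholder credentials
  .option("password", "secret")
  .option("numPartitions", "8")      // also caps concurrent JDBC connections
  .option("fetchsize", "1000")       // rows fetched per round trip
  .option("queryTimeout", "30")      // seconds; 0 (the default) means no limit
  .load()

// Write path: batchsize, isolationLevel and truncate apply when saving.
orders.write
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/sales")
  .option("dbtable", "public.orders_backup")              // placeholder table
  .option("user", "spark")
  .option("password", "secret")
  .option("batchsize", "1000")                 // rows inserted per round trip
  .option("isolationLevel", "READ_COMMITTED")  // default is READ_UNCOMMITTED
  .option("truncate", "true")   // with Overwrite: TRUNCATE instead of DROP/CREATE
  .mode(SaveMode.Overwrite)
  .save()
```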
##########
File path: docs/sql-data-sources-json.md
##########
@@ -114,62 +114,62 @@ Data source options of JSON can be set via:
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too.
-->
<td><code>timeZone</code></td>
- <td>None</td>
+ <td>The SQL config <code>spark.sql.session.timeZone</code></td>
Review comment:
```suggestion
<td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
```
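
(To show what the suggested wording means in practice: when the `timeZone` option is not set, the session-level `spark.sql.session.timeZone` configuration is used, and setting the option overrides it for that read only. A small sketch; the input path is a placeholder:)

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("json-timezone-example")
  .config("spark.sql.session.timeZone", "UTC")  // session default used when the option is absent
  .getOrCreate()

// No timeZone option: timestamps in the JSON are parsed/formatted with the session time zone (UTC).
val eventsUtc = spark.read.json("/tmp/events.json")        // placeholder path

// Explicit timeZone option: overrides the session setting for this read only.
val eventsSeoul = spark.read
  .option("timeZone", "Asia/Seoul")
  .json("/tmp/events.json")
```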
##########
File path: docs/sql-data-sources-json.md
##########
@@ -114,62 +114,62 @@ Data source options of JSON can be set via:
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too.
-->
<td><code>timeZone</code></td>
- <td>None</td>
+ <td>The SQL config <code>spark.sql.session.timeZone</code></td>
Review comment:
to match with https://spark.apache.org/docs/latest/configuration.html
##########
File path: docs/sql-data-sources-json.md
##########
@@ -114,62 +114,62 @@ Data source options of JSON can be set via:
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too.
-->
<td><code>timeZone</code></td>
- <td>None</td>
+ <td>The SQL config <code>spark.sql.session.timeZone</code></td>
Review comment:
can you fix other instances too? e.g. in CSV as well
https://github.com/apache/spark/blob/73fd6de9a18e8b550fd9afbcf9c87efa598fd76e/docs/sql-data-sources-csv.md
##########
File path: docs/sql-data-sources-json.md
##########
@@ -114,62 +114,62 @@ Data source options of JSON can be set via:
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too.
-->
<td><code>timeZone</code></td>
- <td>None</td>
+ <td>The SQL config <code>spark.sql.session.timeZone</code></td>
Review comment:
and parquet
https://github.com/apache/spark/blob/73fd6de9a18e8b550fd9afbcf9c87efa598fd76e/docs/sql-data-sources-parquet.md
##########
File path: docs/sql-data-sources-json.md
##########
@@ -114,62 +114,62 @@ Data source options of JSON can be set via:
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too.
-->
<td><code>timeZone</code></td>
- <td>None</td>
+ <td>The SQL config <code>spark.sql.session.timeZone</code></td>
Review comment:
and avro
https://github.com/apache/spark/blob/73fd6de9a18e8b550fd9afbcf9c87efa598fd76e/docs/sql-data-sources-avro.md
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]