[GitHub] [spark] srowen commented on a change in pull request #33362: [SPARK-36153][SQL][DOCS] Update transform doc to match the current code

GitBox Tue, 20 Jul 2021 05:02:43 -0700


srowen commented on a change in pull request #33362:
URL: https://github.com/apache/spark/pull/33362#discussion_r672276502




##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -24,6 +24,15 @@ license: |
 The `TRANSFORM` clause is used to specify a Hive-style transform query 
specification 
 to transform the inputs by running a user-specified command or script.
 
+Spark's script transform supports two modes:
+
+  1. Hive supported disabled: Spark script transform can run without 
`spark.sql.catalogImplementation=true` 

Review comment:
       supported -> support

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -24,6 +24,15 @@ license: |
 The `TRANSFORM` clause is used to specify a Hive-style transform query 
specification 
 to transform the inputs by running a user-specified command or script.
 
+Spark's script transform supports two modes:
+
+  1. Hive supported disabled: Spark script transform can run without 
`spark.sql.catalogImplementation=true` 
+  or `SparkSession.builder.enableHiveSupport()`. In this case, now spark only 
use script transform with 
+  `ROW FORMAT DELIMITED` and treat all values passed to the script as strings. 
+  2. Hive support enabled: When run Spark with 
`spark.sql.catalogImplementation=true` or started Spark SQL 
+  with `SparkSession.builder.enableHiveSupport()`, spark can use script 
transform with Hive SerDe and both 

Review comment:
       spark can use -> Spark can use the
   Hive Serde and both -> both Hive SerDe and

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.

Review comment:
       combine -> combines

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 

Review comment:
       , Spark use -> . Spark uses
   data type -> data types

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"` is re-interpreted
+ as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`.
+ - If the actual number of output columns is more than the number of specified 
output columns,
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.

Review comment:
       Same as above

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -24,6 +24,15 @@ license: |
 The `TRANSFORM` clause is used to specify a Hive-style transform query 
specification 
 to transform the inputs by running a user-specified command or script.
 
+Spark's script transform supports two modes:
+
+  1. Hive supported disabled: Spark script transform can run without 
`spark.sql.catalogImplementation=true` 
+  or `SparkSession.builder.enableHiveSupport()`. In this case, now spark only 
use script transform with 

Review comment:
       spark only use -> Spark only uses the
   treat -> treats

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"` is re-interpreted
+ as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`.
+ - If the actual number of output columns is more than the number of specified 
output columns,
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there is no enough tab, Spark returns `NULL` value.

Review comment:
       no enough -> no
   returns -> returns the

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -24,6 +24,15 @@ license: |
 The `TRANSFORM` clause is used to specify a Hive-style transform query 
specification 
 to transform the inputs by running a user-specified command or script.
 
+Spark's script transform supports two modes:
+
+  1. Hive supported disabled: Spark script transform can run without 
`spark.sql.catalogImplementation=true` 
+  or `SparkSession.builder.enableHiveSupport()`. In this case, now spark only 
use script transform with 
+  `ROW FORMAT DELIMITED` and treat all values passed to the script as strings. 
+  2. Hive support enabled: When run Spark with 
`spark.sql.catalogImplementation=true` or started Spark SQL 

Review comment:
       When run Spark -> When Spark is run
   started Spark SQL -> Spark SQL is started

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.

Review comment:
       convert result -> convert the result

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,

Review comment:
       insufficient -> additional (?)
   supplemented -> filled

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:

Review comment:
       When Spark _uses_...
   (and similarly below)

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.

Review comment:
       as default -> as the default 
   (and below)

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:

Review comment:
       When Hive support is enabled
   Serde -> SerDe

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.

Review comment:
       an output -> the output

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 

Review comment:
       complex type -> complex types
   Spark use -> Spark uses
   cast it -> cast it to
   to input -> to an input
   use -> uses

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"` is re-interpreted
+ as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`.

Review comment:
       Same as above

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted

Review comment:
       casted -> cast

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.

Review comment:
       Is that literally the 4-character string passed? with quotes? and not a 
new line but the two chars `\N`?

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:

Review comment:
       output have -> if the output has
   and there are only two output columns

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.

Review comment:
       I'm asking whether it's correct - that seems surprising

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.

Review comment:
       OK if it's definitely that 4 character string, that's fine.
   BTW is it single or double quotes? your comment differs from the doc.

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.

Review comment:
       That's not my question. You are suggesting here that the value has four 
characters: a quote, a backlash, N, a quote. I'm not sure that's what you mean, 
as you are saying the quote is just a doc convention. If you are code-quoting 
the string, you are implying the whole thing is a literal.
   
   Here's a simple question - how many characters does the actual value have?

##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` 
by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` 
values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from 
empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell 
containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the 
data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, 
insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than 
the number of specified output columns,
-the output columns will only select the corresponding columns and the 
remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be 
`key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can 
be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When Spark use `ROW FORMAT DELIMITED` format:
+ - Spark use `\u0001` as default field delimiter and this delimiter can be 
overridden by `FIELDS TERMINATED BY`.
+ - Spark use `\n` as default line delimit and this delimiter can be overridden 
by `LINES TERMINATED BY`.
+ - Spark use `"\N"` as default  `NULL` value in order to differentiate `NULL` 
values 
+ from empty strings. This delimiter can be overridden by `NULL DEFINED AS`.
+ - Spark casts all columns to `STRING` and combine columns by tabs before 
feeding to the user script.
+ For complex type such as `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` cast it 
to input `JSON` string and use 
+ `from_json` to convert result output `JSON` string to `ARRAY`/`MAP`/`STRUCT` 
data.
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are 
delimiters to split complex data such as
+  `ARRAY`/`MAP`/`STRUCT`, Spark use `to_json` and `from_json` to handle 
complex data type with `JSON` format, so 
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in 
default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` 
columns, any cell containing only `"\N"`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` column 
will be cast to the data type specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified 
output columns,
+  insufficient output columns will be supplemented with `NULL`. 
+     ```
+     output tabs: `1, 2`
+     output columns: `A: INT, B INT, C: INT`
+     result: 
+       +---+---+------+
+       |  a|  b|     c|
+       +---+---+------+
+       |  1|  2|  NULL|
+       +---+---+------+
+     ```
+ - If the actual number of output columns is more than the number of specified 
output columns, 
+ the output columns only select the corresponding columns, and the remaining 
part will be discarded.
+ For example, output have three tabs and output columns only have two columns:
+     ```
+     output tabs: `1, 2, 3`
+     output columns: `A: INT, B INT`
+     result: 
+       +---+---+
+       |  a|  b|
+       +---+---+
+       |  1|  2|
+       +---+---+
+     ```
+ - If there is no `AS` clause after `USING my_script`, an output schema is 
`key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the 
`value` column contains the remaining characters after the first tab.
+ If there are not enough tabs, Spark pads `NULL` value.
+      ```
+      output tabs: `1, 2, 3`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|      2|
+        +-----+-------+
+   
+      output tabs: `1, 2`
+      output columns: 
+      result: 
+        +-----+-------+
+        |  key|  value|
+        +-----+-------+
+        |    1|   NULL|
+        +-----+-------+
+      ```
+
+### Hive SerDe behavior
+
+When enable Hive and use Hive Serde mode:
+ - Spark uses the Hive SerDe 
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns are 
casted
+ to `STRING` and combined by tabs before feeding to the user script.
+ - All `NULL` values are converted to the literal string `"\N"` in order to 
differentiate `NULL` values from empty strings.

Review comment:
       Right, the literal should be `\N` then




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] srowen commented on a change in pull request #33362: [SPARK-36153][SQL][DOCS] Update transform doc to match the current code

Reply via email to