AngersZhuuuu commented on a change in pull request #33362:
URL: https://github.com/apache/spark/pull/33362#discussion_r673235481
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
Specifies a command or a path to script to process data.
-### SerDe behavior
-
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns will be casted
-to `STRING` and combined by tabs before feeding to the user script. All `NULL` values will be converted
-to the literal string `"\N"` in order to differentiate `NULL` values from empty strings. The standard output of the
-user script will be treated as tab-separated `STRING` columns, any cell containing only `"\N"` will be re-interpreted
-as a `NULL` value, and then the resulting STRING column will be cast to the data type specified in `col_type`. If the actual
-number of output columns is less than the number of specified output columns, insufficient output columns will be
-supplemented with `NULL`. If the actual number of output columns is more than the number of specified output columns,
-the output columns will only select the corresponding columns and the remaining part will be discarded.
-If there is no `AS` clause after `USING my_script`, an output schema will be `key: STRING, value: STRING`.
-The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
-If there is no enough tab, Spark will return `NULL` value. These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`.
+### ROW FORMAT DELIMITED behavior
+
+When Spark uses the `ROW FORMAT DELIMITED` format:
+ - Spark uses `\u0001` as the default field delimiter; this delimiter can be overridden by `FIELDS TERMINATED BY`.
+ - Spark uses `\n` as the default line delimiter; this delimiter can be overridden by `LINES TERMINATED BY`.
+ - Spark uses the literal string `\N` as the default `NULL` value in order to differentiate `NULL` values
+ from the literal string `NULL`. This default can be overridden by `NULL DEFINED AS` (see the first example query after this list).
+ - Spark casts all columns to `STRING` and combines them by tabs before feeding them to the user script.
+ For complex types such as `ARRAY`/`MAP`/`STRUCT`, Spark uses `to_json` to cast them to input `JSON` strings and uses
+ `from_json` to convert the output `JSON` strings back to `ARRAY`/`MAP`/`STRUCT` data (see the second example query after this list).
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are delimiters for splitting complex data such as
+ `ARRAY`/`MAP`/`STRUCT`, but since Spark uses `to_json` and `from_json` to handle these complex data types in `JSON` format,
+ `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` do not take effect in the default row format.
+ - The standard output of the user script is treated as tab-separated `STRING` columns; any cell containing only the literal string `\N`
+ is re-interpreted as a `NULL` value, and then the resulting `STRING` columns are cast to the data types specified in `col_type`.
+ - If the actual number of output columns is less than the number of specified output columns,
+ the missing output columns are filled with `NULL`. For example:
+ ```
+ output tabs: 1, 2
+ output columns: a: INT, b: INT, c: INT
+ result:
+ +---+---+------+
+ |  a|  b|     c|
+ +---+---+------+
+ |  1|  2|  NULL|
+ +---+---+------+
+ ```
+ - If the actual number of output columns is more than the number of specified output columns,
+ only the corresponding leading columns are kept, and the remaining part is discarded.
+ For example, if the output has three tab-separated cells but there are only two output columns:
+ ```
+ output tabs: 1, 2, 3
+ output columns: a: INT, b: INT
+ result:
+ +---+---+
+ |  a|  b|
+ +---+---+
+ |  1|  2|
+ +---+---+
+ ```
+ - If there is no `AS` clause after `USING my_script`, the output schema is `key: STRING, value: STRING`.
+ The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
+ If there is no tab, Spark returns `NULL` for the `value` column. For example:
+ ```
+ output tabs: 1, 2, 3
+ output columns:
+ result:
+ +-----+-------+
+ |  key|  value|
+ +-----+-------+
+ |    1|   2\t3|
+ +-----+-------+
+
+ output tabs: 1
+ output columns:
+ result:
+ +-----+-------+
+ |  key|  value|
+ +-----+-------+
+ |    1|   NULL|
+ +-----+-------+
+ ```
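+
+For illustration, here is a minimal sketch of a query that overrides the default field, line, and `NULL` delimiters on both the input and the output side. The table `t` and its `STRING` columns `a` and `b` are hypothetical, and `/bin/cat` simply echoes its input:
+```sql
+SELECT TRANSFORM (a, b)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ','
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+    USING '/bin/cat' AS (a STRING, b STRING)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ','
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+FROM t;
+```
+With these clauses the script sees comma-separated fields and the string `NULL` for null values, instead of the `\u0001` and `\N` defaults.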
+
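+A second minimal sketch (the inline subquery and its column names are hypothetical) shows a complex output type in the default row format:
+```sql
+SELECT TRANSFORM (k, vals)
+    USING '/bin/cat' AS (k STRING, vals ARRAY<INT>)
+FROM (SELECT 'a' AS k, array(1, 2, 3) AS vals) t;
+```
+Here the script receives `a` and the `JSON` string `[1,2,3]` joined by a tab, and its echoed output is converted back, with `from_json` turning `[1,2,3]` into `ARRAY<INT>` data.
+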
+### Hive SerDe behavior
+
+When Hive support is enabled and Hive SerDe mode is used:
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ - The standard output of the user script is treated as tab-separated `STRING` columns; any cell containing only the literal string `\N`
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ - `COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` are delimiters for splitting complex data such as
+ `ARRAY`/`MAP`/`STRUCT`, but since Spark uses `to_json` and `from_json` to handle these complex data types in `JSON` format,
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ For complex types such as `ARRAY`/`MAP`/`STRUCT`, Spark uses `to_json` to cast them to input `JSON` strings and uses
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ - Spark uses the literal string `\N` as the default `NULL` value in order to differentiate `NULL` values
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ - Spark uses `\n` as the default line delimiter; this delimiter can be overridden by `LINES TERMINATED BY`.
Review comment:
Done
##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,19 +66,85 @@ SELECT TRANSFORM ( expression [ , ... ] )
+ - Spark uses `\u0001` as the default field delimiter; this delimiter can be overridden by `FIELDS TERMINATED BY`.
Review comment:
Done
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]