HyukjinKwon commented on a change in pull request #33362:
URL: https://github.com/apache/spark/pull/33362#discussion_r670328050



##########
File path: docs/sql-ref-syntax-qry-select-transform.md
##########
@@ -57,16 +65,38 @@ SELECT TRANSFORM ( expression [ , ... ] )
 
     Specifies a command or a path to script to process data.
 
-### SerDe behavior
+### ROW FORMAT DELIMITED BEHAVIOR
+
+When spark use `ROW FORMAT DELIMITED` format, Spark will use `\u0001` as default filed delimit,
+use `\n` as default line delimit and use `"\N"` as `NULL` value in order to differentiate `NULL` values
+from empty strings. These delimit can be overridden by `FIELDS TERMINATED BY`, `LINES TERMINATED BY` and
+`NULL TERMINATED AS`. Since we use `to_json` and `from_json` to handle complex data type, so
+`COLLECTION ITEMS TERMINATED BY` and `MAP KEYS TERMINATED BY` won't work in current code.
+Spark will cast all columns to `STRING` and combined by tabs before feeding to the user script.
+For complex type such as `ARRAY\MAP\STRUCT`, spark use `to_json` cast it to input json string
+and use `from_json` to convert result output to `ARRAY/MAP/STRUCT` data. The standard output of
+the user script will be treated as tab-separated `STRING` columns, any cell containing only `"\N"`
+will be re-interpreted as a `NULL` value, and then the resulting STRING column will be cast to the
+data type specified in `col_type`. If the actual number of output columns is less than the number
+of specified output columns, insufficient output columns will be supplemented with `NULL`.
+If the actual number of output columns is more than the number of specified output columns,
+the output columns will only select the corresponding columns and the remaining part will be discarded.
 
-Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns will be casted
+If there is no `AS` clause after `USING my_script`, an output schema will be `key: STRING, value: STRING`.
+The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
+If there is no enough tab, Spark will return `NULL` value. These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`.
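
For illustration, here is a minimal sketch of the behavior the new section describes. It is not part of the diff above: the table `t(a INT, b STRING)` is hypothetical, and `cat` is used as the script because it simply echoes its input back unchanged.

```sql
-- Delimiters overridden explicitly on both sides of the script; by default
-- Spark feeds the script tab-separated STRING values and reads
-- tab-separated STRING values back from its standard output.
SELECT TRANSFORM (a, b)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
    USING 'cat' AS (a1 STRING, b1 STRING)
    ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'
FROM t;

-- With no AS clause, the output schema defaults to `key: STRING, value: STRING`:
-- `key` is everything before the first tab of each output line, `value` is
-- everything after it.
SELECT TRANSFORM (a, b)
    USING 'cat'
FROM t;
```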

Review comment:
       BTW, avoid using the future tense. For example, replace `Spark will return` with `Spark returns`.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


