sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1358449841


##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<people>
+    <person>
+        <name>Michael</name>
+        <age>29</age>
+    </person>
+    <person>
+        <name>Andy</name>
+        <age>30</age>
+    </person>
+    <person>
+        <name>Justin</name>
+        <age>19</age>
+    </person>
+</people>

Review Comment:
   ```suggestion
   </people>
   
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The `option()` function
+can be used to customize reading or writing behavior, such as controlling handling of XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element 
having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should 
be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for 
each row individually. Rows that fail to validate are treated like parse errors 
as above. The XSD does not otherwise affect the schema provided, or 
inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are 
ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be 
treated as if both are just &lt;author>. Note that, at the moment, namespaces 
cannot be ignored on the rowTag element, only its children. Note that XML 
parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasource or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 
'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example 
'-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of 
'+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can 
be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern 
format. This applies to timestamp type.</td>

Review Comment:
   ```suggestion
       <td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type.</td>
   ```
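
For context on the suggested wording: Spark's default `timestampFormat` of `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` uses Java `DateTimeFormatter` semantics, where the bracketed sections are optional. A rough plain-Python illustration of that optionality (not Spark's actual parser; `parse_ts` and the `strptime` candidates are just an analogy) might look like:

```python
from datetime import datetime

# Approximate strptime equivalents of yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX];
# strptime has no optional sections, so each bracket combination is tried.
CANDIDATES = [
    "%Y-%m-%dT%H:%M:%S.%f%z",  # fractional seconds and zone offset
    "%Y-%m-%dT%H:%M:%S.%f",    # fractional seconds only
    "%Y-%m-%dT%H:%M:%S%z",     # zone offset only
    "%Y-%m-%dT%H:%M:%S",       # bare timestamp
]

def parse_ts(text: str) -> datetime:
    for fmt in CANDIDATES:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue
    raise ValueError(f"unparseable timestamp: {text!r}")

print(parse_ts("2023-10-13T12:34:56"))
print(parse_ts("2023-10-13T12:34:56.789+01:00"))
```

Both forms above are accepted by the default pattern, which is why the docs page should point readers at the datetime pattern reference rather than restate the grammar.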



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The `option()` function
+can be used to customize reading or writing behavior, such as controlling handling of XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element 
having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should 
be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for 
each row individually. Rows that fail to validate are treated like parse errors 
as above. The XSD does not otherwise affect the schema provided, or 
inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are 
ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be 
treated as if both are just &lt;author>. Note that, at the moment, namespaces 
cannot be ignored on the rowTag element, only its children. Note that XML 
parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasource or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 
'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example 
'-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of 
'+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can 
be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern 
format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. 
This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, 
locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of the xml files. For example, in <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>declaration</code></td>
+      <td><code>version="1.0" encoding="UTF-8" standalone="yes"</code></td>
+      <td>Content of XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of foo causes &lt;?xml foo?&gt; to be written. Set to empty string to suppress.</td>
+      <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>arrayElementName</code></td>
+    <td>item</td>
+    <td>Name of XML element that encloses each element of an array-valued 
column when writing.</td>
+    <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>nullValue</code></td>
+    <td>null</td>
+    <td>Sets the string representation of a null value. Default is string 
null. When this is null, it does not write attributes and elements for 
fields.</td>
+    <td>read</td>

Review Comment:
   ```suggestion
       <td>read/write</td>
   ```



##########
examples/src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:
##########
@@ -101,14 +103,15 @@ public static void main(String[] args) {
       .config("spark.some.config.option", "some-value")
       .getOrCreate();
 
-    runBasicDataSourceExample(spark);
-    runGenericFileSourceOptionsExample(spark);
-    runBasicParquetExample(spark);
-    runParquetSchemaMergingExample(spark);
-    runJsonDatasetExample(spark);
-    runCsvDatasetExample(spark);
-    runTextDatasetExample(spark);
-    runJdbcDatasetExample(spark);
+//    runBasicDataSourceExample(spark);
+//    runGenericFileSourceOptionsExample(spark);
+//    runBasicParquetExample(spark);
+//    runParquetSchemaMergingExample(spark);
+//    runJsonDatasetExample(spark);
+//    runCsvDatasetExample(spark);
+//    runTextDatasetExample(spark);
+//    runJdbcDatasetExample(spark);

Review Comment:
   uncomment



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,54 @@ object SQLDataSourceExample {
       .jdbc("jdbc:postgresql:dbserver", "schema.tablename", 
connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) 
encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.option("rowTag", "person").xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 
13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for a XML dataset represented 
by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<person>
+        |    <name>laglangyue</name>
+        |    <job>Developer</job>
+        |    <age>28</age>
+        |</person>
+        |""".stripMargin :: Nil)
+    val otherPeople = spark.read
+      .option("rootTag", "people")

Review Comment:
   remove this line
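
Agreed that `rootTag` is irrelevant on read here. For anyone following the thread, the effect of `rowTag` alone in the Scala example above can be sketched in plain Python (this is an illustration, not Spark's parser; `xml_rows` is a made-up helper): each element matching `rowTag` becomes one row, and its child elements become columns.

```python
import xml.etree.ElementTree as ET

def xml_rows(xml_text: str, row_tag: str) -> list[dict]:
    """Collect one dict per element named row_tag, child tags as keys."""
    root = ET.fromstring(xml_text)
    # If the root itself is the row tag (as in the Dataset[String] example),
    # treat the whole document as a single row.
    elems = [root] if root.tag == row_tag else root.iter(row_tag)
    return [{child.tag: child.text for child in elem} for elem in elems]

doc = """
<people>
    <person><name>Michael</name><age>29</age></person>
    <person><name>Justin</name><age>19</age></person>
</people>
"""
print(xml_rows(doc, "person"))
# [{'name': 'Michael', 'age': '29'}, {'name': 'Justin', 'age': '19'}]
```

Note that in this sketch every value comes back as a string; Spark's `inferSchema` is what turns `age` into a long.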



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The `option()` function
+can be used to customize reading or writing behavior, such as controlling handling of XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in the element 
having no child.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespaces from values being read should 
be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for 
each row individually. Rows that fail to validate are treated like parse errors 
as above. The XSD does not otherwise affect the schema provided, or 
inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespaces prefixes on XML elements and attributes are 
ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be 
treated as if both are just &lt;author>. Note that, at the moment, namespaces 
cannot be ignored on the rowTag element, only its children. Note that XML 
parsing is in general not namespace-aware even if false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format timestamps in the XML datasource or partition values. The following formats of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 
'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example 
'-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of 
'+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended to use because they can 
be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern 
format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. 
This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, 
locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of the xml files. For example, in <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code>, the appropriate value would be books. It can include basic attributes by specifying a value like books foo="bar".</td>
+      <td>read</td>

Review Comment:
   ```suggestion
         <td>write</td>
   ```
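
To make the write-only scoping concrete: `rootTag` and `rowTag` only shape the XML that Spark *emits*, nesting one row element per DataFrame row under a single root. A plain-Python sketch of that shape (an illustration only; `rows_to_xml` is a made-up helper, not Spark's writer):

```python
import xml.etree.ElementTree as ET

def rows_to_xml(rows: list[dict], root_tag: str, row_tag: str) -> str:
    """Nest one row_tag element per row under a single root_tag element."""
    root = ET.Element(root_tag)
    for row in rows:
        row_elem = ET.SubElement(root, row_tag)
        for col, value in row.items():
            ET.SubElement(row_elem, col).text = str(value)
    return ET.tostring(root, encoding="unicode")

print(rows_to_xml([{"title": "Spark", "price": 30}], "books", "book"))
# <books><book><title>Spark</title><price>30</price></book></books>
```

This mirrors roughly what `df.write.option("rootTag", "books").option("rowTag", "book").xml(path)` produces, which is why `rootTag` has no meaning on the read path.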



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate the XML element that maps to a `DataFrame row`. The `option()` function
+can be used to customize reading or writing behavior, such as controlling handling of XML attributes, XSD validation, compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your xml files to treat as a row. For example, in this xml: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code> the appropriate value would be book.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores whole corrupted records. This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in an element 
that has no child elements.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespace in values being read should 
be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for 
each row individually. Rows that fail to validate are treated like parse errors 
as above. The XSD does not otherwise affect the schema provided, or 
inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespace prefixes on XML elements and attributes are 
ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be 
treated as if both are just &lt;author>. Note that, at the moment, namespaces 
cannot be ignored on the rowTag element, only its children. Note that XML 
parsing is in general not namespace-aware even if this is false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format 
timestamps in the XML datasources or partition values. The following formats 
of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 
'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example 
'-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of 
'+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended because they can 
be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern 
format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. 
This applies to date type.</td>

Review Comment:
   ```suggestion
       <td>Sets the string that indicates a date format. Custom date formats 
follow the formats at <a 
href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html";> 
datetime pattern</a>. This applies to date type.</td>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,224 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides `spark.read().xml("file_1_path","file_2_path")` to read a 
file or directory of files in XML format into a Spark DataFrame,
+and `dataframe.write().xml("path")` to write to an XML file.
+When reading an XML file, the `rowTag` option needs to be specified to indicate 
the XML element that maps to a `DataFrame` row. The `option()` function
+can be used to customize the behavior of reading or writing, such as 
controlling the handling of XML attributes, XSD validation, compression, and so
+on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset 
scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset 
java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of XML can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_xml`
+    * `to_xml`
+    * `schema_of_xml`
+* `OPTIONS` clause at [CREATE TABLE USING 
DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
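The `attributePrefix` and `valueTag` options in the table below can be illustrated with a plain-Python sketch using only the standard library. This is not Spark code; it is a hedged illustration, with a made-up `<price>` element, of how an element that carries attributes but no child elements maps to fields under the defaults `attributePrefix="_"` and `valueTag="_VALUE"`:

```python
# Illustration only (not Spark internals): an attribute becomes a field named
# with the attributePrefix, and the bare text becomes the valueTag field.
import xml.etree.ElementTree as ET

elem = ET.fromstring('<price currency="USD">9.99</price>')

attribute_prefix = "_"   # .option("attributePrefix", "_")
value_tag = "_VALUE"     # .option("valueTag", "_VALUE")

record = {attribute_prefix + k: v for k, v in elem.attrib.items()}
if elem.text and not list(elem):  # attributes present, but no child elements
    record[value_tag] = elem.text
print(record)
# -> {'_currency': 'USD', '_VALUE': '9.99'}
```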
+
+<table class="table table-striped">
+  <thead><tr><th><b>Property 
Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr></thead>
+  <tr>
+    <td><code>rowTag</code></td>
+    <td>ROW</td>
+    <td>The row tag of your XML files to treat as a row. For example, in this 
XML: <code>&lt;books&gt; &lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code>, the 
appropriate value would be <code>book</code>.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>samplingRatio</code></td>
+    <td><code>1.0</code></td>
+    <td>Defines fraction of rows used for schema inferring. XML built-in 
functions ignore this option.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>excludeAttribute</code></td>
+    <td><code>false</code></td>
+    <td>Whether to exclude attributes in elements.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>mode</code></td>
+    <td><code>PERMISSIVE</code></td>
+    <td>Allows a mode for dealing with corrupt records during parsing.<br>
+    <ul>
+      <li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the 
malformed string into a field configured by 
<code>columnNameOfCorruptRecord</code>, and sets malformed fields to 
<code>null</code>. To keep corrupt records, a user can set a string type field 
named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a 
schema does not have the field, it drops corrupt records during parsing. When 
inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> 
field in an output schema.</li>
+      <li><code>DROPMALFORMED</code>: ignores the whole corrupted record. 
This mode is unsupported in the XML built-in functions.</li>
+      <li><code>FAILFAST</code>: throws an exception when it meets corrupted 
records.</li>
+    </ul>
+    </td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>inferSchema</code></td>
+      <td>true</td>
+      <td>If true, attempts to infer an appropriate type for each resulting 
DataFrame column. If false, all resulting columns are of string type. Default 
is true. XML built-in functions ignore this option.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>columnNameOfCorruptRecord</code></td>
+      <td><code>spark.sql.columnNameOfCorruptRecord</code></td>
+      <td>Allows renaming the new field having a malformed string created by 
PERMISSIVE mode.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>attributePrefix</code></td>
+    <td>_</td>
+    <td>The prefix for attributes to differentiate attributes from elements. 
This will be the prefix for field names. Default is _. Can be empty for reading 
XML, but not for writing.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>valueTag</code></td>
+    <td>_VALUE</td>
+    <td>The tag used for the value when there are attributes in an element 
that has no child elements.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>encoding</code></td>
+    <td><code>UTF-8</code></td>
+    <td>For reading, decodes the XML files by the given encoding type. For 
writing, specifies encoding (charset) of saved XML files. XML built-in 
functions ignore this option. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>ignoreSurroundingSpaces</code></td>
+    <td>false</td>
+    <td>Defines whether surrounding whitespace in values being read should 
be skipped.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>rowValidationXSDPath</code></td>
+      <td>null</td>
+      <td>Path to an optional XSD file that is used to validate the XML for 
each row individually. Rows that fail to validate are treated like parse errors 
as above. The XSD does not otherwise affect the schema provided, or 
inferred.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>ignoreNamespace</code></td>
+      <td>false</td>
+      <td>If true, namespace prefixes on XML elements and attributes are 
ignored. Tags &lt;abc:author> and &lt;def:author> would, for example, be 
treated as if both are just &lt;author>. Note that, at the moment, namespaces 
cannot be ignored on the rowTag element, only its children. Note that XML 
parsing is in general not namespace-aware even if this is false.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>timeZone</code></td>
+    <td>(value of <code>spark.sql.session.timeZone</code> configuration)</td>
+    <td>Sets the string that indicates a time zone ID to be used to format 
timestamps in the XML datasources or partition values. The following formats 
of <code>timeZone</code> are supported:<br>
+    <ul>
+      <li>Region-based zone ID: It should have the form 'area/city', such as 
'America/Los_Angeles'.</li>
+      <li>Zone offset: It should be in the format '(+|-)HH:mm', for example 
'-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of 
'+00:00'.</li>
+    </ul>
+    Other short names like 'CST' are not recommended because they can 
be ambiguous.
+    </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>timestampFormat</code></td>
+    <td><code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code></td>
+    <td>Custom timestamp format string that follows the datetime pattern 
format. This applies to timestamp type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>dateFormat</code></td>
+    <td><code>yyyy-MM-dd</code></td>
+    <td>Custom date format string that follows the datetime pattern format. 
This applies to date type.</td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+    <td><code>locale</code></td>
+    <td><code>en-US</code></td>
+    <td>Sets a locale as a language tag in IETF BCP 47 format. For instance, 
locale is used while parsing dates and timestamps. </td>
+    <td>read/write</td>
+  </tr>
+
+  <tr>
+      <td><code>rootTag</code></td>
+      <td>ROWS</td>
+      <td>Root tag of your XML files. For example, in <code>&lt;books&gt; 
&lt;book&gt;&lt;/book&gt; ...&lt;/books&gt;</code>, the appropriate value would 
be <code>books</code>. It can 
include basic attributes by specifying a value like <code>books 
foo="bar"</code>.</td>
+      <td>read</td>
+  </tr>
+
+  <tr>
+      <td><code>declaration</code></td>
+      <td><code>version="1.0" encoding="UTF-8" standalone="yes"</code></td>
+      <td>Content of XML declaration to write at the start of every output XML 
file, before the rootTag. For example, a value of <code>foo</code> causes 
<code>&lt;?xml foo?&gt;</code> to be 
written. Set to an empty string to suppress it.</td>
+      <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>arrayElementName</code></td>
+    <td>item</td>
+    <td>Name of XML element that encloses each element of an array-valued 
column when writing.</td>
+    <td>write</td>
+  </tr>
+
+  <tr>
+    <td><code>nullValue</code></td>
+    <td>null</td>
+    <td>Sets the string representation of a null value. Default is the string 
<code>null</code>. When this is <code>null</code>, it does not write attributes 
and elements for fields.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>wildcardColName</code></td>
+    <td>xs_any</td>
+    <td>Name of a column existing in the provided schema which is interpreted 
as a 'wildcard'. It must have type string or array of strings. It will match 
any XML child element that is not otherwise matched by the schema. The XML of 
the child becomes the string value of the column. If an array, then all 
unmatched elements will be returned as an array of strings. As its name 
implies, it is meant to emulate XSD's xs:any type.</td>
+    <td>read</td>
+  </tr>
+
+  <tr>
+    <td><code>compression</code></td>
+    <td>none</td>
+    <td>Compression codec to use when saving to file. This can be one of the 
known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and 
deflate). XML built-in functions ignore this option.</td>
+    <td>read</td>

Review Comment:
   ```suggestion
       <td>write</td>
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

