Re: [PR] [SPARK-44752][SQL] XML: Update Spark Docs [spark]

via GitHub Thu, 12 Oct 2023 21:45:29 -0700


sandip-db commented on code in PR #43350:
URL: https://github.com/apache/spark/pull/43350#discussion_r1357498046



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+

Review Comment:
   Add a Python example.



##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<ROWSET>

Review Comment:
   ```suggestion
   <people>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.

Review Comment:
   ```suggestion
   When reading an XML file, the `rowTag` option must be specified to indicate the XML element that maps to a `DataFrame` row. The `option()` function can be used to customize the behavior of reading or writing, such as controlling the handling of XML attributes, XSD validation, compression, and so on.
   ```
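
To make the role of `rowTag` concrete, here is a minimal plain-Python sketch (standard library only, not Spark itself) of the mapping the suggested wording describes: every element whose tag equals the row tag becomes one row, and its child elements become that row's columns. The XML content and the helper `rows_for_tag` are hypothetical and only for illustration; Spark's actual reader also infers types, handles attributes, and so on.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the real people.xml in the PR may differ.
PEOPLE_XML = """<?xml version="1.0"?>
<people>
    <person><name>Michael</name><age>29</age></person>
    <person><name>Andy</name><age>30</age></person>
    <person><name>Justin</name><age>19</age></person>
</people>"""


def rows_for_tag(xml_text, row_tag):
    """Return one dict per element named row_tag (illustrative helper)."""
    root = ET.fromstring(xml_text)
    # Each matching element is a row; its direct children are the columns.
    return [
        {child.tag: child.text for child in elem}
        for elem in root.iter(row_tag)
    ]


rows = rows_for_tag(PEOPLE_XML, "person")
for row in rows:
    print(row)
```

In this sketch, a row tag that does not occur in the file (for example `ROW` after the elements are renamed to `person`) yields no rows, which is why the examples should set `rowTag` to match the sample data.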



##########
examples/src/main/resources/people.xml:
##########
@@ -0,0 +1,15 @@
+<?xml version="1.0"?>
+<ROWSET>
+    <ROW>

Review Comment:
   ```suggestion
       <person>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:

Review Comment:
   ```suggestion
   Data source options of XML can be set via:
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("

Review Comment:
   ```suggestion
   Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read a file or directory of files in XML format into a Spark DataFrame, and dataframe.write().xml("
   ```



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
      .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)

Review Comment:
   ```suggestion
       val peopleDF = spark.read.option("rowTag", "person").xml(path)
   ```



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
      .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for a XML dataset represented by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<ROW>

Review Comment:
   ```suggestion
           |<person>
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_json`
+    * `to_json`
+    * `schema_of_json`

Review Comment:
   ```suggestion
       * `from_xml`
       * `to_xml`
       * `schema_of_xml`
   ```



##########
docs/sql-data-sources-xml.md:
##########
@@ -0,0 +1,222 @@
+---
+layout: global
+title: XML Files
+displayTitle: XML Files
+license: |
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+---
+
+Spark SQL provides spark.read().xml("file_1_path","file_2_path") to read one or more xml files into a Spark DataFrame, and dataframe.write().xml("
+path") to write to a xml file.
+When reading a text file, each line becomes each row that has string “value” column by default. The line separator can be changed as shown in the
+example below. The option() function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator,
+compression, and so on.
+
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% include_example xml_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% include_example xml_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
+</div>
+
+</div>
+
+## Data Source Option
+
+Data source options of JSON can be set via:
+
+* the `.option`/`.options` methods of
+    * `DataFrameReader`
+    * `DataFrameWriter`
+    * `DataStreamReader`
+    * `DataStreamWriter`
+* the built-in functions below
+    * `from_json`
+    * `to_json`
+    * `schema_of_json`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table table-striped">

Review Comment:
   Please update the table according to the information provided here:
   
   Option | Description | Scope
   --- | --- | ---
   rowTag | The row tag of your XML files to treat as a row. For example, in this XML: `<books> <book></book> ...</books>`, the appropriate value would be `book`. Default: `ROW` | read
   samplingRatio | Defines the fraction of rows used for schema inference. XML built-in functions ignore this option. Default is `1.0`. | read
   excludeAttribute | Whether to exclude attributes in elements. Default: `false` | read
   mode | Allows a mode for dealing with corrupt records during parsing.<br>`PERMISSIVE`: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to null. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` field in an output schema.<br>`DROPMALFORMED`: ignores corrupted records entirely. This mode is unsupported in the XML built-in functions.<br>`FAILFAST`: throws an exception when it meets corrupted records. | read
   inferSchema | If `true`, attempts to infer an appropriate type for each resulting DataFrame column. If `false`, all resulting columns are of string type. Default is `true`. XML built-in functions ignore this option. | read
   columnNameOfCorruptRecord | Allows renaming the new field having a malformed string created by `PERMISSIVE` mode. Default: `spark.sql.columnNameOfCorruptRecord` | read
   attributePrefix | The prefix for attributes to differentiate attributes from elements. This will be the prefix for field names. Default is `_`. Can be empty for reading XML, but not for writing. | read / write
   valueTag | The tag used for the value when there are attributes in the element having no child. Default is `_VALUE`. | read / write
   encoding | For reading, decodes the XML files by the given encoding type. For writing, specifies encoding (charset) of saved XML files. XML built-in functions ignore this option. Default is `UTF-8` | read / write
   ignoreSurroundingSpaces | Defines whether surrounding whitespaces from values being read should be skipped. Default is `false`. | read
   rowValidationXSDPath | Path to an optional XSD file that is used to validate the XML for each row individually. Rows that fail to validate are treated like parse errors as above. The XSD does not otherwise affect the schema provided, or inferred. | read
   ignoreNamespace | If `true`, namespace prefixes on XML elements and attributes are ignored. Tags `<abc:author>` and `<def:author>` would, for example, be treated as if both are just `<author>`. Note that, at the moment, namespaces cannot be ignored on the rowTag element, only its children. Note that XML parsing is in general not namespace-aware even if `false`. Defaults to `false`. | read
   timeZone | (Defaults to `spark.sql.session.timeZone` configuration)<br>Sets the string that indicates a time zone ID to be used to format timestamps in the XML data sources or partition values. The following formats of `timeZone` are supported:<br>    <ul>      <li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>      <li>Zone offset: It should be in the format '(+\|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>    </ul>    Other short names like 'CST' are not recommended to use because they can be ambiguous. | read / write
   timestampFormat | Custom timestamp format string that follows the datetime pattern format. This applies to timestamp type. Default: `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` | read / write
   dateFormat | Custom date format string that follows the datetime pattern format. This applies to date type. Default: `yyyy-MM-dd` | read / write
   locale | Sets a locale as a language tag in IETF BCP 47 format. For instance, locale is used while parsing dates and timestamps. Default: `en-US` | read
   rootTag | Root tag of the XML files. For example, in `<books> <book></book> ...</books>`, the appropriate value would be `books`. It can include basic attributes by specifying a value like `books foo="bar"`. Default is `ROWS`. | write
   declaration | Content of XML declaration to write at the start of every output XML file, before the rootTag. For example, a value of `foo` causes `<?xml foo?>` to be written. Set to empty string to suppress. Defaults to `version="1.0" encoding="UTF-8" standalone="yes"`. | write
   arrayElementName | Name of XML element that encloses each element of an array-valued column when writing. Default is `item` | write
   nullValue | Sets the string representation of a `null` value. Default is string `null`. When this is `null`, it does not write attributes and elements for fields. | read / write
   wildcardColName | Name of a column existing in the provided schema which is interpreted as a 'wildcard'. It must have type string or array of strings. It will match any XML child element that is not otherwise matched by the schema. The XML of the child becomes the string value of the column. If an array, then all unmatched elements will be returned as an array of strings. As its name implies, it is meant to emulate XSD's `xs:any` type. Default is `xs_any`. | read
   compression | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (`none`, `bzip2`, `gzip`, `lz4`, `snappy` and `deflate`). XML built-in functions ignore this option. Default: `none` | write
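
As a rough illustration of the `attributePrefix` and `valueTag` rows above, the following plain-Python sketch (standard library only, not Spark's parser) shows the naming convention the table describes for an element that has attributes but no child elements. The helper `element_to_fields` and the sample `<price>` element are hypothetical; the defaults `_` and `_VALUE` are taken from the table.

```python
import xml.etree.ElementTree as ET


def element_to_fields(elem, attribute_prefix="_", value_tag="_VALUE"):
    """Flatten a childless element with attributes into named fields.

    Illustrative only: attributes get the prefix, the text goes under value_tag.
    """
    # Attributes become fields named <prefix><attribute name>.
    fields = {attribute_prefix + name: value for name, value in elem.attrib.items()}
    # The element's own text is stored under the value tag.
    if elem.text is not None and elem.text.strip():
        fields[value_tag] = elem.text.strip()
    return fields


price = ET.fromstring('<price unit="USD">9.99</price>')
fields = element_to_fields(price)
print(fields)
```

Under these assumptions the attribute `unit` appears as the field `_unit` and the element text appears under `_VALUE`, which is the distinction the prefix exists to preserve.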



##########
examples/src/main/scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala:
##########
@@ -418,4 +419,53 @@ object SQLDataSourceExample {
      .jdbc("jdbc:postgresql:dbserver", "schema.tablename", connectionProperties)
     // $example off:jdbc_dataset$
   }
+
+  private def runXmlDatasetExample(spark: SparkSession): Unit = {
+    // $example on:xml_dataset$
+    // Primitive types (Int, String, etc) and Product types (case classes) encoders are
+    // supported by importing this when creating a Dataset.
+    import spark.implicits._
+    // An XML dataset is pointed to by path.
+    // The path can be either a single xml file or more xml files
+    val path = "examples/src/main/resources/people.xml"
+    val peopleDF = spark.read.xml(path)
+
+    // The inferred schema can be visualized using the printSchema() method
+    peopleDF.printSchema()
+    // root
+    //  |-- age: long (nullable = true)
+    //  |-- name: string (nullable = true)
+
+    // Creates a temporary view using the DataFrame
+    peopleDF.createOrReplaceTempView("people")
+
+    // SQL statements can be run by using the sql methods provided by spark
+    val teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
+    teenagerNamesDF.show()
+    // +------+
+    // |  name|
+    // +------+
+    // |Justin|
+    // +------+
+
+    // Alternatively, a DataFrame can be created for a XML dataset represented by a Dataset[String]
+    val otherPeopleDataset = spark.createDataset(
+      """
+        |<ROW>
+        |    <name>laglangyue</name>
+        |    <job>Developer</job>
+        |    <age>28</age>
+        |</ROW>
+        |""".stripMargin :: Nil)
+    val otherPeople = spark.read
+      .option("rowTag", "ROW")

Review Comment:
   ```suggestion
         .option("rowTag", "person")
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

