Hi Joseph,

Thanks for your explanation. It makes a lot of sense, and I found http://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases which gives more detail.
With that, and after reviewing the code, the customSchema option simply overrides the data types of the fields in a relation schema [1][2]. I think the option should have a different name, with the word "override" in it, to convey its exact meaning, shouldn't it?

With that said, I think the description of the customSchema option may be slightly incorrect. For example, it says: "The custom schema to use for reading data from JDBC connectors", and although it is used for reading, it merely overrides the data types and may not match the fields at all, which makes no difference. Is that correct? It's only in the following sentence that the word "type" appears: "You can also specify partial fields, and the others use the default type mapping." But that begs another question: what is "the default type mapping"? That was one of my questions when I first found the option.

What do you think about the following description of the customSchema option? You're welcome to make further changes if needed.

====
customSchema - Specifies the custom data types of the read schema (that is used at load time). customSchema is a comma-separated list of field definitions with column names and their data types in a canonical SQL representation, e.g. id DECIMAL(38, 0), name STRING. customSchema defines the data types of the columns that override the data types inferred from the table schema, and follows the pattern:

colTypeList
    : colType (',' colType)*
    ;

colType
    : identifier dataType (COMMENT STRING)?
    ;

dataType
    : complex=ARRAY '<' dataType '>'                            #complexDataType
    | complex=MAP '<' dataType ',' dataType '>'                 #complexDataType
    | complex=STRUCT ('<' complexColTypeList? '>' | NEQ)        #complexDataType
    | identifier ('(' INTEGER_VALUE (',' INTEGER_VALUE)* ')')?  #primitiveDataType
    ;
====
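Just to make sure we're talking about the same usage, here is a minimal sketch of how I understand the option to be used (the JDBC URL and table name below are made up for the example; the customSchema value is the one from the docs):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("customSchema demo")
  .master("local[*]")
  .getOrCreate()

// Hypothetical JDBC URL and table; only the customSchema option matters here.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/testdb")
  .option("dbtable", "people")
  // Overrides the data types inferred from the table schema for the listed
  // columns; columns not listed keep the default JDBC-to-Catalyst type mapping.
  .option("customSchema", "id DECIMAL(38, 0), name STRING")
  .load()

df.printSchema()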
Should I file a JIRA task for this?

[1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala?utf8=%E2%9C%93#L116-L118
[2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L785-L788

Pozdrawiam,
Jacek Laskowski
----
https://about.me/JacekLaskowski
Mastering Spark SQL https://bit.ly/mastering-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski

On Mon, Jul 16, 2018 at 4:27 PM, Joseph Torres <joseph.tor...@databricks.com> wrote:

> I guess the question is partly about the semantics of DataFrameReader.schema.
> If it's supposed to mean "the loaded dataframe will definitely have exactly
> this schema", that doesn't quite match the behavior of the customSchema
> option. If it's only meant to be an arbitrary schema input which the source
> can interpret however it wants, it'd be fine.
>
> The second semantic is IMO more useful, so I'm in favor here.
>
> On Mon, Jul 16, 2018 at 3:43 AM, Jacek Laskowski <ja...@japila.pl> wrote:
>
>> Hi,
>>
>> I think there is a sort of inconsistency in how DataFrameReader.jdbc deals
>> with a user-defined schema, as it makes sure that there's no user-specified
>> schema [1][2] yet allows for setting one using the customSchema option [3].
>> Why is that so? Has this been merely overlooked or similar?
>>
>> I think assertNoSpecifiedSchema should be removed from DataFrameReader.jdbc
>> and support for DataFrameReader.schema for jdbc should be added (with the
>> customSchema option marked as deprecated, to be removed in 2.4 or 3.0).
>>
>> Should I file an issue in Spark JIRA and do the changes? WDYT?
>>
>> [1] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L249
>> [2] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala?utf8=%E2%9C%93#L320
>> [3] https://github.com/apache/spark/blob/v2.3.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala#L167
>>
>> Pozdrawiam,
>> Jacek Laskowski
>> ----
>> https://about.me/JacekLaskowski
>> Mastering Spark SQL https://bit.ly/mastering-spark-sql
>> Spark Structured Streaming https://bit.ly/spark-structured-streaming
>> Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
>> Follow me at https://twitter.com/jaceklaskowski