[
https://issues.apache.org/jira/browse/SPARK-26739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dongjoon Hyun updated SPARK-26739:
----------------------------------
Affects Version/s: (was: 2.4.0)
3.0.0
> Standardized Join Types for DataFrames
> --------------------------------------
>
> Key: SPARK-26739
> URL: https://issues.apache.org/jira/browse/SPARK-26739
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.0.0
> Reporter: Skyler Lehan
> Priority: Minor
> Original Estimate: 48h
> Remaining Estimate: 48h
>
> h3. *Q1.* What are you trying to do? Articulate your objectives using
> absolutely no jargon.
> Currently, in the join functions on
> [DataFrames|http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Dataset],
> the join types are defined via a string parameter called joinType. In order
> for a developer to know which joins are possible, they must look up the API
> call for join. While this works fine, it can cause the developer to make a
> typo resulting in improper joins and/or unexpected errors that aren't evident
> at compile time. The objective of this improvement would be to allow
> developers to use a common definition for join types (by enum or constants)
> called JoinTypes. This would contain the possible joins and remove the
> possibility of a typo. It would also allow Spark to alter the names of the
> joins in the future without impacting end-users.
> h3. *Q2.* What problem is this proposal NOT designed to solve?
> The problem this proposal solves is extremely narrow; it would not solve
> anything other than providing a common definition for join types.
> h3. *Q3.* How is it done today, and what are the limits of current practice?
> Currently, developers must join two DataFrames like so:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_outer")
> {code}
> Where they manually type the join type as a string. As stated above, this:
> * Requires developers to manually type in the join type
> * Leaves room for typos
> * Restricts renaming of join types, since the name is a literal string in
> user code
> * Does not restrict or compile-time check the join type being used, leading
> to runtime errors
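> For example, a hypothetical misspelling compiles cleanly and only fails once
> the query is analyzed at runtime (the exact exception and message depend on
> the Spark version):
> {code:java}
> // "left_oter" is a deliberate typo: the compiler accepts it, and Spark
> // rejects it at runtime with an error listing the supported join types.
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), "left_oter")
> {code}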
> h3. *Q4.* What is new in your approach and why do you think it will be
> successful?
> The new approach would use constants or, *preferably, an enum*, something
> like this:
> {code:java}
> val resultDF = leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
> {code}
> This would provide:
> * In-code references/definitions of the possible join types
> ** This in turn allows adding Scaladoc describing what each join type does
> and how it operates
> * Removes the possibility of a typo in the join type
> * Provides compile-time checking of the join type (only if an enum is used)
> To clarify, if JoinType is a set of constants, it would just fill in the
> joinType string parameter for users. If an enum is used, it would restrict
> the domain of possible join types to whatever is defined in the future
> JoinType enum. The enum is preferred; however, it would take longer to
> implement.
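> As a rough sketch, an idiomatic Scala encoding of such an enum could use a
> sealed class with case objects (the names and the sql field below are
> illustrative assumptions, not a finalized API):
> {code:java}
> // Sketch only: each case object carries the string Spark accepts today,
> // so the enum can delegate to the existing string-based machinery.
> sealed abstract class JoinType(val sql: String)
> object JoinType {
>   case object Inner      extends JoinType("inner")
>   case object LeftOuter  extends JoinType("left_outer")
>   case object RightOuter extends JoinType("right_outer")
>   case object FullOuter  extends JoinType("full_outer")
>   case object LeftSemi   extends JoinType("left_semi")
>   case object LeftAnti   extends JoinType("left_anti")
>   case object Cross      extends JoinType("cross")
> }
> {code}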
> h3. *Q5.* Who cares? If you are successful, what difference will it make?
> Developers using Apache Spark will care. This will make the join function
> easier to use and lead to fewer runtime errors, since join type validation
> moves to compile time. It will also provide an in-code reference to the join
> types, saving developers the time of looking up and navigating the multiple
> join functions to find the possible join types. In addition, the resulting
> constants/enum would carry documentation on how each join type works.
> h3. *Q6.* What are the risks?
> Users of Apache Spark who currently use strings to define their join types
> could be impacted if an enum is chosen as the common definition. This risk
> can be mitigated by using string constants instead: the constants would hold
> exactly the same strings as the literals used today. For example:
> {code:java}
> JoinType.INNER = "inner"
> {code}
> If an enum is still the preferred way of defining the join types, new join
> functions could be added that take these enums, and the existing join calls
> that take a string joinType parameter could be deprecated. This would give
> developers time to migrate to the new join types.
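> The migration path could be sketched as follows (assuming a JoinType enum
> with a sql field holding today's string; this is illustrative, not a
> finalized signature):
> {code:java}
> // The new overload simply delegates to the existing string-based join,
> // which stays in place but would carry a @deprecated annotation.
> def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame =
>   join(right, joinExprs, joinType.sql)
> {code}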
> h3. *Q7.* How long will it take?
> A few days for a seasoned Spark developer.
> h3. *Q8.* What are the mid-term and final "exams" to check for success?
> The mid-term exam would be the addition of a common definition of the join
> types and of additional join functions that take the join type enum/constant.
> The final exam would be working tests that check the functionality of these
> new join functions, together with deprecation of the join functions that take
> a string for joinType.
> h3. *Appendix A.* Proposed API Changes. Optional section defining API
> changes, if any. Backward and forward compatibility must be taken into
> account.
> {color:#FF0000}*It is heavily recommended that enums, and not string
> constants, be used.*{color} String constants are presented as a possible
> solution, but not the ideal one.
> *If enums are used (preferred):*
> The following join function signatures would be added to the Dataset API:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: JoinType): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: JoinType): DataFrame
> {code}
> The following functions would be deprecated:
> {code:java}
> def join(right: Dataset[_], joinExprs: Column, joinType: String): DataFrame
> def join(right: Dataset[_], usingColumns: Seq[String], joinType: String): DataFrame
> {code}
> A new enum called JoinType would be created. Developers would be encouraged
> to adopt JoinType in place of the literal strings.
> *If string constants are used:*
> There would be no current API changes; instead, a new Scala object with
> string constants would be defined like so:
> {code:java}
> object JoinType {
>   final val INNER: String = "inner"
>   final val LEFT_OUTER: String = "left_outer"
> }
> {code}
> This approach would not allow for compile time checking of the join types.
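> To illustrate that limitation (assuming the JoinType object above; the typo
> is hypothetical): with constants, the call site looks the same as with an
> enum, but the compiler cannot reject an arbitrary string:
> {code:java}
> // Accepted: the constant expands to "left_outer" at the call site.
> leftDF.join(rightDF, col("ID") === col("RightID"), JoinType.LEFT_OUTER)
> // Also accepted by the compiler; fails only at runtime.
> leftDF.join(rightDF, col("ID") === col("RightID"), "left_oter")
> {code}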
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]