[
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737041#comment-17737041
]
Kamil Kliczbor edited comment on SPARK-43999 at 6/26/23 8:23 AM:
-----------------------------------------------------------------
[~fanjia] , thanks for the feedback. When do you plan to fix this behaviour?
was (Author: JIRAUSER300774):
[~fanjia] , thanks for the feedback. Where do you plan to fix this behaviour?
> Data is still fetched even though result was returned
> -----------------------------------------------------
>
> Key: SPARK-43999
> URL: https://issues.apache.org/jira/browse/SPARK-43999
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.4.0
> Environment: Production
> Reporter: Kamil Kliczbor
> Priority: Major
> Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
> * tab1 is empty
> * tab6 has milions of records
> * when Spark returns results due to empty database table tab1, it still asks
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned
> immediately, however under the hood the huge table is requested to return the
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a
> named instance SQL2012.
> 2. Download and install
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
> (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master],
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase
> the number of the rows, approximately 1 mln is enough - select should take
> several seconds
> {code}
> h3. Prepare Spark
> # Download mssql jdbc driver in version 12.2.0
> # Run spark-shell2.cmd with the settings -cp
> "<your_path>/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
> url
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>
> dbtable 'dbo.Tab1',
> user 'spark',
> password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
> driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
> url
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>
> dbtable 'dbo.Tab6',
> user 'spark',
> password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
> # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler).
> Create new trace to the "QueueSlots" database. Use filtering options to see
> only queries issued for that database (Events Selection tab -> check Show all
> events and Show all columns, then click Column Filters -> DatabaseName like
> QueueSlots).
> # Run the trace
> h3. Run the query in Spark console
> # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
> FROM tab1 as t1
> LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+----+----+
> | Id|Name|Name|
> +---+----+----+
> +---+----+----+
> [Stage 63:> (0 + 1) /
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used
> and it is CPU and IO consuming operation.
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]