[ 
https://issues.apache.org/jira/browse/SPARK-43999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17737045#comment-17737045
 ] 

Jia Fan commented on SPARK-43999:
---------------------------------

> When do you plan to fix this behaviour?

 

I'm doing this.

> Data is still fetched even though result was returned
> -----------------------------------------------------
>
>                 Key: SPARK-43999
>                 URL: https://issues.apache.org/jira/browse/SPARK-43999
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.4.0
>         Environment: Production
>            Reporter: Kamil Kliczbor
>            Priority: Major
>         Attachments: Profiler.PNG
>
>
> h2. Short problem description:
> I have two tables:
>  * tab1 is empty
>  * tab6 has milions of records
>  * when Spark returns results due to empty database table tab1, it still asks 
> for the tab6 data
> When I create the query that uses LEFT JOIN, the results are returned 
> immediately, however under the hood the huge table is requested to return the 
> results anyway.
> h2. Repro:
> h3. Prepare the MSSQL server database 
> 1. Install SQLExpress (in my case MSSQL2012, but can be any version) as a 
> named instance SQL2012.
> 2. Download and install 
> [SSMS|https://learn.microsoft.com/en-us/sql/ssms/download-sql-server-management-studio-ssms?view=sql-server-ver16]
>  (or any other tool) and run the following Query
> {code:sql}
> USE [master]
> GO
> CREATE DATABASE QueueSlots
> GO
> CREATE LOGIN [spark] WITH PASSWORD=N'spark', DEFAULT_DATABASE=[master], 
> CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF
> GO
> USE [QueueSlots]
> GO
> CREATE USER [spark] FOR LOGIN [spark] WITH DEFAULT_SCHEMA=[dbo]
> GO
> {code}
> 3. Then create the tables and fill the tab6 with the data:
> {code:sql}
> CREATE TABLE tab1 (Id INT, Name NVARCHAR(50))
> CREATE TABLE tab6 (Id INT, Name NVARCHAR(50))
> insert into tab6
> select o1.object_id as Id , o1.name as Name
> from sys.objects as o1 
> cross join sys.objects as o2
> cross join sys.objects as o3
> cross join sys.objects as o4
> -- it might be required to increase the numer of the cross joins to increase 
> the number of the rows, approximately 1 mln is enough - select should take 
> several seconds
> {code}
> h3. Prepare Spark
>  # Download mssql jdbc driver in version 12.2.0
>  # Run spark-shell2.cmd with the settings -cp 
> "<your_path>/lib/sqljdbc/12.2/mssql-jdbc-12.2.0.jre8.jar"
> h3. Create temporary views on Spark
> {code:java}
> val sqlContext = new org.apache.spark.sql.SQLContext(sc)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab1
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab1',
>   user 'spark',
>   password 'spark'
> )
> """)
> sqlContext.sql("""
> CREATE TEMPORARY VIEW tab6
> USING org.apache.spark.sql.jdbc
> OPTIONS (
>   driver 'com.microsoft.sqlserver.jdbc.SQLServerDriver',
>   url 
> 'jdbc:sqlserver://;serverName=localhost;instanceName=sql2012;databaseName=QueueSlots;encrypt=true;trustServerCertificate=true;',
>   
>   dbtable 'dbo.Tab6',
>   user 'spark',
>   password 'spark'
> )
> """)
> {code}
> h3. Enable SQL Server Profiler tracing
>  # Go to SSMS and open Sql Server Profiler (Tools -> Sql Server Profiler). 
> Create new trace to the "QueueSlots" database. Use filtering options to see 
> only queries issued for that database (Events Selection tab -> check Show all 
> events and Show all columns, then click Column Filters -> DatabaseName like 
> QueueSlots).
>  # Run the trace
> h3. Run the query in Spark console
>  # Run the following query
> {code:java}
> sqlContext.sql("""
> SELECT t1.Id, t1.Name, t6.Name
>   FROM tab1 as t1
>   LEFT OUTER JOIN tab6 AS t6 ON t6.Id = t1.Id
> """).show
> {code}
> The results are returned immediately as:
> {code:java}
> +---+----+----+
> | Id|Name|Name|
> +---+----+----+
> +---+----+----+
> [Stage 63:>                                                         (0 + 1) / 
> 1]
> {code}
> h3. {color:#00875a}Expected{color}
> As the results are returned immediately for empty table, another sources are 
> not queried.
> h3. {color:#de350b}Given:{color}
> The table6 is requested to return the data even though it is not being used 
> and it is CPU and IO consuming operation.
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to