[ https://issues.apache.org/jira/browse/SPARK-26203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717776#comment-16717776 ]

ASF GitHub Bot commented on SPARK-26203:
----------------------------------------

aokolnychyi opened a new pull request #23291: [SPARK-26203][SQL] Benchmark 
performance of In and InSet expressions
URL: https://github.com/apache/spark/pull/23291
 
 
   ## What changes were proposed in this pull request?
   
   This PR contains benchmarks for `In` and `InSet` expressions. They cover 
literals of different data types and will help us to decide where to integrate 
the switch-based logic for bytes/shorts/ints.
   
   As discussed in [PR-23171](https://github.com/apache/spark/pull/23171), one 
potential approach is to convert `In` to `InSet` whenever all elements are 
literals, regardless of the data type and the number of elements. According to 
the results of this PR, we might want to keep the threshold on the number of 
elements: the if-else approach might be faster for some data types with a 
small number of elements (structs? arrays? small decimals?).
   
   The execution time for all benchmarks is around 4 minutes.
   
   ### byte / short / int / long
   
   Unless the number of elements is large, `InSet` is slower than `In` because 
of autoboxing.
   
   Interestingly, `In` scales worse on bytes/shorts than on ints/longs. For 
example, `InSet` starts to match `In`'s performance at around 50 bytes/shorts, 
while this does not happen with the same number of ints/longs. This is a bit 
strange as bytes/shorts (e.g., `(byte) 1`, `(short) 2`) are represented as ints 
in the bytecode.
   
   ### float / double
   
   Use cases on floats/doubles also suffer from autoboxing, so `In` outperforms 
`InSet` on 10 elements.
   
   Similarly to bytes/shorts, `In` scales worse on floats/doubles than on 
ints/longs because the equality condition is more complicated (e.g., 
`(java.lang.Float.isNaN(filter_valueArg_0) && java.lang.Float.isNaN(9.0F)) || 
filter_valueArg_0 == 9.0F`).
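   The quoted condition amounts to a NaN-aware equality check like the one below. The variable names in the quote come from Spark's codegen; the helper here is a hypothetical illustration of why each float comparison costs more than a plain int/long comparison.

   ```java
   public class FloatEquality {
       // NaN-aware equality: two NaNs compare equal, otherwise fall back to
       // primitive ==. Note that NaN == NaN is false in plain Java, which is
       // why the extra isNaN branch is needed.
       static boolean floatEquals(float a, float b) {
           return (Float.isNaN(a) && Float.isNaN(b)) || a == b;
       }
   }
   ```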
   
   ### decimal
   
   The reason why we have separate benchmarks for small and large decimals is 
that Spark might use longs to represent decimals in some cases.
   
   If this optimization kicks in, `equals` amounts to nothing more than 
comparing two longs. If it does not, Spark creates an instance of 
`scala.BigDecimal` and uses it for comparisons, which is more expensive.
   
   `Decimal$hashCode` will always use `scala.BigDecimal$hashCode` even if the 
number is small enough to fit into a long variable. As a consequence, use cases 
on small decimals are faster with `In`, which relies on long comparisons under 
the hood. Large decimal values are always faster with `InSet`.
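   The two comparison paths can be sketched as follows. This is an assumption-laden illustration, not Spark's actual `Decimal` class: it only contrasts the cost of comparing scaled longs against comparing `BigDecimal` instances.

   ```java
   import java.math.BigDecimal;

   public class DecimalCompare {
       // Small decimals that fit into a long can be compared as unscaled longs
       // (assuming equal scales): a single cheap primitive comparison.
       static boolean equalsCompact(long unscaledA, long unscaledB) {
           return unscaledA == unscaledB;
       }

       // Large decimals fall back to BigDecimal. compareTo ignores scale
       // differences (1.10 equals 1.1) but is considerably more expensive.
       static boolean equalsBig(BigDecimal a, BigDecimal b) {
           return a.compareTo(b) == 0;
       }
   }
   ```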
   
   ### string
   
   `UTF8String$equals` is not cheap, so `In` does not outperform `InSet` as 
clearly as in the previous use cases.
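   The reason is that string equality is byte-wise, so every probe costs O(length) rather than a constant-time primitive comparison. A simplified sketch of such byte-wise equality (not Spark's actual `UTF8String` implementation):

   ```java
   import java.util.Arrays;

   public class Utf8Equals {
       // Byte-wise equality: cost grows with string length, so neither In's
       // linear scan nor InSet's hash probe benefits from cheap primitive
       // comparisons the way int/long columns do.
       static boolean bytesEqual(byte[] a, byte[] b) {
           return a.length == b.length && Arrays.equals(a, b);
       }
   }
   ```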
   
   ### timestamp / date
   
   Under the hood, timestamp/date values are represented as long/int values, 
so `In` allows us to avoid autoboxing.
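   For example, Spark stores `DateType` as days since the epoch, so a date membership check degenerates to int comparisons. A hypothetical illustration (class and method names are not Spark's):

   ```java
   import java.time.LocalDate;

   public class DateAsInt {
       // DateType is stored as the number of days since 1970-01-01.
       static int toEpochDay(LocalDate d) {
           return (int) d.toEpochDay();
       }

       // "date IN (d1, d2)" then compiles down to primitive int equality.
       static boolean dateIn(int value, int d1, int d2) {
           return value == d1 || value == d2;
       }
   }
   ```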
   
   ### array
   
   Arrays work as expected: `In` is faster on 5 elements, while `InSet` is 
faster on 15 elements. The benchmarks use `UnsafeArrayData`.
   
   ### struct
   
   `InSet` is always faster than `In` for structs. These benchmarks use 
`GenericInternalRow`.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Benchmark performance of In and InSet expressions
> -------------------------------------------------
>
>                 Key: SPARK-26203
>                 URL: https://issues.apache.org/jira/browse/SPARK-26203
>             Project: Spark
>          Issue Type: Test
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Anton Okolnychyi
>            Priority: Major
>
> The {{OptimizeIn}} rule replaces {{In}} with {{InSet}} if the number of possible 
> values exceeds "spark.sql.optimizer.inSetConversionThreshold" and all values 
> are literals. This was done for performance reasons to avoid O(n) time 
> complexity for {{In}}.
> The original optimization was done in SPARK-3711. A lot has changed since 
> then (e.g., generation of Java code to evaluate expressions), so it is worth 
> measuring the performance of this optimization again.
> According to my local benchmarks, {{InSet}} can be up to 10x slower than 
> {{In}} due to autoboxing and other issues.
> The scope of this JIRA is to benchmark every supported data type inside 
> {{In}} and {{InSet}} and outline existing bottlenecks. Once we have this 
> information, we can come up with solutions. 
> Based on my preliminary investigation, we can apply a number of 
> optimizations, many of which depend on the specific data type.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
