[ 
https://issues.apache.org/jira/browse/SPARK-19843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900773#comment-15900773
 ] 

Apache Spark commented on SPARK-19843:
--------------------------------------

User 'tejasapatil' has created a pull request for this issue:
https://github.com/apache/spark/pull/17205

> UTF8String => (int / long) conversion expensive for invalid inputs
> ------------------------------------------------------------------
>
>                 Key: SPARK-19843
>                 URL: https://issues.apache.org/jira/browse/SPARK-19843
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>             Fix For: 2.2.0
>
>
> In case of invalid inputs, converting a UTF8String to int or long returns 
> null. This comes at a cost wherein the method for conversion (e.g [0]) would 
> throw an exception. Exception handling is expensive as it will convert the 
> UTF8String into a java string, populate the stack trace (which is a native 
> call). While migrating workloads from Hive -> Spark, I see that this at an 
> aggregate level affects the performance of queries in comparison with hive.
> The exception is just indicating that the conversion failed.. its not 
> propagated to users so it would be good to avoid.
> Couple of options:
> - Return Integer / Long (instead of primitive types) which can be set to NULL 
> if the conversion fails. This is boxing and super bad for perf so a big no.
> - Hive has a pre-check [1] for this which is not a perfect safety net but 
> helpful to capture typical bad inputs eg. empty string, "null".
> [0] : 
> https://github.com/apache/spark/blob/4ba9c6c453606f5e5a1e324d5f933d2c9307a604/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java#L950
> [1] : 
> https://github.com/apache/hive/blob/ff67cdda1c538dc65087878eeba3e165cf3230f4/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java#L90



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to