[ 
https://issues.apache.org/jira/browse/SPARK-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118771#comment-14118771
 ] 

Nicholas Chammas commented on SPARK-3354:
-----------------------------------------

[~marmbrus] - This one's for your radar.

> Add LENGTH and DATALENGTH functions to Spark SQL
> ------------------------------------------------
>
>                 Key: SPARK-3354
>                 URL: https://issues.apache.org/jira/browse/SPARK-3354
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Nicholas Chammas
>
> It's very common when working on data sets of strings to want to know the 
> lengths of the strings you are analyzing. Say I have some Tweets and want to 
> find the average length of a Tweet by language.
> {code}
> SELECT language, AVG(LEN(tweet)) AS avg_length
> FROM tweets
> GROUP BY language
> ORDER BY avg_length DESC;
> {code}
> This is currently not possible because Spark SQL doesn't have a {{LEN()}} 
> function.
> Another common function that would be useful is one that gives the size of 
> the data item in bytes. This can be handy when moving data from Spark SQL to 
> another system and you need to know how to size the receiving fields 
> appropriately.
> *Proposal*
> * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This 
> function returns the number of characters in a string expression.
> * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of 
> {{DATALENGTH}}. This function returns the number of bytes in any expression.
> *Special care* must be given to the following cases:
> * multi-byte characters
> * {{NULL}}
> * trailing spaces
> *Examples*
> These are suggestions for how these 2 functions should work.
> {code}
> LEN('Hello') -> 5
> LEN('안녕') -> 2
> LEN('Hello 안녕') -> 8
> LEN(NULL) -> NULL
> LEN('') -> 0
> LEN('Bob  ') -> 3
> {code}
> In this last example with {{'Bob  '}}, trailing spaces are ignored. This 
> matches the [behavior of SQL 
> Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could 
> opt to include the spaces.
> {code}
> DATALEN('Hello') -> 5
> DATALEN('안녕') -> 4
> DATALEN('Hello 안녕') -> 16
> DATALEN(NULL) -> NULL
> DATALEN('') -> 0
> DATALEN('Bob  ') -> 5
> {code}
> Note here how mixing English and Korean characters causes every character to 
> be interpreted as a 2 byte wide character. Dunno if this sane; this may 
> depend on Scala or JVM details that I wouldn't know about at the moment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to