[
https://issues.apache.org/jira/browse/SPARK-3354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14118771#comment-14118771
]
Nicholas Chammas commented on SPARK-3354:
-----------------------------------------
[~marmbrus] - This one's for your radar.
> Add LENGTH and DATALENGTH functions to Spark SQL
> ------------------------------------------------
>
> Key: SPARK-3354
> URL: https://issues.apache.org/jira/browse/SPARK-3354
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Reporter: Nicholas Chammas
>
> It's very common when working on data sets of strings to want to know the
> lengths of the strings you are analyzing. Say I have some Tweets and want to
> find the average length of a Tweet by language.
> {code}
> SELECT language, AVG(LEN(tweet)) AS avg_length
> FROM tweets
> GROUP BY language
> ORDER BY avg_length DESC;
> {code}
> This is currently not possible because Spark SQL doesn't have a {{LEN()}}
> function.
> Another common function that would be useful is one that gives the size of
> the data item in bytes. This can be handy when moving data from Spark SQL to
> another system and you need to know how to size the receiving fields
> appropriately.
> *Proposal*
> * Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This
> function returns the number of characters in a string expression.
> * Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of
> {{DATALENGTH}}. This function returns the number of bytes in any expression.
> *Special care* must be given to the following cases:
> * multi-byte characters
> * {{NULL}}
> * trailing spaces
> *Examples*
> These are suggestions for how these 2 functions should work.
> {code}
> LEN('Hello') -> 5
> LEN('안녕') -> 2
> LEN('Hello 안녕') -> 8
> LEN(NULL) -> NULL
> LEN('') -> 0
> LEN('Bob ') -> 3
> {code}
> In this last example with {{'Bob '}}, trailing spaces are ignored. This
> matches the [behavior of SQL
> Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could
> opt to include the spaces.
> {code}
> DATALEN('Hello') -> 5
> DATALEN('안녕') -> 4
> DATALEN('Hello 안녕') -> 16
> DATALEN(NULL) -> NULL
> DATALEN('') -> 0
> DATALEN('Bob ') -> 5
> {code}
> Note here how mixing English and Korean characters causes every character to
> be interpreted as a 2 byte wide character. Dunno if this sane; this may
> depend on Scala or JVM details that I wouldn't know about at the moment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]