Nicholas Chammas created SPARK-3354:
---------------------------------------
Summary: Add LENGTH and DATALENGTH functions to Spark SQL
Key: SPARK-3354
URL: https://issues.apache.org/jira/browse/SPARK-3354
Project: Spark
Issue Type: Improvement
Components: SQL
Reporter: Nicholas Chammas
It's very common when working on data sets of strings to want to know the
lengths of the strings you are analyzing. Say I have some Tweets and want to
find the average length of a Tweet by language.
{code}
SELECT language, AVG(LEN(tweet)) AS avg_length
FROM tweets
GROUP BY language
ORDER BY avg_length DESC;
{code}
This is currently not possible because Spark SQL doesn't have a {{LEN()}}
function.
Another common function that would be useful is one that gives the size of the
data item in bytes. This can be handy when moving data from Spark SQL to
another system and you need to know how to size the receiving fields
appropriately.
*Proposal*
* Add a {{LENGTH}} function. Make {{LEN}} a synonym of {{LENGTH}}. This
function returns the number of characters in a string expression.
* Add a {{DATALENGTH}} function. Make {{DATALEN}} a synonym of {{DATALENGTH}}.
This function returns the number of bytes in any expression.
*Special care* must be given to the following cases:
* multi-byte characters
* {{NULL}}
* trailing spaces
*Examples*
These are suggestions for how these 2 functions should work.
{code}
LEN('Hello') -> 5
LEN('안녕') -> 2
LEN('Hello 안녕') -> 8
LEN(NULL) -> NULL
LEN('') -> 0
LEN('Bob ') -> 3
{code}
In this last example with {{'Bob '}}, trailing spaces are ignored. This
matches the [behavior of SQL
Server|http://msdn.microsoft.com/en-us/library/ms190329.aspx], but we could opt
to include the spaces.
{code}
DATALEN('Hello') -> 5
DATALEN('안녕') -> 4
DATALEN('Hello 안녕') -> 16
DATALEN(NULL) -> NULL
DATALEN('') -> 0
DATALEN('Bob ') -> 5
{code}
Note here how mixing English and Korean characters causes every character to be
interpreted as a 2 byte wide character. Dunno if this sane; this may depend on
Scala or JVM details that I wouldn't know about at the moment.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]