[
https://issues.apache.org/jira/browse/ARROW-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406451#comment-17406451
]
Eduardo Ponce edited comment on ARROW-12712 at 8/29/21, 6:17 PM:
-----------------------------------------------------------------
There are several variants of string repeats across languages:
1. Python, SQL - support a single integer value for *number of repeats*. All
strings are replicated the same number of times.
2. Support a sequence of integers where each value is the *number of repeats*
for the string at the same index. This may be complicated in Arrow when
operating on unordered batches of data.
3. Pandas, R - support (1) and (2).
4. R - allows a different number of input strings and repeat values; the
shorter sequence is recycled.
Below are examples of each language's API and the variants of string repeat
it supports.
Based on this, I propose that the string repeat in Arrow C++ support case (3)
and, for case (2), treat the input as invalid if the number of strings and the
number of repeats differ.
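The proposed semantics can be sketched in plain Python. This is only an
illustration: {{repeat_strings}} is a hypothetical helper name, not an Arrow
API, and the sketch ignores nulls and Arrow's array types entirely.

```python
from typing import List, Sequence, Union


def repeat_strings(strings: Sequence[str],
                   repeats: Union[int, Sequence[int]]) -> List[str]:
    """Hypothetical helper illustrating the proposed semantics (not an Arrow API)."""
    if isinstance(repeats, int):
        # Case (1): a single count is applied to every string.
        return [s * repeats for s in strings]
    # Case (2): one count per string; mismatched lengths are rejected.
    if len(strings) != len(repeats):
        raise ValueError(
            f"number of strings ({len(strings)}) and number of repeats "
            f"({len(repeats)}) differ"
        )
    return [s * n for s, n in zip(strings, repeats)]


print(repeat_strings(['a', 'b'], 2))       # ['aa', 'bb']
print(repeat_strings(['a', 'b'], [2, 3]))  # ['aa', 'bbb']
```

Calling {{repeat_strings(['a', 'b'], [2, 3, 4])}} raises {{ValueError}} in this
sketch, mirroring the Pandas behavior rather than R's recycling.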
Python:
{code:python}
>>> 'a' * 2 # 'aa'
>>> 'b' * 3 # 'bbb'
{code}
[Pandas has str.repeat
function|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.repeat.html]:
{code:python}
>>> import pandas as pd
>>> s = pd.Series(['a', 'b'])
>>> s.str.repeat(2) # ['aa', 'bb']
>>> s.str.repeat([2, 3]) # ['aa', 'bbb']
>>> s.str.repeat([2, 3, 4]) # Error: different length arrays
>>> s.str.repeat([2]) # Error: different length arrays
{code}
[SQL Server has replicate
function|https://www.w3schools.com/sqL/func_sqlserver_replicate.asp]:
{code:sql}
SELECT REPLICATE('a', 2); -- 'aa'
SELECT REPLICATE('b', 3); -- 'bbb'
{code}
[R has strrep
function|https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strrep]:
{code:java}
> strrep(c('a', 'b'), 2) # 'aa', 'bb'
> strrep(c('a', 'b'), c(2, 3)) # 'aa', 'bbb'
# R cycles strings/repeats if the lengths of the sequences differ
> strrep(c('a', 'b'), c(2, 3, 4)) # 'aa', 'bbb', 'aaaa'
> strrep(c('a', 'b', 'c'), c(2, 3)) # 'aa', 'bbb', 'cc'
{code}
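For comparison, the recycling behavior of case (4) can be mimicked in Python.
This is a rough sketch, not R's implementation: {{repeat_recycled}} is a
hypothetical name, and returning an empty list for empty input is an
assumption of the sketch.

```python
from typing import List, Sequence


def repeat_recycled(strings: Sequence[str],
                    repeats: Sequence[int]) -> List[str]:
    """Hypothetical helper mimicking R-style recycling (not an Arrow or R API)."""
    if not strings or not repeats:
        # Assumption of this sketch: empty input gives an empty result.
        return []
    # The result is as long as the longer input; the shorter one wraps around.
    n = max(len(strings), len(repeats))
    return [strings[i % len(strings)] * repeats[i % len(repeats)]
            for i in range(n)]


print(repeat_recycled(['a', 'b'], [2, 3, 4]))    # ['aa', 'bbb', 'aaaa']
print(repeat_recycled(['a', 'b', 'c'], [2, 3]))  # ['aa', 'bbb', 'cc']
```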
> [C++] String repeat kernel
> --------------------------
>
> Key: ARROW-12712
> URL: https://issues.apache.org/jira/browse/ARROW-12712
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Ian Cook
> Assignee: Eduardo Ponce
> Priority: Major
> Labels: pull-request-available
> Fix For: 6.0.0
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> Like SQL {{replicate}} or Python {{'string' * n}}