[ 
https://issues.apache.org/jira/browse/ARROW-12712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406451#comment-17406451
 ] 

Eduardo Ponce commented on ARROW-12712:
---------------------------------------

String repeat comes in several variants across languages:
 1. Python, SQL - support a single integer value for the *number of repeats*. 
All strings are repeated the same number of times.
 2. Support a sequence of integers where each value is the *number of repeats* 
for the string at the same index. This may be complicated in Arrow when 
operating on unordered batches of data.
 3. Pandas, R - support both (1) and (2).
 4. R - also allows the number of input strings and the number of repeat 
values to differ.

Based on this, I propose that the string repeat kernel in Arrow C++ support 
case (3).
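
A minimal sketch of what case (3) could look like, written in plain Python to model the semantics only. The name `repeat_strings` is hypothetical; it is not an existing Arrow or pandas function, and the length check is an assumption modeled on the pandas behavior shown further down.

```python
from typing import List, Sequence, Union


def repeat_strings(strings: Sequence[str],
                   repeats: Union[int, Sequence[int]]) -> List[str]:
    """Hypothetical model of case (3); not an Arrow API."""
    if isinstance(repeats, int):
        # Case (1): one repeat count applied to every string.
        return [s * repeats for s in strings]
    # Case (2): one repeat count per string, matched by index.
    if len(repeats) != len(strings):
        raise ValueError(
            f"length mismatch: {len(strings)} strings, {len(repeats)} repeats")
    return [s * n for s, n in zip(strings, repeats)]


print(repeat_strings(["a", "b"], 2))       # ['aa', 'bb']
print(repeat_strings(["a", "b"], [2, 3]))  # ['aa', 'bbb']
```

In this sketch a mismatched sequence length raises an error rather than recycling values, which is what separates case (3) from case (4).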

Python:
{code:python}
>>> 'a' * 2  # 'aa'
>>> 'b' * 3  # 'bbb'
{code}
[Pandas has the str.repeat 
function|https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.repeat.html]:
{code:python}
>>> import pandas as pd
>>> s = pd.Series(['a', 'b'])
>>> s.str.repeat(2)          # ['aa', 'bb']
>>> s.str.repeat([2, 3])     # ['aa', 'bbb']
>>> s.str.repeat([2, 3, 4])  # Error: different length arrays
>>> s.str.repeat([2])        # Error: different length arrays
{code}
[SQL Server has the REPLICATE 
function|https://www.w3schools.com/sqL/func_sqlserver_replicate.asp]:
{code:sql}
SELECT REPLICATE('a', 2);  -- 'aa'
SELECT REPLICATE('b', 3);  -- 'bbb'
{code}
[R has the strrep 
function|https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/strrep]:
{code:r}
> strrep(c('a', 'b'), 2)             # 'aa', 'bb'
> strrep(c('a', 'b'), c(2, 3))       # 'aa', 'bbb'
# R recycles the shorter input when the sequence lengths differ
> strrep(c('a', 'b'), c(2, 3, 4))    # 'aa', 'bbb', 'aaaa'
> strrep(c('a', 'b', 'c'), c(2, 3))  # 'aa', 'bbb', 'cc'
{code}
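
For contrast with case (3), the recycling in case (4) can be modeled in a few lines of Python. This is only an illustrative sketch: `repeat_recycled` is a hypothetical name, and returning an empty list when either input is empty is an assumption of the sketch.

```python
from itertools import cycle, islice
from typing import List, Sequence


def repeat_recycled(strings: Sequence[str],
                    repeats: Sequence[int]) -> List[str]:
    """Hypothetical model of case (4): recycle the shorter input."""
    if not strings or not repeats:
        # Assumption: an empty input yields an empty result.
        return []
    # The output is as long as the longer of the two inputs.
    n = max(len(strings), len(repeats))
    pairs = islice(zip(cycle(strings), cycle(repeats)), n)
    return [s * k for s, k in pairs]


print(repeat_recycled(["a", "b"], [2, 3, 4]))    # ['aa', 'bbb', 'aaaa']
print(repeat_recycled(["a", "b", "c"], [2, 3]))  # ['aa', 'bbb', 'cc']
```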

> [C++] String repeat kernel
> --------------------------
>
>                 Key: ARROW-12712
>                 URL: https://issues.apache.org/jira/browse/ARROW-12712
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: C++
>            Reporter: Ian Cook
>            Assignee: Eduardo Ponce
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 6.0.0
>
>          Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Like SQL {{replicate}} or Python {{'string' * n}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
