[ 
https://issues.apache.org/jira/browse/ARROW-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dragoș Moldovan-Grünfeld updated ARROW-13615:
---------------------------------------------
    Description: 
There is more to this issue than meets the eye. The 
{{stringr::str_to_sentence()}} does 2 things:
 * capitalise the first word 
 * if there are multiple sentences provided as a single string, attempts to 
find sentence breaks and capitalise the first word of each sentence.

The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which in 
turn uses ICU’s BreakIterator to locate specific text boundaries. As a 
consequence {{stringr::str_to_sentence()}} is not able to identify a full stop / 
period (".") as a sentence end and does not capitalise words following it. 
Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}} 
kernel (which capitalises the first word of a string without making any attempt 
to break into sentences) and the behaviour of {{stringr::str_to_sentence()}}.
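
To make the discrepancy concrete, here is a minimal Python sketch of the two behaviours. It is an illustration only. {{capitalize_first}} and {{capitalize_sentences}} are hypothetical helpers, not Arrow or stringr functions, and the sentence rule used here (a ".", "!" or "?" followed by whitespace) is a naive assumption that does not reproduce ICU's BreakIterator rules.

```python
import re


def capitalize_first(text: str) -> str:
    """Upper-case the first character and lower-case the rest.

    No sentence detection is attempted; this stands in for a
    capitalise-the-string kernel such as utf8_capitalize.
    """
    return text[:1].upper() + text[1:].lower()


# Assumed (naive) rule: a sentence starts at the beginning of the string
# or after '.', '!' or '?' followed by whitespace. ICU's rules differ.
_SENTENCE_START = re.compile(r"(^\s*|[.!?]\s+)(\w)")


def capitalize_sentences(text: str) -> str:
    """Lower-case the text, then upper-case the first letter of each sentence."""
    lowered = text.lower()
    return _SENTENCE_START.sub(
        lambda m: m.group(1) + m.group(2).upper(), lowered
    )


if __name__ == "__main__":
    sample = "hello there! how are you?"
    print(capitalize_first(sample))
    print(capitalize_sentences(sample))
```

Only the second helper touches the word after the "!", which is the kind of gap a binding built solely on a capitalise-first-word kernel would leave.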

For more extensive discussions around the {{stringi / stringr}} implementation 
see {{stringr}} issues [202|https://github.com/tidyverse/stringr/issues/202] 
and [231|https://github.com/tidyverse/stringr/issues/231].

Due to the complexity of this issue and the relatively niche use cases, the 
recommendation is to postpone implementation.

  was:
There is more to this issue than meets the eye. The 
{{stringr::str_to_sentence()}} does 2 things:
 * capitalise the first word 
 * if there are multiple sentences provided as a single string, attempts to 
find sentence breaks and capitalise the first word of each sentence.

The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which in 
turn uses ICU’s BreakIterator to locate specific text boundaries. As a 
consequence {{stringr::str_to_sentence()}} is not able to identify a full stop / 
period (".") as a sentence end and does not capitalise words following it. 
Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}} 
kernel (which capitalises the first word of a string without making any attempt 
to break into sentences) and the behaviour of {{stringr::str_to_sentence()}}.

For more extensive discussions around the {{stringi / stringr}} implementation 
see {{stringr}} issues [202|https://github.com/tidyverse/stringr/issues/202] 
and [231|https://github.com/tidyverse/stringr/issues/231].

Due to the complexity of this issue and the relatively niche use cases, the 
recommendation is to postpone implementation until someone requests it.


> [R] Bindings for stringr::str_to_sentence
> -----------------------------------------
>
>                 Key: ARROW-13615
>                 URL: https://issues.apache.org/jira/browse/ARROW-13615
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: R
>            Reporter: Nicola Crane
>            Assignee: Dragoș Moldovan-Grünfeld
>            Priority: Major
>              Labels: good-first-issue, kernel, pull-request-available
>             Fix For: 7.0.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> There is more to this issue than meets the eye. The 
> {{stringr::str_to_sentence()}} does 2 things:
>  * capitalise the first word 
>  * if there are multiple sentences provided as a single string, attempts to 
> find sentence breaks and capitalise the first word of each sentence.
> The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which 
> in turn uses ICU’s BreakIterator to locate specific text boundaries. As a 
> consequence {{stringr::str_to_sentence()}} is not able to identify a full stop / 
> period (".") as a sentence end and does not capitalise words following it. 
> Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}} 
> kernel (which capitalises the first word of a string without making any 
> attempt to break into sentences) and the behaviour of 
> {{stringr::str_to_sentence()}}.
> For more extensive discussions around the {{stringi / stringr}} 
> implementation see {{stringr}} issues 
> [202|https://github.com/tidyverse/stringr/issues/202] and 
> [231|https://github.com/tidyverse/stringr/issues/231].
> Due to the complexity of this issue and the relatively niche use cases, the 
> recommendation is to postpone implementation.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
