[
https://issues.apache.org/jira/browse/ARROW-13615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Dragoș Moldovan-Grünfeld updated ARROW-13615:
---------------------------------------------
Description:
There is more to this issue than meets the eye.
{{stringr::str_to_sentence()}} does two things:
* capitalises the first word
* if multiple sentences are provided as a single string, attempts to
find sentence breaks and capitalise the first word of each sentence.
The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which in
turn uses ICU’s BreakIterator to locate specific text boundaries. As a
consequence {{stringr::str_to_sentence()}} is not able to identify a full stop /
period (".") as a sentence end and does not capitalise words following it.
Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}}
kernel (which capitalises the first word of a string without making any attempt
to break into sentences) and the behaviour of {{stringr::str_to_sentence()}}.
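To make the gap concrete, here is a minimal Python sketch. It is not Arrow, stringr or ICU code: {{capitalize_first}} and {{capitalize_sentences}} are hypothetical stand-ins, and the boundary rule (a ".", "!" or "?" followed by whitespace) is an assumption chosen for illustration only. ICU's BreakIterator applies more involved rules, so real {{str_to_sentence()}} output may differ from the second function.

```python
import re

# Assumed, deliberately naive boundary rule: whitespace that follows
# '.', '!' or '?'. The whitespace is captured so it can be re-joined.
_BOUNDARY = re.compile(r"(?<=[.!?])(\s+)")


def capitalize_first(text: str) -> str:
    """Capitalize-style behaviour: uppercase the first character of the
    whole string and lowercase the rest, with no sentence detection."""
    return text[:1].upper() + text[1:].lower()


def capitalize_sentences(text: str) -> str:
    """Sentence-style behaviour: split on the naive boundary rule and
    capitalise each piece separately."""
    pieces = _BOUNDARY.split(text)
    # Even indices hold sentence text, odd indices hold the captured whitespace.
    return "".join(
        capitalize_first(piece) if index % 2 == 0 else piece
        for index, piece in enumerate(pieces)
    )


if __name__ == "__main__":
    sample = "one two. three four"
    print(capitalize_first(sample))
    print(capitalize_sentences(sample))
```

The two functions agree on single-sentence input and diverge only once a string holds more than one sentence, which is the case this issue is about.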
For more extensive discussions around the {{stringi / stringr}} implementation
see {{stringr}} issues [202|https://github.com/tidyverse/stringr/issues/202]
and [231|https://github.com/tidyverse/stringr/issues/231].
Due to the complexity of this issue and the relatively niche use cases, the
recommendation is to postpone implementation.
was:
There is more to this issue than meets the eye.
{{stringr::str_to_sentence()}} does two things:
* capitalises the first word
* if multiple sentences are provided as a single string, attempts to
find sentence breaks and capitalise the first word of each sentence.
The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which in
turn uses ICU’s BreakIterator to locate specific text boundaries. As a
consequence {{stringr::str_to_sentence()}} is not able to identify a full stop /
period (".") as a sentence end and does not capitalise words following it.
Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}}
kernel (which capitalises the first word of a string without making any attempt
to break into sentences) and the behaviour of {{stringr::str_to_sentence()}}.
For more extensive discussions around the {{stringi / stringr}} implementation
see {{stringr}} issues [202|https://github.com/tidyverse/stringr/issues/202]
and [231|https://github.com/tidyverse/stringr/issues/231].
Due to the complexity of this issue and the relatively niche use cases, the
recommendation is to postpone implementation until someone requests it.
> [R] Bindings for stringr::str_to_sentence
> -----------------------------------------
>
> Key: ARROW-13615
> URL: https://issues.apache.org/jira/browse/ARROW-13615
> Project: Apache Arrow
> Issue Type: Improvement
> Components: R
> Reporter: Nicola Crane
> Assignee: Dragoș Moldovan-Grünfeld
> Priority: Major
> Labels: good-first-issue, kernel, pull-request-available
> Fix For: 7.0.0
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> There is more to this issue than meets the eye.
> {{stringr::str_to_sentence()}} does two things:
> * capitalises the first word
> * if multiple sentences are provided as a single string, attempts to
> find sentence breaks and capitalise the first word of each sentence.
> The {{stringr}} implementation wraps {{stringi::stri_trans_totitle()}}, which
> in turn uses ICU’s BreakIterator to locate specific text boundaries. As a
> consequence {{stringr::str_to_sentence()}} is not able to identify a full stop /
> period (".") as a sentence end and does not capitalise words following it.
> Thus, there is a discrepancy between the behaviour of the {{utf8_capitalize}}
> kernel (which capitalises the first word of a string without making any
> attempt to break into sentences) and the behaviour of
> {{stringr::str_to_sentence()}}.
> For more extensive discussions around the {{stringi / stringr}}
> implementation see {{stringr}} issues
> [202|https://github.com/tidyverse/stringr/issues/202] and
> [231|https://github.com/tidyverse/stringr/issues/231].
> Due to the complexity of this issue and the relatively niche use cases, the
> recommendation is to postpone implementation.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)