`split` already has hooks for plugging in other splitters: it locates
delimiters by calling `search` on the splitter. To split on every Unicode
whitespace character, one can wrap the `isspace` function in a small type
and add a `search` method for it. For example:
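# Wraps a predicate that decides whether a character is a delimiter.
type FuncSplitter
    pred::Function
end

# Return the index of the first character at or after `start` that
# satisfies the predicate, or 0 if there is none.
function Base.search(s::AbstractString, splt::FuncSplitter, start::Int = 1)
    i = start
    n = endof(s)
    while i <= n
        if splt.pred(s[i])
            return i
        end
        i = nextind(s, i)   # step by character, not by byte
    end
    return 0
end

spacesplitter = FuncSplitter(isspace)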
s = "Time\u3000flies\u3000like\u3000an\u3000arrow"   # '\u3000' separators, as in the original string
split(s,spacesplitter)
Gives:
5-element Array{SubString{UTF8String},1}:
"Time"
"flies"
"like"
"an"
"arrow"
This isn't fully optimized, but probably suffices for many uses.
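
For anyone reading this on a newer Julia: as far as I know, the 1.x
releases let you pass a predicate straight to `split`, so the wrapper
type above should not be needed there. A minimal sketch, assuming Julia
1.x (that version is my assumption, it is not what the code above was
written against):

```julia
# Assumes Julia 1.x, where `split` accepts a predicate as the delimiter.
s = "Time\u3000flies\u3000like\u3000an\u3000arrow"

# Split wherever `isspace` returns true; '\u3000' counts as whitespace.
words = split(s, isspace)

# With no delimiter argument, `split` also falls back to `isspace`.
default_words = split(s)

println(words)
println(words == default_words)
```

Both calls should agree on this string because it has no runs of
consecutive whitespace; with a delimiter given, empty fields are kept by
default, so the two can differ on other input.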
On Sunday, November 22, 2015 at 1:29:46 PM UTC+2, Pontus Stenetorp wrote:
>
> On 22 November 2015 at 01:46, <[email protected]> wrote:
> >
> > On Sunday, November 22, 2015 at 10:02:03 AM UTC+10, James Gilbert wrote:
> >>
> >> The spaces in your string are '\u3000', the ideographic space.
> >> isspace('\u3000') returns true, and split(s) is supposed to split on
> >> all space characters, so I think this might be a julia bug.
> >
> > Or a documentation bug, the actual default is only the ASCII spaces
> > https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L62
>
> It should probably be pointed out that at least Python3 (but not
> Python2) gets it "right".
>
> > python3
> Python 3.4.3+ (default, Oct 14 2015, 16:03:50)
> [GCC 5.2.1 20151010] on linux
> Type "help", "copyright", "credits" or "license" for more information.
> >>> "Time\u3000flies\u3000like\u3000an\u3000arrow".split()
> ['Time', 'flies', 'like', 'an', 'arrow']
>
> I would argue that Unicode is a first class citizen and that Julia
> should also get this "right". This would require some fairly
> straightforward, yet not trivial, tinkering and would be an excellent
> first contribution if someone wants to take a stab at it.
>
> Pontus
>