Re: [julia-users] Re: split a utf-8 string

Dan Sun, 22 Nov 2015 03:58:18 -0800

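`split` already has hooks for custom splitters: it locates each delimiter by 
calling `search` on the splitter, so any type with a suitable `search` method 
will work. To split on Unicode whitespace, one can wrap the `isspace` 
predicate in a small type and define `search` for it. For example: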


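# Wraps a predicate; characters for which it returns true act as delimiters.
type FuncSplitter
       pred::Function
end
# Return the index of the first character at or after `start` that satisfies
# the predicate, or 0 if there is none.
function Base.search(s::AbstractString, splt::FuncSplitter, start::Int = 1)
       i = start
       n = endof(s)
       while i <= n
           if splt.pred(s[i])
               return i
           end
           i = nextind(s,i)   # step to the next valid character index
       end
       return 0
end
spacesplitter = FuncSplitter(isspace)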

s = "Ｔｉｍｅ ｆｌｉｅｓ ｌｉｋｅ ａｎ ａｒｒｏｗ"
split(s,spacesplitter)

Gives:
5-element Array{SubString{UTF8String},1}:
 "Ｔｉｍｅ" 
 "ｆｌｉｅｓ"
 "ｌｉｋｅ" 
 "ａｎ"   
 "ａｒｒｏｗ"

This isn't fully optimized, but probably suffices for many uses.
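
For anyone reading this on a newer Julia: as far as I know, Julia 1.x lets 
you pass a predicate straight to `split`, so the wrapper type above should not 
be needed there. A minimal sketch, assuming Julia 1.x (not the 0.4 syntax used 
above) and assuming the separators really are '\u3000', which I write out 
explicitly here:

```julia
# Assumption: Julia 1.x. The FuncSplitter code above targets Julia 0.4.
# The separators are written explicitly as U+3000 (ideographic space).
s = "Ｔｉｍｅ\u3000ｆｌｉｅｓ\u3000ｌｉｋｅ\u3000ａｎ\u3000ａｒｒｏｗ"

# isspace treats U+3000 as whitespace.
ideographic_is_space = isspace('\u3000')

# A predicate can be passed directly as the splitter.
words = split(s, isspace)

# Splitting on an ASCII space finds no delimiter, because U+3000 is a
# different character, so the string comes back in one piece.
unsplit = split(s, ' ')
```

Here `words` holds the five pieces, while `unsplit` has a single element. 
That second result is the behaviour the original question ran into.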

On Sunday, November 22, 2015 at 1:29:46 PM UTC+2, Pontus Stenetorp wrote:
>
> On 22 November 2015 at 01:46,  <[email protected]> wrote: 
> > 
> > On Sunday, November 22, 2015 at 10:02:03 AM UTC+10, James Gilbert wrote: 
> >> 
> >> The spaces in your string are '\u3000' the ideographic space. 
> >> isspace('\u3000') returns true, and split(s) is supposed to split on all 
> >> space characters, so I think this might be a julia bug. 
> > 
> > Or a documentation bug, the actual default is only the ASCII spaces 
> > https://github.com/JuliaLang/julia/blob/master/base/strings/util.jl#L62 
>
> It should probably be pointed out that at least Python3 (but not 
> Python2) gets it "right". 
>
>     > python3 
>     Python 3.4.3+ (default, Oct 14 2015, 16:03:50) 
>     [GCC 5.2.1 20151010] on linux 
>     Type "help", "copyright", "credits" or "license" for more information. 
>     >>> "Ｔｉｍｅ ｆｌｉｅｓ ｌｉｋｅ ａｎ ａｒｒｏｗ".split() 
>     ['Ｔｉｍｅ', 'ｆｌｉｅｓ', 'ｌｉｋｅ', 'ａｎ', 'ａｒｒｏｗ'] 
>
> I would argue that Unicode is a first class citizen and that Julia 
> should also get this "right".  This would require some fairly 
> straightforward, yet not trivial, tinkering and would be an excellent 
> first contribution if someone wants to take a stab at it. 
>
>     Pontus 
>
