On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke <and...@acooke.org> wrote:
>
> ok, so match(regex, string, index) solves the problem.  presumably it exists
> exactly for this reason....?

At least I think this is a valid use case.

>
> andrew
>
>
> On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:
>>
>>
>> hmm.  ignore that last statement (same problem).  still checking /
>> confused.  sorry.
>>
>> On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:
>>>
>>>
>>> i think that returns a substring (or a view onto the backing string).

```
julia> typeof("aaa"[2:end])
ASCIIString

julia> SubString("aaa", 2, 3)
"aa"

julia> typeof(SubString("aaa", 2, 3))
SubString{ASCIIString}
```

>>> but i am not sure.  i did read a discussion somewhere saying that because of
>>> this you should use bytestring(...) before passing a string to c.  which is
>>> all the evidence i have for my guess.
>>>
>>> incidentally, match(...) has a method that takes the offset to start at
>>> as an argument.  so i can avoid s[i:end] and just pass i into match (i just
>>> found this).
>>>
>>> however, somewhat surprisingly, it also has the same problem.
>>>
>>> andrew
>>>
>>>
>>> On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:
>>>>
>>>> On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash <vtj...@gmail.com> wrote:
>>>> > does `copy` work? although `bytestring` also seems like a good method
>>>> > for this also. it seems wrong to me also that `match` is making a copy
>>>> > of the original string (if that is indeed what it is doing)
>>>>
>>>> Isn't it `s[i:end]` that is doing the copy?
>>>>
>>>> >
>>>> > On Tue, Jul 21, 2015 at 6:57 PM andrew cooke <and...@acooke.org>
>>>> > wrote:
>>>> >>
>>>> >>
>>>> >> string(bytestring(...)) seems to do it.  would appreciate any more
>>>> >> efficient solutions (and confirmation the analysis is correct - is
>>>> >> this worth filing as an issue?)
>>>> >>
>>>> >>
>>>> >> On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:
>>>> >>>
>>>> >>>
>>>> >>> well, this was fun...
>>>> >>> the following code rapidly triggers the OOM killer on my machine
>>>> >>> (julia 0.4 trunk):
>>>> >>>
>>>> >>> s = repeat("a", 1000000)
>>>> >>> l = Any[]
>>>> >>> r = r"^\w"
>>>> >>>
>>>> >>> for i in 1:length(s)
>>>> >>>     m = match(r, s[i:end])
>>>> >>>     push!(l, m.match)
>>>> >>> end
>>>> >>>
>>>> >>> note that: (1) the regexp is only matching one character, so the
>>>> >>> array l is at most a million characters long.
>>>> >>>
>>>> >>> what i think is happening (but this is only a guess) is that
>>>> >>> s[i:end] is being passed through to the c level regexp library as a
>>>> >>> new string.  the result (m.match) is then a substring into that.
>>>> >>> because the substring is kept around, the backing string cannot be
>>>> >>> collected.  and so there's an n^2 memory use.
>>>> >>>
>>>> >>> ideally, i don't think a new copy of the string should be passed to
>>>> >>> the regexp engine.  maybe i am wrong?
>>>> >>>
>>>> >>> anyway, for now, if the above is right, i need some way to copy
>>>> >>> m.match.  as far as i can tell string() doesn't help.  so what works?
>>>> >>> or am i wrong?
>>>> >>>
>>>> >>> thanks,
>>>> >>> andrew
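
For reference, here is a minimal sketch of the approach the thread converges on: pass the start index to `match` instead of slicing with `s[i:end]`, and copy each match so the stored value does not reference a larger string. It is written for Julia 1.x rather than the 0.4 build discussed above. That version, the shorter test string, and the `String[]` element type are all assumptions made for illustration. The pattern also drops the `^` anchor, because the thread does not establish how `^` behaves when a start index is given.

```julia
# Sketch for Julia 1.x (assumption: the thread itself used a 0.4 development build).
s = repeat("a", 1000)   # shorter than the thread's 1000000 so it runs quickly
r = r"\w"               # unanchored on purpose, see the note above
l = String[]

for i in eachindex(s)
    # Start the search at index i instead of building s[i:end].
    m = match(r, s, i)
    m === nothing && break
    # m.match is a SubString that references s; String(...) makes an
    # independent copy, so nothing stored in l keeps another string alive.
    push!(l, String(m.match))
end
```

With this shape `l` should hold one single-character `String` per index. Whether the explicit `String(...)` copy is needed once slicing is avoided is not settled in the thread, so treat it as a defensive choice.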