On Tue, Jul 21, 2015 at 8:19 PM, andrew cooke <and...@acooke.org> wrote:
>
> maybe it should check to see if the end matches?

My guess is that this optimization won't have much benefit, because in
most cases you can just call `match` with a start idx.
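
To make the difference concrete, here is a minimal sketch. It assumes a
current Julia 1.x release rather than the 0.4 trunk discussed in this
thread, and it uses an unanchored pattern on purpose, since how `^`
interacts with a start index is not covered here. The variable names are
only for illustration.

```julia
s = repeat("a", 10)
re = r"\w"  # unanchored on purpose (assumption for this sketch)

# Slicing first: the match is a view onto the freshly allocated slice.
m_slice = match(re, s[4:end])

# Passing a start index: the match is a view onto the original string.
m_index = match(re, s, 4)

# `.string` is the field of SubString that holds the backing string.
println(typeof(m_index.match))
println("backing string sizes: ",
        ncodeunits(m_slice.match.string), " vs ",
        ncodeunits(m_index.match.string))
```

With the start index, no per-iteration slice is created, so keeping
`m.match` around only keeps the one original string alive.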

>
> @glen - i think that will construct an ascii string, which isn't what you
> want if the underlying data are unicode.  i've always assumed string() does
> something smart and returns the "right thing", but haven't checked...

And from the code I pasted, `utf8` should probably do the copy you want.
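
As a rough sketch of the "copy the match" workaround: this assumes a
current Julia 1.x release, where `String(...)` on a `SubString` allocates
a new string, instead of the 0.4-era `utf8`/`bytestring` functions named
in this thread. The names `kept` and `re` are made up for the example.

```julia
s = repeat("a", 1000)
re = r"\w"
kept = String[]

for i in 1:length(s)
    m = match(re, s[i:end])
    # m.match is a SubString that references the slice s[i:end].
    # String(...) copies just the matched text, so the slice itself
    # is no longer referenced once this iteration ends.
    push!(kept, String(m.match))
end

println(length(kept))
```

Only the one-character copies are stored, so the slices can be collected
and memory use stays linear in the number of matches.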

>
> andrew
>
>
> On Tuesday, 21 July 2015 20:49:12 UTC-3, Yichao Yu wrote:
>>
>> On Tue, Jul 21, 2015 at 7:38 PM, andrew cooke <and...@acooke.org> wrote:
>> >
>> > ah.  for some reason i was thinking they were invisible (somewhere below
>> > julia).
>> >
>> > ok, thanks.  so that explains things more clearly....
>> >
>> > ...except that(!) using SubString(s, i, endof(s)) and passing *that* to
>> > match still gives the memory issue.
>>
>> Hmmm,
>>
>> ```
>> match(re::Regex, str::Union{ByteString,SubString}, idx::Integer,
>> add_opts::UInt32=UInt32(0)) =
>>     match(re, utf8(str), idx, add_opts)
>> ```
>>
>> So match on a substring does a copy. I'm guessing this is because pcre
>> expects a C string (i.e. NUL-terminated?)
>>
>> >
>> > so there's still something odd that i don't understand.  maybe it's just
>> > that the regexp lib doesn't know about SubString.
>> >
>> > andrew
>> >
>> >
>> >
>> > On Tuesday, 21 July 2015 20:32:53 UTC-3, Yichao Yu wrote:
>> >>
>> >> On Tue, Jul 21, 2015 at 7:26 PM, andrew cooke <and...@acooke.org>
>> >> wrote:
>> >> >
>> >> > ok, so match(regex, string, index) solves the problem.  presumably it
>> >> > exists exactly for this reason....?
>> >>
>> >> At least I think this is a valid use case.
>> >>
>> >> >
>> >> > andrew
>> >> >
>> >> >
>> >> > On Tuesday, 21 July 2015 20:23:57 UTC-3, andrew cooke wrote:
>> >> >>
>> >> >>
>> >> >> hmm.  ignore that last statement (same problem).  still checking /
>> >> >> confused.  sorry.
>> >> >>
>> >> >> On Tuesday, 21 July 2015 20:20:46 UTC-3, andrew cooke wrote:
>> >> >>>
>> >> >>>
>> >> >>> i think that returns a substring (ie a view onto the backing
>> >> >>> string).
>> >>
>> >> ```
>> >> julia> typeof("aaa"[2:end])
>> >> ASCIIString
>> >>
>> >> julia> SubString("aaa", 2, 3)
>> >> "aa"
>> >>
>> >> julia> typeof(SubString("aaa", 2, 3))
>> >> SubString{ASCIIString}
>> >> ```
>> >>
>> >> >>> but i am not sure.  i did read a discussion somewhere saying that
>> >> >>> because of this you should use bytestring(...) before passing a
>> >> >>> string to c.  which is all the evidence i have for my guess.
>> >> >>>
>> >> >>> incidentally, match(...) has a method that takes the offset to start
>> >> >>> at as an argument.  so i can avoid s[i:end] and just pass i into
>> >> >>> match (i just found this).
>> >> >>>
>> >> >>> however, somewhat surprisingly, it also has the same problem.
>> >> >>>
>> >> >>> andrew
>> >> >>>
>> >> >>>
>> >> >>> On Tuesday, 21 July 2015 20:15:58 UTC-3, Yichao Yu wrote:
>> >> >>>>
>> >> >>>> On Tue, Jul 21, 2015 at 7:08 PM, Jameson Nash <vtj...@gmail.com>
>> >> >>>> wrote:
>> >> >>>> > does `copy` work? although `bytestring` also seems like a good
>> >> >>>> > method for this. it seems wrong to me also that `match` is making
>> >> >>>> > a copy of the original string (if that is indeed what it is doing)
>> >> >>>>
>> >> >>>> Isn't it `s[i:end]` that is doing the copy?
>> >> >>>>
>> >> >>>> >
>> >> >>>> > On Tue, Jul 21, 2015 at 6:57 PM andrew cooke <and...@acooke.org>
>> >> >>>> > wrote:
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> string(bytestring(...)) seems to do it.  would appreciate any more
>> >> >>>> >> efficient solutions (and confirmation the analysis is correct - is
>> >> >>>> >> this worth filing as an issue?)
>> >> >>>> >>
>> >> >>>> >>
>> >> >>>> >> On Tuesday, 21 July 2015 19:33:05 UTC-3, andrew cooke wrote:
>> >> >>>> >>>
>> >> >>>> >>>
>> >> >>>> >>> well, this was fun...  the following code rapidly triggers the OOM
>> >> >>>> >>> killer on my machine (julia 0.4 trunk):
>> >> >>>> >>>
>> >> >>>> >>> s = repeat("a", 1000000)
>> >> >>>> >>> l = Any[]
>> >> >>>> >>> r = r"^\w"
>> >> >>>> >>>
>> >> >>>> >>> for i in 1:length(s)
>> >> >>>> >>>     m = match(r, s[i:end])
>> >> >>>> >>>     push!(l, m.match)
>> >> >>>> >>> end
>> >> >>>> >>>
>> >> >>>> >>> note that: (1) the regexp is only matching one character, so the
>> >> >>>> >>> array l is at most a million characters long.
>> >> >>>> >>>
>> >> >>>> >>> what i think is happening (but this is only a guess) is that
>> >> >>>> >>> s[i:end] is being passed through to the c level regexp library as a
>> >> >>>> >>> new string.  the result (m.match) is then a substring into that.
>> >> >>>> >>> because the substring is kept around, the backing string cannot be
>> >> >>>> >>> collected.  and so there's an n^2 memory use.
>> >> >>>> >>>
>> >> >>>> >>> ideally, i don't think a new copy of the string should be passed to
>> >> >>>> >>> the regexp engine.  maybe i am wrong?
>> >> >>>> >>>
>> >> >>>> >>> anyway, for now, if the above is right, i need some way to copy
>> >> >>>> >>> m.match.  as far as i can tell string() doesn't help.  so what
>> >> >>>> >>> works?  or am i wrong?
>> >> >>>> >>>
>> >> >>>> >>> thanks,
>> >> >>>> >>> andrew
