What about this?

| line matcher exps ranges negateRanges result lastStart |
line := 'A política no Brasil está complicada #FAIL porque a corrupção impera #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país ao #CAOS'.
matcher := '(#\w+)' asRegex.
exps := matcher matchesIn: line.
"I believe you can extend the matcher to give the subexpressions and ranges at the same time."
ranges := matcher matchingRangesIn: line.
"Collect the intervals between the matches, i.e. the text that is not a hashtag."
negateRanges := OrderedCollection new.
lastStart := ranges inject: 1 into: [ :start :interval |
	negateRanges add: (Interval from: start to: interval first - 1).
	interval last + 1 ].
"Also keep whatever follows the last match."
negateRanges add: (Interval from: lastStart to: line size).
result := negateRanges inject: String new into: [ :s :interval |
	s , (line copyFrom: interval first to: interval last) ].
Array with: exps with: ranges with: negateRanges with: result

On Tue, Aug 9, 2016 at 12:53 PM, Casimiro - GMAIL <casimiro.barr...@gmail.com> wrote:

> On 08-08-2016 19:25, Bernardo Ezequiel Contreras wrote:
>
> Hi,
>
> have you tried (World>>Help>>Help Browser>>Regular Expressions
> Framework>>Usage)?
>
> SUBEXPRESSION MATCHES
>
> After a successful match attempt, you can query the specifics of which
> part of the original string has matched which part of the whole
> expression.
>
> (...)
>
> Thanks, but the thing is: my need is a little more complex than finding
> sequences. I'm looking for expressions in natural language text. The
> expressions must be extracted without ambiguities, so I have cases for
> occurrences at the beginning of the line (aka '^(#\w+)([\s.,;\:!?]*)'), in
> the middle of the line (aka '([\s.,;\:!?]+)(#\w+)([\s.,;\:!?]+)'), or at
> the end (which may be simplified to the second case...). So, if I find
> several hashtags in a text like:
>
> 'A política no Brasil está complicada #FAIL porque a corrupção impera
> #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país
> ao #CAOS'
>
> I want two things:
>
> 1st and obvious: #( '#FAIL' '#CRIME' '#PETRALHAS' '#CAOS')
> 2nd: the line minus hashtags: 'A política no Brasil está complicada porque
> a corrupção impera. De qualquer forma os, que tudo justificam, levam o país
> ao'
>
> When I use regexps to process the line, for instance:
>
> bfr := line copyWithRegex: '#\w+' matchesTranslatedUsing: [ :e | '' ].
>
> I can run into trouble, because it will also extract things like
> #ANOTAÇÃO#, which is not a hashtag but will match.
>
> And I'm trying to avoid doing the Lex/Yacc thing here :D
>
> Best regards,
>
> CdAB

--
Bernardo E.C.

Sent from a cheap desktop computer in South America.
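
A possible refinement for the #ANOTAÇÃO# case raised in the quoted message, offered only as a sketch: since the matcher already reports ranges, each candidate can be accepted or rejected by looking at the characters just outside its range. The delimiter set below is taken from the character class in the quoted patterns; the shortened sample line and the names isBoundary, tags, and stripped are made up for illustration and are not part of the Regex package.

```smalltalk
| line delimiters isBoundary ranges tags stripped start |
"Hypothetical sample line containing one false hashtag."
line := 'Notas #FAIL sobre #ANOTAÇÃO# e o #CAOS.'.

"Assumption: a hashtag only counts when both neighbours are a delimiter
 or an end of the line."
delimiters := ' .,;:!?'.
isBoundary := [ :index |
	index < 1 or: [
		index > line size or: [ delimiters includes: (line at: index) ] ] ].

"Find every '#\w+' candidate, then keep only the properly delimited ones."
ranges := ('#\w+' asRegex matchingRangesIn: line) select: [ :range |
	(isBoundary value: range first - 1)
		and: [ isBoundary value: range last + 1 ] ].
tags := ranges collect: [ :range | line copyFrom: range first to: range last ].

"Copy the text between the accepted hashtags, including the tail after
 the last one."
start := 1.
stripped := String streamContents: [ :stream |
	ranges do: [ :range |
		stream nextPutAll: (line copyFrom: start to: range first - 1).
		start := range last + 1 ].
	stream nextPutAll: (line copyFrom: start to: line size) ].

Array with: tags asArray with: stripped
```

With this sample, #ANOTAÇÃO# should be left in the text, because the character right after the candidate match is not a delimiter. The spaces that surrounded each removed hashtag are left in place, so the stripped line may still need its whitespace collapsed afterwards.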