What about this?

| line matcher exps ranges negateRanges tail result |
line :=
'A política no Brasil está complicada #FAIL porque a corrupção impera
#CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país
ao #CAOS'.
matcher := '(#\w+)' asRegex.
"All hashtags found in the line."
exps := matcher matchesIn: line.
"I believe you can extend the matcher to give the subexpressions
and ranges at the same time."
ranges := matcher matchingRangesIn: line.
"Collect the gaps between matches; the block answers where the next gap starts."
negateRanges := OrderedCollection new.
tail := ranges inject: 1 into: [ :start :interval |
	negateRanges add: (Interval from: start to: interval first - 1).
	interval last + 1 ].
"Keep the text after the last hashtag, if there is any."
tail <= line size
	ifTrue: [ negateRanges add: (Interval from: tail to: line size) ].
"Glue the gaps back together to get the line without hashtags."
result := negateRanges inject: String new into: [ :s :interval |
	s , (line copyFrom: interval first to: interval last) ].

Array with: exps
with: ranges
with: negateRanges
with: result
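
To skip things like #ANOTAÇÃO#, one option might be to keep only the ranges that have a separator (or the line boundary) on both sides. This is only a rough sketch: isBoundary, tagRanges and tags are names I made up for illustration, the separator set is copied from your patterns, and the sample line is a simplified ASCII one.

```smalltalk
| line matcher ranges isBoundary tagRanges tags |
line := 'nota #ANOTACAO# e #FAIL aqui, #CAOS'.
matcher := '#\w+' asRegex.
ranges := matcher matchingRangesIn: line.
"Hypothetical helper: true when the index falls outside the line,
or when the character at that index is whitespace or punctuation."
isBoundary := [ :index |
	(index between: 1 and: line size) not
		or: [ (line at: index) isSeparator
			or: [ '.,;:!?' includes: (line at: index) ] ] ].
"Keep a match only if both of its neighbours are boundaries."
tagRanges := ranges select: [ :r |
	(isBoundary value: r first - 1) and: [ isBoundary value: r last + 1 ] ].
tags := tagRanges collect: [ :r | line copyFrom: r first to: r last ].
```

The filtered tagRanges could then replace ranges in the code above, so that only real hashtags are cut out of the line.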


On Tue, Aug 9, 2016 at 12:53 PM, Casimiro - GMAIL <
casimiro.barr...@gmail.com> wrote:

> On 08-08-2016 19:25, Bernardo Ezequiel Contreras wrote:
>
> Hi,
>
>   have you tried (World>>Help>>Help Browser>>Regular Expressions
> Framework>>Usage)?
>
> SUBEXPRESSION MATCHES
>
> After a successful match attempt, you can query the specifics of which
> part of the original string has matched which part of the whole
> expression.
>
> (...)
>
> Thanks, but the thing is, my need is a little more complex than finding
> sequences. I'm looking for expressions in natural language text. The
> expressions must be extracted without ambiguities, so I have cases for
> occurrences at the beginning of the line (i.e. '^(#\w+)([\s.,;\:!?]*)'), in
> the middle of the line (i.e. '([\s.,;\:!?]+)(#\w+)([\s.,;\:!?]+)'), or at the
> end (which may be simplified to the second case...). So, if I find several
> hashtags in a text like:
>
> 'A política no Brasil está complicada #FAIL porque a corrupção impera
> #CRIME. De qualquer forma os #PETRALHAS, que tudo justificam, levam o país
> ao #CAOS'
>
> I want two things:
>
> 1st and obvious: #( '#FAIL' '#CRIME' '#PETRALHAS' '#CAOS')
> 2nd: the line minus hashtags: 'A política no Brasil está complicada porque
> a corrupção impera. De qualquer forma os, que tudo justificam, levam o país
> ao'
>
> When I use regexps to process the line, for instance:
>
> bfr := line copyWithRegex: '#\w+' matchesReplacedUsing: [ :e | '' ].
>
> I can run into trouble because it will also extract things like #ANOTAÇÃO#,
> which is not a hashtag but will match.
>
> And I'm trying to avoid doing the Lex/Yacc thing here :D
>
> Best regards,
>
> CdAB



-- 
Bernardo E.C.

Sent from a cheap desktop computer in South America.
