Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Fri, 18 Nov 2011 08:57:05 -0800

I think this will fix up locs, getting rid of the irrelevant extras:

  locs=: (-@#@[ {. I. {./. ])&.>/\"1 locs


Note, however, that this will be bad if you have start "tags" without end
"tags".

And note also that the final line in ww2 did not end with CR,LF,TAB.
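
A quick way to check both of those points at the session prompt (a sketch
only, assuming ww2 is defined as in the quoted message below; 'STATUS' is
just one of the start tags from that example, and the counts only hint at
unmatched tags rather than prove anything):

   crlftb=: 13 10 9 { a.                     NB. CR, LF, TAB
   crlftb -: (-#crlftb) {. ww2               NB. 1 if ww2 ends with the end tag, else 0
   (+/ 'STATUS' E. ww2) , +/ crlftb E. ww2   NB. count of one start tag, then of the end tag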

-- 
Raul

On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]> wrote:

> I think the problem we are having is that the closing crlftb tag string
> appears in many other places in the file, besides as a closing tag for the
> opening tags. There are many more crlftb strings in the text than there are
> opening tag strings.
>
> So the correct statement is that the opening tags are unique, and will
> always start a required text string. Closing tags are not necessarily
> unique, and will close the required strings, as well as appear in other
> places in the file, which can be ignored. Only the first closing tag string
> that appears in the text following a unique opening tag is valid as the
> terminating tag for the text to be extracted.
>
> The function should find the opening tag, and then capture all of the text
> up to the first occurrence of the crlftb closing tag. It should ignore all
> subsequent crlftb tags until after it finds a unique opening tag, then
> again capture all of the text up to the first tag end string, which in this
> case will again be the crlftb string.
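>
> A minimal sketch of that rule, separate from getTagsContents (firstClose is
> a hypothetical name, and it assumes a single opening tag as y, the text as
> x, and crlftb as the only closing tag):
>
>    crlftb=: 13 10 9 { a.              NB. CR, LF, TAB
>    firstClose=: 4 : 0
>      b=. (#y) + I. y E. x             NB. index just past each opening tag
>      e=. (I. crlftb E. x) , #x        NB. closing tag positions; end of text as fallback
>      n=. (e I. b) { e                 NB. first closing position at or after each b
>      b <@(x {~ [ + i.@])"0 n - b      NB. box the text between each pair
>    )
>
> It would be called as   ww2 firstClose 'STATUS'   and any crlftb that does
> not directly follow an opening tag is simply never selected.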
>
> It looks like the problem is in this line of the function:
>  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
>
> But I haven't gotten my head around all that it is doing as yet.
>
> Here's the whole function, with the assert line commented out.
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
>  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
>  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
>  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Skip
>
> On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]> wrote:
>
> > Yes. As I said in my previous post, the assert statement has been
> > commented out. It would throw an error if the assert wasn't commented out.
> > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> >
> >> Did you try just removing the assert?
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> >>
> >>> I stated that wrong.
> >>>
> >>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> ┌┬──────────────────────────────────────────────────────────┐
> >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> └┴──────────────────────────────────────────────────────────┘
> >>>
> >>> It doesn't find the first tag pair, and for some reason, it captures
> >>> *part of* the string *following* the first tag pair.
> >>>
> >>> Skip
> >>>
> >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
> >>>
> >>> > Raul,
> >>> >
> >>> > In my application, the tag pairs will never overlap. Also, the leading
> >>> > tag of a particular string will always be unique. However, it is handy
> >>> > if I can define the trailing tags of all tag pairs to be the same
> >>> > string. This won't always be the case, as sometimes the closing tag may
> >>> > be unique, so either case should work. Here's an example of some real
> >>> > data in my text log file. I just pulled a section of the text out of
> >>> > the middle of the log:
> >>> >
> >>> >    ww2
> >>> >
> >>> >     PROMPT_DURATION           = 1.968
> >>> >     STATUS                    = RECOGNITION
> >>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> >>> >     NUM_RESULTS               = 1
> >>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> >>> >     CONFIDENCE[0]
> >>> >
> >>> > Let's look at the data:
> >>> >    $ ww2
> >>> > 260
> >>> >    q: 260
> >>> > 2 2 5 13
> >>> >
> >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array
> >>> >
> >>> >     13 20  $ a. i. ww2
> >>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
> >>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
> >>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
> >>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
> >>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
> >>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
> >>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
> >>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
> >>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
> >>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
> >>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
> >>> >
> >>> > You can see that each line is terminated with a 13, 10, 9 character
> >>> > string (CR, LF, TAB).
> >>> >
> >>> > We check:
> >>> >
> >>> >    a. i. crlftb
> >>> > 13 10 9
> >>> >
> >>> > The crlftb noun contains the three characters that terminate each line.
> >>> >
> >>> > I want to capture the row starting with 'STATUS' and the row starting
> >>> > with 'RESULT[0]'. Both rows terminate with the carriage return, line
> >>> > feed, tab sequence. I have commented the assert. statement out of your
> >>> > getTagsContents, so I no longer get the error.
> >>> >
> >>> > Now I run the function:
> >>> >
> >>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> > ┌┬──────────────────────────────────────────────────────────┐
> >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> > └┴──────────────────────────────────────────────────────────┘
> >>> >
> >>> > Weird. I get the line AFTER the one I want, and it completely misses
> >>> > the second tag pair.
> >>> > Any ideas what is going on?
> >>> >
> >>> > Skip
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> >>> >
> >>> >> P.S. Brian Schott has observed that if you take out the assertion
> >>> >> that is checking for balanced tags, the code works on the crlftab
> >>> >> example.
> >>> >>
> >>> >> This is because my code also has an assumption that tags cannot
> >>> >> overlap.
> >>> >>
> >>> >> I have not thought through what this would mean for other cases that
> >>> >> would currently be rejected.
> >>>
> >>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
