Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Fri, 18 Nov 2011 10:12:03 -0800

By the way:

 locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags


This is defining two things:
   txt
   locs

txt is the text supplied in x, but preceded and followed by an empty set of
tags, and also ensuring that the very first character is not the start of a
tag.

The space may be unnecessary -- the only reason I am adding the space on
the front is so that I know whether or not the first block of text
delimited by tags is part of a tag or not.  (But I would have to re-think
the rest of the code to see if it could work based on the idea that the
start of the text is always a tag.)
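As a minimal sketch of how txt is assembled (the tag strings and the input text here are made up for illustration, and `text` stands in for the x argument):

```j
NB. Illustrative values only; not taken from the real log file.
tags=: > _2 <\ '<a>';'</a>';'<b>';'</b>'   NB. 2 by 2 table: start tag, end tag
text=: 'hi<a>one</a>'                       NB. plays the role of x
txt=: (' ',;tags),text,;tags                NB. space, empty tag set, text, empty tag set
$tags                                       NB. 2 2
txt                                         NB.  <a></a><b></b>hi<a>one</a><a></a><b></b>
```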

Meanwhile, I prepend and append an empty set of tags to avoid issues with
how I am identifying "sequences of tags".  One of your requirements was
that tags always appear in order, and if any are missing from that order we
get a blank result in the corresponding result position.  And I am using
J's diagonal function to find the subsequences.  And to ensure that found
subsequences are always complete, I start and end the text with complete
subsequences (otherwise, with certain data, I might
have incomplete diagonals...).
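For anyone unfamiliar with the diagonal function (monadic /. , "oblique"), here is a small illustration on a made-up 3 by 3 table rather than on the tag data:

```j
NB. Monadic /. applies its verb to each anti-diagonal of a table.
i. 3 3        NB. the table 0 1 2 , 3 4 5 ,: 6 7 8
</. i. 3 3    NB. five boxes: 0 ; 1 3 ; 2 4 6 ; 5 7 ; 8
```

The first and last diagonals are shorter than the middle one, which is the kind of incompleteness the padding with empty tag sets is meant to keep away from the real data.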

Anyways, once I have that, locs becomes the locations of the beginnings of
your start tags (left column) in txt and of the beginnings of your end tags
(right column) in txt.
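A small sketch of that search step on its own, using made-up tags and text (and without the padding or the }. used in the real line):

```j
NB. Illustrative values only.
tags=: > _2 <\ '<a>';'</a>';'<b>';'</b>'
txt=: 'hi<a>one</a>..<b>two</b>'
NB. E. marks where a tag starts, I. turns the marks into indices,
NB. and L:0 applies that to each boxed tag, keeping the shape of tags.
tags I.@E.L:0 txt   NB. 2 by 2 boxes: 2 and 8 in the first row, 14 and 20 in the second
```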

Except, of course, end tags are not just defined by their text, but also by
their location.  (Which makes me wonder if perhaps this code should instead
be structured to just have a single "end" delimiter instead of trying to
pretend that it's reasonable to have different ending tags for different
starting tags.)
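A rough sketch of what the single "end" delimiter idea might look like for one start tag. The verb name `between` and the sample text are hypothetical, and this is not a replacement for getTagsContents, which handles several tags at once:

```j
NB. Hypothetical helper: x is the text, y is (start tag);(end delimiter).
NB. Result: the text after the first start tag, up to (not including)
NB. the first end delimiter that follows it; empty if the start tag is absent.
between=: 4 :0
  'start stop'=. ,&.> y                             NB. ravel so one-character tags are lists
  rest=. (#start) }. ((start E. x) i. 1) }. x       NB. drop up to and including the start tag
  ((stop E. rest) i. 1) {. rest                     NB. keep up to the first end delimiter
)

sample=: 'aa STATUS=ok;bb RESULT=42;cc'   NB. made-up text, ';' as the end delimiter
sample between 'STATUS';';'               NB. =ok
sample between 'RESULT';';'               NB. =42
```

Later occurrences of the end delimiter are ignored simply because only the first one after the start tag is looked at.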

Finally, I should have replaced =: with =. in the function definition before
releasing it.  (=: makes some kinds of debugging easier but can cause
long-run problems).
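A small illustration of the difference, with made-up verb and noun names:

```j
NB. Hypothetical verbs; only the assignment differs.
leaky=: 3 :0
  tmp=: y + 1    NB. global: tmp is still defined after the call
  tmp * 2
)
tidy=: 3 :0
  tmp2=. y + 1   NB. local: tmp2 disappears when the verb returns
  tmp2 * 2
)
leaky 3          NB. 8
tidy 3           NB. 8
4!:0 'tmp';'tmp2'   NB. 0 _1  (tmp is a noun, tmp2 is undefined)
```

Being able to inspect tmp after the call is the debugging convenience; the same leftover name silently overwriting something else is the long-run problem.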


getTagsContents=: 4 :0
  'n m'=. $tags=. > _2 <\ y
  locs=.  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
  locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
  expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)


-- 
Raul




On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]> wrote:

> I think the problem we are having is that the closing crlftb tag string
> appears in many other places in the file, besides as a closing tag for the
> opening tags. There are many more crlftb strings in the text than there are
> opening tag strings.
>
> So the correct statement is that the opening tags are unique, and will
> always start a required text string. Closing tags are not necessarily
> unique, and will close the required strings,  as well as appear in other
> places in the file, which can be ignored. Only the first closing tag string
> that appears in the text following a unique opening tag is valid as the
> terminating tag for the text to be extracted.
>
> The function should find the opening tag, and then capture all of the text
> up to the first occurrence of the crlftb closing tag. It should ignore all
> subsequent crlftb tags until after it finds a unique opening tag, then
> again capture all of the text up to the first tag end string, which in this
> case will again be the crlftb string.
>
> It looks like the problem is in this line of the function:
>  locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
>
> But I haven't gotten my head around all that it is doing as yet.
>
> Here's the whole function, with the assert line commented out.
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>  locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
>  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
>  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
>  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Skip
>
> On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]>
> wrote:
>
> > Yes. As I said in my previous post, the assert statement has been
> > commented out. It would throw an error if the assert wasn't commented
> > out.
> > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> >
> >> Did you try just removing the assert?
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> >>
> >>> I stated that wrong.
> >>>
> >>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> ┌┬──────────────────────────────────────────────────────────┐
> >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> └┴──────────────────────────────────────────────────────────┘
> >>>
> >>> It doesn't find the first tag pair, and for some reason, it captures
> >>> *part of* the string *following* the first tag pair.
> >>>
> >>> Skip
> >>>
> >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]>
> >>> wrote:
> >>>
> >>> > Raul,
> >>> >
> >>> > In my application, the tag pairs will never overlap. Also, the
> >>> > leading tag of a particular string will always be unique. However,
> >>> > it is handy if I can define just the trailing tags of any tag pair
> >>> > to be all the same string. This won't always be the case, as
> >>> > sometimes the closing tag may be unique, so either case should work.
> >>> > Here's an example of some real data in my text log file. I just
> >>> > pulled a section of the text out of the middle of the log:
> >>> >
> >>> >    ww2
> >>> >
> >>> >     PROMPT_DURATION           = 1.968
> >>> >     STATUS                    = RECOGNITION
> >>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> >>> >     NUM_RESULTS               = 1
> >>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> >>> >     CONFIDENCE[0]
> >>> >
> >>> > Let's look at the data:
> >>> >    $ ww2
> >>> > 260
> >>> >    q: 260
> >>> > 2 2 5 13
> >>> >
> >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array (13 rows of 20)
> >>> >
> >>> >     13 20  $ a. i. ww2
> >>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
> >>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
> >>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
> >>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
> >>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
> >>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
> >>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
> >>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
> >>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
> >>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
> >>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
> >>> >
> >>> > You can see that each line is terminated with a 13, 10, 9 character
> >>> > string (CR, LF, TAB)
> >>> >
> >>> > we check:
> >>> >
> >>> >    a. i. crlftb
> >>> > 13 10 9
> >>> >
> >>> > Also the crlftb noun contains the three characters that terminate
> >>> > each line.
> >>> >
> >>> > I want to capture the row starting with 'STATUS' and the row starting
> >>> > with 'RESULT[0]'.
> >>> > Both rows terminate with the carriage return, line feed, tab sequence.
> >>> > I have commented the assert. statement out of your getTagsContents,
> >>> > so I no longer get the error.
> >>> >
> >>> > Now I run the function:
> >>> >
> >>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> > ┌┬──────────────────────────────────────────────────────────┐
> >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> > └┴──────────────────────────────────────────────────────────┘
> >>> >
> >>> > Weird. I get the line AFTER the one I want, and it completely misses
> >>> > the second tag pair.
> >>> > Any ideas what is going on?
> >>> >
> >>> > Skip
> >>> >
> >>> >
> >>> >
> >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> >>> >
> >>> >> P.S. Brian Schott has observed that if you take out the assertion
> >>> >> that is checking for balanced tags, the code works on the crlftab
> >>> >> example.
> >>> >>
> >>> >> This is because my code also has an assumption that tags cannot
> >>> >> overlap.
> >>> >>
> >>> >> I have not thought through what this would mean for other cases that
> >>> >> would currently be rejected.
> >>>
> >>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>