Re: [Jprogramming] Finding multiple sequential strings

Skip Cave Thu, 17 Nov 2011 15:55:52 -0800

I think the problem we are having is that the closing crlftb tag string
appears in many other places in the file, besides as a closing tag for the
opening tags. There are many more crlftb strings in the text than there are
opening tag strings.


So the correct statement is that the opening tags are unique and always
start a required text string. Closing tags are not necessarily unique: they
close the required strings, but they also appear elsewhere in the file,
where they can be ignored. Only the first closing tag string that follows a
unique opening tag is valid as the terminating tag for the text to be
extracted.

The function should find the opening tag, and then capture all of the text
up to the first occurrence of the crlftb closing tag. It should ignore all
subsequent crlftb tags until after it finds a unique opening tag, then
again capture all of the text up to the first tag end string, which in this
case will again be the crlftb string.
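
To pin down the behavior I am describing, here is a minimal sketch in J. It
is not a fix for the function below and does not use its approach. The names
firstBetween and getFirsts are made up for illustration, and the sketch
assumes the tags are given as a flat boxed list of open;close pairs.

```j
NB. Hypothetical helper: text between the first occurrence of the opening
NB. tag and the first closing tag that follows it.
NB. x is the text; y is a 2-item boxed list: opening tag ; closing tag
firstBetween=: 4 : 0
 'otag ctag'=. y
 i=. (otag E. x) i. 1              NB. index of the opening tag, or #x if absent
 if. i = #x do. '' return. end.    NB. no opening tag: nothing to extract
 rest=. (i + #otag) }. x           NB. text after the opening tag
 (1 i.~ ctag E. rest) {. rest      NB. keep up to the first closing tag
)

NB. Hypothetical wrapper: y is open1 ; close1 ; open2 ; close2 ...
NB. Each pair of tags is searched independently; the result is boxed.
getFirsts=: 4 : 0
 _2 (<@(x&firstBetween))\ y
)

   txt=: 'A=1//B=2//C=3//'
   txt getFirsts 'A=';'//';'C=';'//'
┌─┬─┐
│1│3│
└─┴─┘
```

The closing tag '//' after B=2 is ignored because no opening tag precedes
it. If a closing tag never appears after an opening tag, this sketch returns
everything to the end of the text.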

It looks like the problem is in this line of the function:
 locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags

But I haven't yet gotten my head around everything it is doing.

Here's the whole function, with the assert line commented out.

getTagsContents=: 4 :0
 'n m'=. $tags=. > _2 <\ y
 locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
 NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
 data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
 expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
 }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)

Skip

On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]> wrote:

> Yes. As I said in my previous post, the assert statement has been
> commented out. It would throw an error, if the assert wasn't commented out.
> On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
>
>> Did you try just removing the assert?
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
>>
>>> I stated that wrong.
>>>
>>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
>>> ┌┬──────────────────────────────────────────────────────────┐
>>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
>>> └┴──────────────────────────────────────────────────────────┘
>>>
>>> It doesn't find the first tag pair, and for some reason, it captures
>>> *part of* the string *following* the first tag pair.
>>>
>>> Skip
>>>
>>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
>>>
>>> > Raul,
>>> >
>>> > In my application, the tag pairs will never overlap. Also, the leading
>>> > tag of a particular string will always be unique. However, it is handy
>>> > if I can define just the trailing tags of any tag pair to be all the
>>> > same string. This won't always be the case, as sometimes the closing
>>> > tag may be unique, so either case should work. Here's an example of
>>> > some real data in my text log file. I just pulled a section of the
>>> > text out of the middle of the log:
>>> >
>>> >    ww2
>>> >
>>> >     PROMPT_DURATION           = 1.968
>>> >     STATUS                    = RECOGNITION
>>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
>>> >     NUM_RESULTS               = 1
>>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
>>> >     CONFIDENCE[0]
>>> >
>>> > Let's look at the data:
>>> >    $ ww2
>>> > 260
>>> >    q: 260
>>> > 2 2 5 13
>>> >
>>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array
>>> >
>>> >     13 20  $ a. i. ww2
>>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
>>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
>>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
>>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
>>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
>>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
>>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
>>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
>>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
>>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
>>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
>>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
>>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
>>> >
>>> > You can see that each line is terminated with a 13, 10, 9 character
>>> > string (CR, LF, TAB).
>>> >
>>> > We check:
>>> >
>>> >    a. i. crlftb
>>> > 13 10 9
>>> >
>>> > Also the crlftb noun contains the three characters that terminate each
>>> > line.
>>> >
>>> > I want to capture the row starting with 'STATUS' and the row starting
>>> > with 'RESULT[0]'.
>>> > Both rows terminate with the carriage return, line feed, tab sequence.
>>> > I have commented the assert. statement out of your getTagsContents, so
>>> > I no longer get the error.
>>> >
>>> > Now I run the function:
>>> >
>>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
>>> > ┌┬──────────────────────────────────────────────────────────┐
>>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
>>> > └┴──────────────────────────────────────────────────────────┘
>>> >
>>> > Weird. I get the line AFTER the one I want, and it completely misses
>>> > the second tag pair.
>>> > Any ideas what is going on?
>>> >
>>> > Skip
>>> >
>>> >
>>> >
>>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
>>> >
>>> >> P.S. Brian Schott has observed that if you take out the assertion that
>>> >> checks for balanced tags, the code works on the crlftab example.
>>> >>
>>> >> This is because my code also has an assumption that tags cannot
>>> >> overlap.
>>> >>
>>> >> I have not thought through what this would mean for other cases that
>>> >> would currently be rejected.
>>>
>>>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
