I think the problem we are having is that the crlftb closing-tag string also appears in many other places in the file, not only as the closing tag for an opening tag. There are many more crlftb strings in the text than there are opening tags.
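To make the intended behavior concrete, here is a minimal explicit-loop sketch in J. It is not a repair of getTagsContents and not Raul's approach; the names firstClose, otag, ctag and the toy text in the usage line are made up for illustration. It assumes each opening tag occurs at most once in the text, and it simply takes everything after an opening tag up to the first closing string that follows it.

```j
NB. firstClose: hypothetical helper, for illustration only
NB. x: the text;  y: boxed list  open;close;open;close;...
NB. result: one box per tag pair, holding the text between the opening tag
NB.         and the FIRST closing string after it (empty box if no opening tag)
firstClose=: 4 : 0
 r=. 0$a:                      NB. start with an empty boxed list
 for_p. _2 ]\ y do.            NB. one row per open;close pair
   'otag ctag'=. p
   i=. (otag E. x) i. 1        NB. index of the opening tag, or #x if absent
   if. i < #x do.
     rest=. (i + #otag) }. x   NB. text after the opening tag
     j=. (ctag E. rest) i. 1   NB. first closing string; #rest if none
     r=. r , < j {. rest       NB. later closing strings are never looked at
   else.
     r=. r , a:
   end.
 end.
 r
)
```

For example, '<a>one||two||<b>three||four||' firstClose '<a>';'||';'<b>';'||' should give the two boxed strings 'one' and 'three': the second and fourth '||' are ignored because no opening tag precedes them. On the log data the call would presumably look like ww2 firstClose 'STATUS';crlftb;'RESULT[0]';crlftb, with the captured text still including the padding and the '=' that follow each tag name.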
So the correct statement is that the opening tags are unique, and will always start a required text string. Closing tags are not necessarily unique, and will close the required strings, as well as appear in other places in the file, which can be ignored. Only the first closing tag string that appears in the text following a unique opening tag is valid as the terminating tag for the text to be extracted.

The function should find the opening tag, and then capture all of the text up to the first occurrence of the crlftb closing tag. It should ignore all subsequent crlftb tags until after it finds a unique opening tag, then again capture all of the text up to the first tag end string, which in this case will again be the crlftb string.

It looks like the problem is in this line of the function:

   locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags

But I haven't gotten my head around all that it is doing as yet. Here's the whole function, with the assert line commented out.

getTagsContents=: 4 :0
  'n m'=. $tags=. > _2 <\ y
  locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
  NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
  data=: _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)

Skip

On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]> wrote:

> Yes. As I said in my previous post, the assert statement has been
> commented out. It would throw an error, if the assert wasn't commented out.
> On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
>
>> Did you try just removing the assert?
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
>>
>>> I stated that wrong.
>>>
>>>    ( ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
>>> ┌┬──────────────────────────────────────────────────────────┐
>>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
>>> └┴──────────────────────────────────────────────────────────┘
>>>
>>> It doesn't find the first tag pair, and for some reason, it captures *part
>>> of* the string *following* the first tag pair.
>>>
>>> Skip
>>>
>>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
>>>
>>> > Raul,
>>> >
>>> > In my application, the tag pairs will never overlap. Also, the leading tag
>>> > of a particular string will always be unique. However, it is handy if I can
>>> > define just the trailing tags of any tag pair to be all the same string.
>>> > This won't always be the case, as sometimes the closing tag may be unique,
>>> > so either case should work. Here's an example of some real data in my text
>>> > log file. I just pulled a section of the text out of the middle of the log:
>>> >
>>> >    ww2
>>> >
>>> > PROMPT_DURATION           = 1.968
>>> > STATUS                    = RECOGNITION
>>> > SERVER_HOSTNAME           = Unknown <connected via resource mgr>
>>> > NUM_RESULTS               = 1
>>> > RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
>>> > CONFIDENCE[0]
>>> >
>>> > Let's look at the data:
>>> >    $ ww2
>>> > 260
>>> >    q: 260
>>> > 2 2 5 13
>>> >
>>> > So 2*2*5 = 20 and ww2 will fit in a 20 x 13 array
>>> >
>>> >    13 20 $ a. i. ww2
>>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
>>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
>>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
>>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
>>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
>>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
>>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
>>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
>>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
>>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
>>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
>>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
>>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
>>> >
>>> > You can see that each line is terminated with a 13, 10, 9 character string
>>> > (CR, LF, TAB)
>>> >
>>> > we check:
>>> >
>>> >    a. i. crlftb
>>> > 13 10 9
>>> >
>>> > Also the crlftb noun contains the three characters that terminate each
>>> > line.
>>> >
>>> > I want to capture the row starting with 'STATUS' and the row starting with
>>> > 'RESULT[0]',
>>> > Both rows terminate with the carriage return, line feed, tab sequence.
>>> > I have commented the assert. statement out of your getTagsContent, so I no
>>> > longer get the error:
>>> >
>>> > Now I run the function:
>>> >
>>> >    ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
>>> > ┌┬──────────────────────────────────────────────────────────┐
>>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
>>> > └┴──────────────────────────────────────────────────────────┘
>>> >
>>> > Weird. I get the line AFTER the one I want, and it completely misses the
>>> > second tag pair.
>>> > Any ideas what is going on?
>>> >
>>> > Skip
>>> >
>>> >
>>> >
>>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
>>> >
>>> >> P.S. Brian Schott has observed that if you take out the assertion that is
>>> >> checking for balanced tags that works on the crlftab example.
>>> >>
>>> >> This is because my code also has an assumption that tags cannot overlap.
>>> >>
>>> >> I have not thought through what this would mean for other cases that would
>>> >> currently be rejected.
>>>
>>>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
