By the way:

   locs=: tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags

This is defining two things:

   txt
   locs

txt is the text supplied in x, but preceded and followed by an empty
set of tags (all of the tags run together, with nothing between them),
and with a leading space ensuring that the very first character is not
the start of a tag.

The space may be unnecessary -- the only reason I am adding the space
on the front is so that I know whether or not the first block of text
delimited by tags is part of a tag or not. (But I would have to
re-think the rest of the code to see if it could work based on the
idea that the start of the text is always a tag.)

Meanwhile, I prepend and append an empty set of tags to avoid issues
with how I am identifying "sequences of tags". One of your
requirements was that tags always appear in order, and if any are
missing from that order we get a blank result in the corresponding
result position. And I am using J's diagonal function (/.) to find the
subsequences. And to ensure that found subsequences are always
complete, I start and end the text with complete subsequences
(otherwise, with certain data, I might have incomplete diagonals...).

Anyways, once I have that, locs becomes the locations of the
beginnings of your start tags (left column) in txt and of the
beginnings of your end tags (right column) in txt.

Except, of course, end tags are not just defined by their text, but
also by their location. (Which makes me wonder if perhaps this code
should instead be structured to just have a single "end" delimiter
instead of trying to pretend that it's reasonable to have different
ending tags for different starting tags.)

Finally, I should have replaced =: with =. in the function definition
before releasing it. (=: makes some kinds of debugging easier, since
the intermediate results stay defined globally, but can cause long-run
problems).

getTagsContents=: 4 :0
  'n m'=. $tags=. > _2 <\ y
  locs=. tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
  locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
  NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
  data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
  expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)

--
Raul

On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]> wrote:
> I think the problem we are having is that the closing crlftb tag
> string appears in many other places in the file, besides as a closing
> tag for the opening tags. There are many more crlftb strings in the
> text than there are opening tag strings.
>
> So the correct statement is that the opening tags are unique, and
> will always start a required text string. Closing tags are not
> necessarily unique, and will close the required strings, as well as
> appear in other places in the file, which can be ignored. Only the
> first closing tag string that appears in the text following a unique
> opening tag is valid as the terminating tag for the text to be
> extracted.
>
> The function should find the opening tag, and then capture all of the
> text up to the first occurrence of the crlftb closing tag. It should
> ignore all subsequent crlftb tags until after it finds a unique
> opening tag, then again capture all of the text up to the first tag
> end string, which in this case will again be the crlftb string.
>
> It looks like the problem is in this line of the function:
>
>    locs=: tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
>
> But I haven't gotten my head around all that it is doing as yet.
>
> Here's the whole function, with the assert line commented out.
>
> getTagsContents=: 4 :0
>   'n m'=. $tags=. > _2 <\ y
>   locs=: tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
>   NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
>   data=: _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
>   expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
>   }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Skip
>
> On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]> wrote:
>
> > Yes. As I said in my previous post, the assert statement has been
> > commented out.
> > It would throw an error if the assert wasn't commented out.
> >
> > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> >
> >> Did you try just removing the assert?
> >>
> >> Thanks,
> >>
> >> --
> >> Raul
> >>
> >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> >>
> >>> I stated that wrong.
> >>>
> >>>    ( ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> ┌┬──────────────────────────────────────────────────────────┐
> >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> └┴──────────────────────────────────────────────────────────┘
> >>>
> >>> It doesn't find the first tag pair, and for some reason, it
> >>> captures *part of* the string *following* the first tag pair.
> >>>
> >>> Skip
> >>>
> >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
> >>>
> >>> > Raul,
> >>> >
> >>> > In my application, the tag pairs will never overlap. Also, the
> >>> > leading tag of a particular string will always be unique.
> >>> > However, it is handy if I can define just the trailing tags of
> >>> > any tag pair to be all the same string. This won't always be the
> >>> > case, as sometimes the closing tag may be unique, so either case
> >>> > should work. Here's an example of some real data in my text log
> >>> > file. I just pulled a section of the text out of the middle of
> >>> > the log:
> >>> >
> >>> >    ww2
> >>> >
> >>> >         PROMPT_DURATION           = 1.968
> >>> >         STATUS                    = RECOGNITION
> >>> >         SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> >>> >         NUM_RESULTS               = 1
> >>> >         RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> >>> >         CONFIDENCE[0]
> >>> >
> >>> > Let's look at the data:
> >>> >
> >>> >    $ ww2
> >>> > 260
> >>> >    q: 260
> >>> > 2 2 5 13
> >>> >
> >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array
> >>> >
> >>> >    13 20 $ a. i. ww2
> >>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
> >>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
> >>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
> >>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
> >>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
> >>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
> >>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
> >>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
> >>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
> >>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
> >>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
> >>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
> >>> >
> >>> > You can see that each line is terminated with a 13, 10, 9
> >>> > character string (CR, LF, TAB)
> >>> >
> >>> > we check:
> >>> >
> >>> >    a. i. crlftb
> >>> > 13 10 9
> >>> >
> >>> > Also the crlftb noun contains the three characters that terminate
> >>> > each line.
> >>> >
> >>> > I want to capture the row starting with 'STATUS' and the row
> >>> > starting with 'RESULT[0]'. Both rows terminate with the carriage
> >>> > return, line feed, tab sequence.
> >>> > I have commented the assert. statement out of your
> >>> > getTagsContents, so I no longer get the error:
> >>> >
> >>> > Now I run the function:
> >>> >
> >>> >    ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> >>> > ┌┬──────────────────────────────────────────────────────────┐
> >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> >>> > └┴──────────────────────────────────────────────────────────┘
> >>> >
> >>> > Weird. I get the line AFTER the one I want, and it completely
> >>> > misses the second tag pair.
> >>> > Any ideas what is going on?
> >>> >
> >>> > Skip
> >>> >
> >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> >>> >
> >>> >> P.S. Brian Schott has observed that if you take out the
> >>> >> assertion that is checking for balanced tags, the code works on
> >>> >> the crlftab example.
> >>> >>
> >>> >> This is because my code also has an assumption that tags cannot
> >>> >> overlap.
> >>> >>
> >>> >> I have not thought through what this would mean for other cases
> >>> >> that would currently be rejected.
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
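
For reference, the behaviour Skip asks for in the thread above (find a unique opening tag, capture everything up to the first closing tag that follows it, and ignore every other occurrence of the closing string) can be sketched for a single tag pair roughly as follows. This is only an illustration under stated assumptions, not Raul's getTagsContents: the name firstBetween and the sample text are made up for the example, the opening tag is assumed to occur at most once, and crlftb is rebuilt here from the byte values 13 10 9 shown in the thread.

```j
NB. Hypothetical helper for illustration; not part of getTagsContents.
NB. x: text to search    y: boxed pair  open;close
firstBetween=: 4 : 0
  'open close'=. y
  NB. drop everything up to and including the first opening tag
  rest=. ((#open) + (open E. x) i. 1) }. x
  NB. keep what precedes the first closing tag (all of rest if none is found)
  ((close E. rest) i. 1) {. rest
)

NB. CR, LF, TAB, matching  a. i. crlftb  in the thread
crlftb=: 13 10 9 { a.

NB. made-up sample with closing strings that do not follow an opening tag
sample=: crlftb,'A = 1',crlftb,'STATUS = RECOGNITION',crlftb,'B = 2',crlftb,'RESULT[0] = dtmf-9',crlftb

sample firstBetween 'STATUS';crlftb       NB. ' = RECOGNITION'
sample firstBetween 'RESULT[0]';crlftb    NB. ' = dtmf-9'
```

Searching for the closing string only in the text that remains after the opening tag is what lets the extra crlftb occurrences be ignored. If the opening tag is absent, the sketch returns an empty string, which would correspond to the blank result position mentioned in the thread.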
