getTagsContents=: 4 :0
'n m'=. $tags=. > _2 <\ y
locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
locs=: (-@#@[ {. I. {./. ])&.>/\"1 locs
assert. -:&/:&;/ |:locs NB. tags must be balanced
data=: _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
}: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)
tags1 =: ('param1'; 'crlftb' ; 'param2'; 'crlftb' ; 'param3' ; 'crlftb'
;'param5' ; 'crlftb' )
(and textfile3 from Skip's message, below):
textfile3 getTagsContents tags1
┌────────────┬────────┬───────────────┬─────────────┐
│ = 12345    │ = NONE │ = hello world │             │
├────────────┼────────┼───────────────┼─────────────┤
│            │        │ = 34567       │ = hello bob │
├────────────┼────────┼───────────────┼─────────────┤
│ - zero one │        │               │ = two three │
├────────────┼────────┼───────────────┼─────────────┤
│ = 6789     │ = SOME │               │             │
└────────────┴────────┴───────────────┴─────────────┘
Note that I wrote it so that the flat text is the left argument and the
tags are the right argument.
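
In case it helps, here is a minimal sketch of how the right argument gets
paired up, using the first line of the function on a shortened tag list.
The name tagsDemo and its values are made up for illustration only:

```j
NB. tagsDemo is a hypothetical, shortened tag list (not from the thread)
tagsDemo=: 'param1';'crlftb';'param2';'crlftb'
pairs=: > _2 <\ tagsDemo   NB. box non-overlapping pairs, then open into a table
$ pairs                    NB. 2 2: one row per (start tag, end tag) pair
{."1 pairs                 NB. first column: the start tags
```

So the tags go on the right and the text on the left, as in
textfile3 getTagsContents tags1.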
FYI,
--
Raul
On Sat, Nov 19, 2011 at 1:15 AM, Skip Cave <[email protected]> wrote:
> OK, here's Raul's function:
>
> getTagsContents=: 4 :0
> 'n m'=. $tags=. > _2 <\ y
> locs=. tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
> NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
> data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
> expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Here's my data:
>
> textfile3=: 0 : 0
> Start of event 1 crlftb
> Some text crlftb
> Some more text crlftb
> param1 = 12345 crlftb
> param2 = NONE crlftb
> some comments crlftb
> param3 = hello world crlftb
> more comments crlftb
> param4 = 120.45 crlftb
> param6 = Test y crlftb
> End of stuff crlftb
> Start of event 2 crlftb
> Some text crlftb
> Some different text crlftb
> param1 = 34567 crlftb
> param3 = hello bob crlftb
> param4 = 32.89 crlftb
> comments and more comments crlftb
> param5 - zero one crlftb
> param 6 = Test z crlftb
> Second end crlftb
> Start of event 3 crlftb
> param5 = two three crlftb
> Start of event 4 crlftb
> stuff crlftb
> param1 = 6789 crlftb
> param2 = SOME crlftb
> end crlftb
> )
>
> $ textfile3
> 642
>
> I put the crlftb text on the end of each line, to show the equivalent of
> the invisible characters that are actually in the real data.
>
> Here is the string of tag pairs:
>
> tags1 =: ('param1'; 'crlftb' ; 'param2'; 'crlftb' ; 'param3' ; 'crlftb' ;
> 'param5' ; 'crlftb' )
> tags1
> ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
> │param1│crlftb│param2│crlftb│param3│crlftb│param5│crlftb│
> └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
>
> so now we try Raul's function:
>
> txt5 =: tags1 getTagsContents textfile3
> |domain error: getTagsContents
> | locs=.tags [email protected]:0}.txt=.(' ',;tags),x ,;tags
>
> Nope. What we should get is:
>
> ┌─────┬────┬───────────┬─────────┐
> │12345│NONE│hello world│         │
> ├─────┼────┼───────────┼─────────┤
> │34567│    │hello bob  │zero one │
> ├─────┼────┼───────────┼─────────┤
> │     │    │           │two three│
> ├─────┼────┼───────────┼─────────┤
> │6789 │SOME│           │         │
> └─────┴────┴───────────┴─────────┘
>
> I should get a 4 x 4 boxed array with 12345, NONE, hello world, and an
> empty box in the first row of boxes.
> The second row will have boxes containing 34567, an empty box, hello bob,
> and zero one.
> The third row will have three empty boxes, and the fourth box will have
> 'two three' in it.
> The fourth row will have a box with 6789, a box with SOME, and two empty
> boxes.
>
> Skip
>
>
> On Fri, Nov 18, 2011 at 12:10 PM, Raul Miller <[email protected]>
> wrote:
>
> > By the way:
> >
> > locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> >
> > This is defining two things:
> > txt
> > locs
> >
> > txt is the text supplied in x, but preceded and followed by an empty set
> > of tags, and also ensuring that the very first character is not the start
> > of a tag.
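> >
> > A small sketch of that construction, with made-up values (demoTags,
> > demoText and demoTxt are hypothetical names, not part of the function):
> >
> > ```j
> > demoTags=: 1 2 $ '<b>';'</b>'      NB. one start/end tag pair
> > demoText=: 'say <b>hi</b>'         NB. stands in for the left argument x
> > demoTxt=: (' ',;demoTags),demoText,;demoTags
> > demoTxt   NB. leading space, an empty pair, the text, another empty pair
> > ```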
> >
> > The space may be unnecessary -- the only reason I am adding the space on
> > the front is so that I know whether or not the first block of text
> > delimited by tags is part of a tag or not. (But I would have to re-think
> > the rest of the code to see if it could work based on the idea that the
> > start of the text is always a tag.)
> >
> > Meanwhile, I prepend and append an empty set of tags to avoid issues with
> > how I am identifying "sequences of tags". One of your requirements was
> > that tags always appear in order, and if any are missing from that order
> > we get a blank result in the corresponding result position. And I am using
> > J's diagonal function to find the subsequences. And to ensure that found
> > subsequences are always complete, I start and end the text with complete
> > subsequences (otherwise, with certain data, I might have incomplete
> > diagonals...).
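> >
> > For reference, a minimal illustration of the diagonal (oblique) adverb /.
> > on a small made-up table, kept separate from the tag data:
> >
> > ```j
> > NB. illustrative only: box each anti-diagonal of a 3 by 3 table
> > obliques=: </. i. 3 3
> > obliques   NB. five boxes holding 0, then 1 3, then 2 4 6, then 5 7, then 8
> > ```
> >
> > Only the middle diagonal has the full three items, which is the kind of
> > incompleteness at the edges that the padding is meant to avoid.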
> >
> > Anyways, once I have that, locs becomes the locations of the beginnings
> > of your start tags (left column) in txt and of the beginnings of your end
> > tags (right column) in txt.
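> >
> > As a rough illustration of locating where a tag begins, here are the E.
> > and I. primitives on made-up text. Treat it as a sketch of the general
> > idea only; that the function works exactly this way is an assumption:
> >
> > ```j
> > NB. E. marks the positions where 'ab' starts; I. turns the marks into indices
> > hits=: I. 'ab' E. 'xxabyyab'
> > hits   NB. 2 6
> > ```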
> >
> > Except, of course, end tags are not just defined by their text, but also
> > by their location. (Which makes me wonder if perhaps this code should
> > instead be structured to just have a single "end" delimiter instead of
> > trying to pretend that it's reasonable to have different ending tags for
> > different starting tags.)
> >
> > Finally, I should have replaced =: with =. in the function definition
> > before releasing it. (=: makes some kinds of debugging easier but can
> > cause long-run problems).
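> >
> > A minimal sketch of the difference, using two hypothetical verbs (leaky
> > and tidy) and made-up names that do not appear in getTagsContents:
> >
> > ```j
> > leaky=: 3 : 0
> >   tmpGlobal=: y + 1   NB. =: defines a global name that outlives the call
> >   tmpGlobal
> > )
> > tidy=: 3 : 0
> >   tmpLocal=. y + 1    NB. =. is local and is gone when the verb returns
> >   tmpLocal
> > )
> > leaky 1
> > tidy 1
> > 4!:0 'tmpGlobal';'tmpLocal'   NB. 0 _1: a noun, and an undefined name
> > ```
> >
> > The leftover global is handy for inspecting intermediate values, but it
> > can also be silently overwritten by the next call.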
> >
> >
> > getTagsContents=: 4 :0
> > 'n m'=. $tags=. > _2 <\ y
> > locs=. tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> > locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
> > NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
> > data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
> > expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> > }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > )
> >
> >
> > --
> > Raul
> >
> >
> >
> >
> > On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]>
> > wrote:
> >
> > > > I think the problem we are having is that the closing crlftb tag
> > > > string appears in many other places in the file, besides as a closing
> > > > tag for the opening tags. There are many more crlftb strings in the
> > > > text than there are opening tag strings.
> > >
> > > > So the correct statement is that the opening tags are unique, and will
> > > > always start a required text string. Closing tags are not necessarily
> > > > unique, and will close the required strings, as well as appear in
> > > > other places in the file, which can be ignored. Only the first closing
> > > > tag string that appears in the text following a unique opening tag is
> > > > valid as the terminating tag for the text to be extracted.
> > >
> > > > The function should find the opening tag, and then capture all of the
> > > > text up to the first occurrence of the crlftb closing tag. It should
> > > > ignore all subsequent crlftb tags until after it finds a unique
> > > > opening tag, then again capture all of the text up to the first tag
> > > > end string, which in this case will again be the crlftb string.
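> > > >
> > > > The rule above can be sketched for a single tag pair as follows.
> > > > firstBetween is a hypothetical helper written only to illustrate the
> > > > rule; it is not part of getTagsContents and handles just the first
> > > > occurrence of one opening tag:
> > > >
> > > > ```j
> > > > NB. x is the text, y is a boxed pair: start tag ; end tag
> > > > firstBetween=: 4 : 0
> > > >   'start stop'=. y
> > > >   hits=. I. start E. x                 NB. where the start tag begins
> > > >   if. 0 = # hits do. '' return. end.   NB. no start tag: empty result
> > > >   rest=. ((# start) + {. hits) }. x    NB. text after the first start tag
> > > >   ends=. I. stop E. rest               NB. where the end tag begins in rest
> > > >   ({. ends , # rest) {. rest           NB. keep up to the first end tag
> > > > )
> > > > 'param1 = 12345 crlftb junk crlftb' firstBetween 'param1';'crlftb'
> > > > ```
> > > >
> > > > Later crlftb strings are ignored because only the first end position
> > > > after the start tag is used.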
> > >
> > > It looks like the problem is in this line of the function:
> > > locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> > >
> > > But I haven't gotten my head around all that it is doing as yet.
> > >
> > > Here's the whole function, with the assert line commented out.
> > >
> > > getTagsContents=: 4 :0
> > > 'n m'=. $tags=. > _2 <\ y
> > > locs=: tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> > > NB. assert. -:&/:&;/ |:locs NB. tags must be balanced
> > > data=: _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
> > > expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> > > }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > > )
> > >
> > > Skip
> > >
> > > On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]>
> > > wrote:
> > >
> > > > > Yes. As I said in my previous post, the assert statement has been
> > > > > commented out. It would throw an error if the assert wasn't
> > > > > commented out.
> > > > > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> > > >
> > > >> Did you try just removing the assert?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> --
> > > >> Raul
> > > >>
> > > > >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> > > >>
> > > >>> I stated that wrong.
> > > >>>
> > > >>> ( ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > > >>> ┌┬──────────────────────────────────────────────────────────┐
> > > >>> ││_HOSTNAME = Unknown <connected via resource mgr>│
> > > >>> └┴──────────────────────────────────────────────────────────┘
> > > >>>
> > > > >>> It doesn't find the first tag pair, and for some reason, it
> > > > >>> captures *part of* the string *following* the first tag pair.
> > > >>>
> > > >>> Skip
> > > >>>
> > > > >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
> > > >>>
> > > >>> > Raul,
> > > >>> >
> > > > >>> > In my application, the tag pairs will never overlap. Also, the
> > > > >>> > leading tag of a particular string will always be unique. However,
> > > > >>> > it is handy if I can define just the trailing tags of any tag pair
> > > > >>> > to be all the same string. This won't always be the case, as
> > > > >>> > sometimes the closing tag may be unique, so either case should
> > > > >>> > work. Here's an example of some real data in my text log file. I
> > > > >>> > just pulled a section of the text out of the middle of the log:
> > > >>> >
> > > >>> > ww2
> > > >>> >
> > > >>> > PROMPT_DURATION = 1.968
> > > >>> > STATUS = RECOGNITION
> > > > >>> > SERVER_HOSTNAME = Unknown <connected via resource mgr>
> > > > >>> > NUM_RESULTS = 1
> > > > >>> > RESULT[0] = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> > > >>> > CONFIDENCE[0]
> > > >>> >
> > > >>> > Let's look at the data:
> > > >>> > $ ww2
> > > >>> > 260
> > > >>> > q: 260
> > > >>> > 2 2 5 13
> > > >>> >
> > > > >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array
> > > >>> >
> > > >>> > 13 20 $ a. i. ww2
> > > >>> > 10 9 80 82 79 77 80 84 95 68 85 82 65 84 73 79
> 78
> > > 32
> > > >>> > 32 32
> > > >>> > 32 32 32 32 32 32 32 32 61 32 49 46 57 54 56 13
> 10
> > > 9
> > > >>> > 83 84
> > > >>> > 65 84 85 83 32 32 32 32 32 32 32 32 32 32 32 32
> 32
> > > 32
> > > >>> > 32 32
> > > >>> > 32 32 32 32 61 32 82 69 67 79 71 78 73 84 73 79
> 78
> > > 13
> > > >>> > 10 9
> > > >>> > 83 69 82 86 69 82 95 72 79 83 84 78 65 77 69 32
> 32
> > > 32
> > > >>> > 32 32
> > > >>> > 32 32 32 32 32 32 61 32 85 110 107 110 111 119 110 32
> 60
> > > 99
> > > >>> 111
> > > >>> > 110
> > > >>> > 110 101 99 116 101 100 32 118 105 97 32 114 101 115 111 117
> 114
> > > 99
> > > >>> > 101 32
> > > >>> > 109 103 114 62 13 10 9 78 85 77 95 82 69 83 85 76
> 84
> > > 83
> > > >>> > 32 32
> > > >>> > 32 32 32 32 32 32 32 32 32 32 32 32 32 61 32 49
> 13
> > > 10
> > > >>> > 9 82
> > > >>> > 69 83 85 76 84 91 48 93 32 32 32 32 32 32 32 32
> 32
> > > 32
> > > >>> > 32 32
> > > >>> > 32 32 32 32 32 61 32 100 116 109 102 45 57 32 100 116
> 109
> > > 102
> > > >>> > 45 48
> > > >>> > 32 100 116 109 102 45 52 32 100 116 109 102 45 55 32 100
> 116
> > > 109
> > > >>> > 102 45
> > > >>> > 50 13 10 9 67 79 78 70 73 68 69 78 67 69 91 48
> 93
> > > 32
> > > >>> > 32 32
> > > >>> >
> > > > >>> > You can see that each line is terminated with a 13, 10, 9
> > > > >>> > character string (CR, LF, TAB).
> > > >>> >
> > > >>> > we check:
> > > >>> >
> > > >>> > a. i. crlftb
> > > >>> > 13 10 9
> > > >>> >
> > > > >>> > So the crlftb noun contains the three characters that terminate
> > > > >>> > each line.
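> > > > >>> >
> > > > >>> > A sketch of how such a noun could be defined, assuming the nouns
> > > > >>> > CR, LF and TAB from the J standard library are available. The
> > > > >>> > actual definition is not shown in this thread:
> > > > >>> >
> > > > >>> > ```j
> > > > >>> > crlftb=: CR,LF,TAB   NB. assumed definition, for illustration
> > > > >>> > a. i. crlftb         NB. 13 10 9
> > > > >>> > ```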
> > > >>> >
> > > > >>> > I want to capture the row starting with 'STATUS' and the row
> > > > >>> > starting with 'RESULT[0]'. Both rows terminate with the carriage
> > > > >>> > return, line feed, tab sequence. I have commented the assert.
> > > > >>> > statement out of your getTagsContents, so I no longer get the
> > > > >>> > error.
> > > >>> >
> > > >>> > Now I run the function:
> > > >>> >
> > > >>> > ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > > >>> > ┌┬──────────────────────────────────────────────────────────┐
> > > >>> > ││_HOSTNAME = Unknown <connected via resource mgr>│
> > > >>> > └┴──────────────────────────────────────────────────────────┘
> > > >>> >
> > > > >>> > Weird. I get the line AFTER the one I want, and it completely
> > > > >>> > misses the second tag pair.
> > > > >>> > Any ideas what is going on?
> > > >>> >
> > > >>> > Skip
> > > >>> >
> > > >>> >
> > > >>> >
> > > > >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> > > >>> >
> > > > >>> >> P.S. Brian Schott has observed that if you take out the assertion
> > > > >>> >> that is checking for balanced tags, the code works on the crlftab
> > > > >>> >> example.
> > > > >>> >>
> > > > >>> >> This is because my code also has an assumption that tags cannot
> > > > >>> >> overlap.
> > > > >>> >>
> > > > >>> >> I have not thought through what this would mean for other cases
> > > > >>> >> that would currently be rejected.
> > > >>>
> > > >>>
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
>
>
>
> --
> Skip Cave
> Cave Consulting LLC
> Phone: 214-460-4861