Here's a variation which emits a warning when tags overlap:
dups=: ~.@#~ i.@# ~: i.~
getTagsContents=: 4 :0
'n m'=. $tags=. > _2 <\ y
txt=. ' ',x,;tags
locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt
overlapped=. dups;{:"1 locs
if. #overlapped do.
smoutput 'Ignoring overlapped tags on line(s): ',":1+(I.txt=LF) I. overlapped
locs=. (#~L:0 (-.@e.L:0 dups@;)@:({:"1)) locs
end.
assert. -:&/:&;/ |:locs NB. tags must be balanced
data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
}: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)
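
For anyone reading along, here is a small illustration of the two new pieces, on made-up data rather than the log files. dups returns each item that occurs more than once, and dyadic I. against the LF positions turns a character offset into a 1-based line number. The names sample and offsets are hypothetical, introduced only for this sketch.

```j
dups=: ~.@#~ i.@# ~: i.~

NB. items 2 and 1 repeat; 3 occurs only once
dups 1 2 3 2 1 1               NB. 2 1

NB. made-up three-line text; the LF characters sit at offsets 3 and 7
sample=: 'abc',LF,'def',LF,'ghi'
offsets=: 0 5 9                NB. hypothetical character offsets into sample
1+(I.sample=LF) I. offsets     NB. 1 2 3
```

An offset that lands exactly on an LF is counted as part of the line that the LF terminates.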
I should also note that a pair of overlapped tags might span two tag
sequences. And I suspect that deleting all damaged sequences (all tag
sequences which would have contained damaged tags) would just about double
the complexity of the program -- and I doubt that that's worth doing, given
that the system already allows damaged tags.
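
If skipping a whole damaged log is acceptable (the original request), one possible sketch is a wrapper that runs the verb on one boxed log at a time and records which logs failed. This is only an outline under stated assumptions: perLog, badLogs, demo and logs are hypothetical names, and demo merely stands in for getTagsContents by failing its assertion on any text containing '!'.

```j
NB. hypothetical adverb: apply dyadic verb u to each boxed log in x, with tags y
perLog=: 1 : 0
:
 badLogs=: i.0                  NB. global, so failures can be inspected later
 res=. 0$a:
 for_i. i.#x do.
  try. res=. res , < (>i{x) u y
  catch. badLogs=: badLogs , i  NB. remember the index of the failing log
  end.
 end.
 res
)

NB. stand-in for getTagsContents: assertion fails when the text contains '!'
demo=: 4 : 0
 assert. -. '!' e. x
 (#x) , #y
)

logs=: 'abc' ; 'x!y' ; 'hello'
out=: logs demo perLog 'tt'     NB. two boxed results; log 1 is skipped
badLogs                         NB. 1
```

With the real verb this would presumably be called as ww1 getTagsContents perLog tags4. That yields one boxed result per surviving log rather than a single table, so it is not a drop-in replacement for running on ;ww1.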
--
Raul
On Tue, Nov 22, 2011 at 10:54 AM, Raul Miller <[email protected]> wrote:
> This version ignores duplicate tags.
>
> Note that it's not precisely what you asked for -- it is not deleting the
> entire tag sequence, it's only skipping over the conflicted tags. If there
> is another tag in the sequence which is not conflicted, it will still show
> up. This is because I do not identify the sequences until later.
>
> dups=: ~.@#~ i.@# ~: i.~
>
> getTagsContents=: 4 :0
> 'n m'=. $tags=. > _2 <\ y
> txt=. ' ',x,;tags
> locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt
> locs=. (#~L:0 (-.@e.L:0 dups@;)@:({:"1)) locs
> assert. -:&/:&;/ |:locs NB. tags must be balanced
> data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
> expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
> }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Note that another approach might be to use a different technique to
> extract the tag contents. If I used character indices to extract them,
> then I could relax the restriction that tags cannot overlap.
>
> FYI,
>
> --
> Raul
>
>
> On Mon, Nov 21, 2011 at 4:44 PM, Skip Cave <[email protected]> wrote:
>
>> If the program detects an assert failure, it should find the whole tag
>> sequence (tag1s, tag1e, tag2s, tag2e, etc.) and skip over that entire
>> bad tag sequence. It should then find the next appearance of the first
>> start tag (tag1s) and process it as usual.
>>
>> Right now, when the assert fails, the whole program stops in the middle of
>> processing, with no clue where the failure was. In a perfect world, the
>> program would also note the position of the failed text in a global
>> variable, so I could inspect the failure later, as well as find out how
>> many bad tag sets there were in the run. Generally the problem is a
>> mangled log file. I probably won't be able to fix it anyway, so just
>> skipping over the bad tag set is the best option.
>>
>> Skip
>>
>> On Mon, Nov 21, 2011 at 2:47 PM, Raul Miller <[email protected]>
>> wrote:
>>
>> > That assert is checking for unbalanced tags. You probably have two
>> > start tags followed by one end tag.
>> >
>> > What do you want the program to do for this kind of thing?
>> >
>> > --
>> > Raul
>> >
>> > On Mon, Nov 21, 2011 at 3:43 PM, Skip Cave <[email protected]>
>> > wrote:
>> >
>> > > Raul's getTagsContents function works great on my data. Here's the
>> > > function:
>> > >
>> > > getTagsContents=: 4 :0
>> > > 'n m'=. $tags=. > _2 <\ y
>> > > locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
>> > > assert. -:&/:&;/ |:locs NB. tags must be balanced
>> > > data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
>> > > expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > > }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
>> > > )
>> > >
>> > >
>> > >
>> > > However, I have a few logs that got garbled, and they fail Raul's
>> > > assert test:
>> > >
>> > > ww1 is a boxed array with 1000 text log files in it, one log file
>> > > per box
>> > > $ww1
>> > > 1000
>> > > $ ; ww1
>> > > 32842565
>> > >
>> > > tags4 is a noun containing the four tag pairs that bracket the text
>> > > that I need to extract using Raul's getTagsContents function
>> > >
>> > > tags4
>> > > ┌──────┬───┬─────────┬───┬───────────────────────────┬───┬──────────────────┬───┐
>> > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]             =│   │UTTERANCE_FILENAME│   │
>> > > └──────┴───┴─────────┴───┴───────────────────────────┴───┴──────────────────┴───┘
>> > >
>> > > now we test:
>> > >
>> > > ww1x =: (;ww1) getTagsContents tags4
>> > > |assertion failure: getTagsContents
>> > > | -:&/:&;/|:locs
>> > >
>> > > ww1x =: (; 365 {. ww1) getTagsContents tags4 NB. This works
>> > >
>> > > ww1x =: (; 366 } ww1) getTagsContents tags4
>> > > |assertion failure: getTagsContents
>> > > | -:&/:&;/|:locs
>> > >
>> > > There's the culprit - box no 366 in ww1
>> > >
>> > > There are also a couple of other garbled logs in ww1 that fail the
>> > > assertion test.
>> > >
>> > > Is there any way to build getTagsContents so that, if a specific boxed
>> > > log fails the assertion, the function will skip that boxed log and go
>> > > to the next one?
>> > >
>> > > Skip
>> > > .
>> > > On Sat, Nov 19, 2011 at 1:01 PM, Skip Cave <[email protected]>
>> > > wrote:
>> > >
>> > > > Raul
>> > > >
>> > > > That works like a charm! It gets all the parameters, and puts them in
>> > > > the right columns. Now I'll try it on a larger data file with real
>> > > > data in it:
>> > > >
>> > > > $ww
>> > > > 10 NB. ww has ten log files in it, one box per log file.
>> > > > $;ww
>> > > > 969059 NB. ww unboxed and raveled is a long text string of catenated
>> > > > log files. Each log file has lots of events in it, and each event has
>> > > > lots of parameters.
>> > > >
>> > > > a. i. crlftb
>> > > > 13 10 9 NB. The noun crlftb has CR, LF, Tab in it.
>> > > >
>> > > > NB. This is the terminator string for all the lines in the log file.
>> > > >
>> > > > NB. I want the parameters on every line starting with STATUS,
>> > > > NB. RESULT[0], and CONFIDENCE[0]
>> > > >
>> > > > tags2 =: 'STATUS'; crlftb ; 'RESULT[0]'; crlftb ; 'CONFIDENCE[0]' ; crlftb
>> > > > tags2
>> > > > ┌──────┬───┬─────────┬───┬─────────────┬───┐
>> > > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]│   │
>> > > > └──────┴───┴─────────┴───┴─────────────┴───┘
>> > > >
>> > > > NB. Now the acid test:
>> > > >
>> > > > txt9 =: (; ww) getTagsContents tags2
>> > > > $txt9
>> > > > 120 3
>> > > >
>> > > > So there were 120 events in all the log files that had at least one
>> > > > of the three parameter values we wanted.
>> > > >
>> > > > Let's take a look:
>> > > >
>> > > > cleanString1 10 {. 100 }. txt9
>> > > > ┌───────────┬───────────┬─────────────────┐
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│main menu  │75               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│ninety five│64               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│yes        │86               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > └───────────┴───────────┴─────────────────┘
>> > > >
>> > > > Yes! that's it!
>> > > >
>> > > > Raul, Frasier, Björn, Linda, *thanks to all of you* for helping me on
>> > > > this problem.
>> > > >
>> > > > Now I have to do this same thing to a few thousand log files instead
>> > > > of just 10. Then I need to do all kinds of analysis on the resulting
>> > > > data. I think I know enough J to do the analysis part, but I still may
>> > > > have to ask a question or two, if I get stuck.
>> > > >
>> > > > I'll let you all know how it goes....
>> > > >
>> > > > Skip
>> > > >
>> > > >
>> > > > On Sat, Nov 19, 2011 at 11:24 AM, Raul Miller <[email protected]> wrote:
>> > > >
>> > > >> Note that this isn't really a new function -- it's the same one that
>> > > >> you posted (or would have posted, I think, if you had posted the last
>> > > >> line of it). Except, mine was from a version that had =: instead of =.
>> > > >> for its intermediate results. That's bad, for production code, but it
>> > > >> does let us see what the bug is:
>> > > >>
>> > > >> _4 ]\ expand #inv (+/expand){.data
>> > > >>
>> > > >> ┌────────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
>> > > >> │param1              │param2             │param3                   │param5              │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 12345      │param2 = NONE      │param3 = hello world     │                    │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │                    │                   │param1 = 34567           │param3 = hello bob  │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param5 - zero one   │                   │                         │param5 = two three  │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 6789       │param2 = SOME      │                         │                    │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1              │param2             │param3                   │param5              │
>> > > >> └────────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
>> > > >>
>> > > >> I am not defining "expand" properly. Thus, parameters are being
>> > > >> misplaced.
>> > > >>
>> > > >> If I use an alternate definition for expand, it seems to get the
>> > > >> parameters into the right places:
>> > > >>
>> > > >> expand=: ;0 1 2 3 e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > > >> _4 ]\ expand #inv (+/expand){.data
>> > > >>
>> > > >> ┌───────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
>> > > >> │param1             │param2             │param3                   │param5              │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 12345     │param2 = NONE      │param3 = hello world     │                    │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 34567     │                   │param3 = hello bob       │param5 - zero one   │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │                   │                   │                         │param5 = two three  │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 6789      │param2 = SOME      │                         │                    │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1             │param2             │param3                   │param5              │
>> > > >> └───────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
>> > > >>
>> > > >> ...and this also lets me clean up some unneeded stuff (I no longer
>> > > >> need to add the blank tags to the text I am working with, and so I no
>> > > >> longer need to drop those rows from the result)... except it blows up
>> > > >> if no tags are present, so I can't get rid of that entirely...
>> > > >>
>> > > >> Anyways, here's how it looks with this definition for expand:
>> > > >>
>> > > >> getTagsContents=: 4 :0
>> > > >> 'n m'=. $tags=. > _2 <\ y
>> > > >> locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
>> > > >> assert. -:&/:&;/ |:locs NB. tags must be balanced
>> > > >> data=. _2 {:\ ((/:~ ; locs) I. i.#txt) </. txt
>> > > >> expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > > >> }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
>> > > >> )
>> > > >>
>> > > >> --
>> > > >> Raul
>> > > >>
>> > > >>
>> > >
>> > >
>> > > --
>> > > Skip Cave
>> > > Cave Consulting LLC
>> > > Phone: 214-460-4861
>> > > ----------------------------------------------------------------------
>> > > For information about J forums see http://www.jsoftware.com/forums.htm
>> > >
>> >
>>
>>
>>
>> --
>> Skip Cave
>> Cave Consulting LLC
>> Phone: 214-460-4861
>>
>
>