Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Wed, 23 Nov 2011 09:22:59 -0800

Here's a variation which emits a warning when tags overlap:

dups=: ~.@#~ i.@# ~: i.~
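
For readers unfamiliar with the idiom, a quick check of what dups returns
may help. As I read it, the verb keeps the items at positions that are not
a first occurrence and then takes the nub, so each repeated item is reported
once. The sample lists are made up for illustration:

```j
dups=: ~.@#~ i.@# ~: i.~

NB. positions 2, 4 and 5 repeat earlier items, so 3 and 1 are reported once each
dups 3 1 3 2 1 3
NB. expected: 3 1

NB. a list with no repeats gives an empty result
# dups 4 5 6
NB. expected: 0
```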


getTagsContents=: 4 :0
 'n m'=. $tags=. > _2 <\ y
 txt=. ' ',x,;tags
 locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt
 overlapped=. dups;{:"1 locs
 if. #overlapped do.
   smoutput 'Ignoring overlapped tags on line(s): ',":1+(I.txt=LF) I. overlapped
   locs=. (#~L:0 (-.@e.L:0 dups@;)@:({:"1)) locs
 end.
 assert. -:&/:&;/ |:locs  NB. tags must be balanced
 data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
 expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
 }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)
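
The line numbers in the warning come from dyadic I. (interval index): the
positions of the LF characters act as interval boundaries, and each
character position is mapped to the interval it falls in. A minimal sketch
of just that step, where lineOf and the sample text are hypothetical names
and data that are not part of getTagsContents:

```j
NB. x is a text, y is a list of character positions in x.
NB. Result: the 1-based line number of each position.
lineOf=: 4 : '1 + (I. x = LF) I. y'

txt=: 'ab',LF,'cd',LF,'ef'   NB. three lines; the LFs sit at positions 2 and 5

txt lineOf 0 3 7
NB. expected: 1 2 3
```

An LF itself counts as part of the line it terminates, so txt lineOf 2
gives 1.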

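The assert is unchanged in this variation, so unbalanced tags still stop
the function. As I read it, the check razes all start positions and all
end positions and requires the two lists to grade identically, which fails
whenever the counts differ. A small sketch with made-up positions
(balanced, starts and ends are hypothetical names):

```j
balanced=: -:&/:&;/@|:   NB. same test as the assert, applied to a locs-like table

starts=: 0 10 ; 5        NB. start-tag positions, one box per tag pair
ends=: 3 12 ; 8          NB. matching end-tag positions

balanced starts ,. ends
NB. expected: 1

NB. an extra start tag with no end tag makes the grades differ in length
balanced (0 10 20 ; 5) ,. ends
NB. expected: 0
```
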
I should also note that a pair of overlapped tags might span two tag
sequences.  And I suspect that deleting all damaged sequences (all tag
sequences which would have contained damaged tags) would just about double
the complexity of the program -- and I doubt that that's worth doing, given
that the system already allows damaged tags.
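
If skipping a whole garbled log is acceptable (the request quoted further
down), one coarser option that leaves getTagsContents untouched is J's
adverse conjunction (::), applied to each box separately. This is only a
sketch: parse is a hypothetical stand-in that asserts on bad input, logs is
made-up data, and it assumes no tag sequence spans two log files. With the
real verb, parse would be something like getTagsContents&tags4.

```j
NB. Hypothetical stand-in for the real parser: it asserts on bad input.
parse=: 3 : 0
 assert. 0 < #y   NB. pretend that an empty log is a garbled one
 |. y
)

ok=: 1:@parse :: 0:       NB. 1 if parse succeeds on a log, 0 if it signals an error

logs=: 'abc' ; '' ; 'de'  NB. the middle log is the bad one

good=: ok@> logs          NB. expected: 1 0 1
bad=: I. -. good          NB. indices of the skipped logs; expected: 1
results=: parse&.> good # logs
```

Here bad records which boxes failed, so they can be inspected later. Each
good log is parsed twice, which may matter for large inputs.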

-- 
Raul

On Tue, Nov 22, 2011 at 10:54 AM, Raul Miller <[email protected]> wrote:

> This version ignores duplicate tags.
>
> Note that it's not precisely what you asked for -- it is not deleting the
> entire tag sequence, it's only skipping over the conflicted tags.  If there
> is another tag in the sequence which is not conflicted, it will still show
> up.  This is because I do not identify the sequences until later.
>
> dups=: ~.@#~ i.@# ~: i.~
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>  txt=. ' ',x,;tags
>  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt
>  locs=. (#~L:0 (-.@e.L:0 dups@;)@:({:"1)) locs
>  assert. -:&/:&;/ |:locs  NB. tags must be balanced
>  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Note that another approach might be to use a different technique to
> extract the tag contents.  If I used character indices to extract them,
> then I could relax the restriction that tags cannot overlap.
>
> FYI,
>
> --
> Raul
>
>
> On Mon, Nov 21, 2011 at 4:44 PM, Skip Cave <[email protected]> wrote:
>
>> If the program detects an assert failure, it should find the whole tag
>> sequence (tag1s, tag1e, tag2s, tag2e, etc.) and skip over that entire
>> bad tag sequence. It should then find the next appearance of the first
>> start tag (tag1s) and process it as usual.
>>
>> Right now, when the assert fails, the whole program stops in the middle of
>> processing, with no clue where the failure was. In a perfect world, the
>> program would also note the position of the failed text in a global
>> variable, so I could inspect the failure later, as well as find out how
>> many bad tag sets there were in the run. Generally the problem is a
>> mangled log file. I probably won't be able to fix it anyway, so just
>> skipping over the bad tag set is the best option.
>>
>> Skip
>>
>> On Mon, Nov 21, 2011 at 2:47 PM, Raul Miller <[email protected]>
>> wrote:
>>
>> > That assert is checking for unbalanced tags.  You probably have two
>> > start tags followed by one end tag.
>> >
>> > What do you want the program to do for this kind of thing?
>> >
>> > --
>> > Raul
>> >
>> > On Mon, Nov 21, 2011 at 3:43 PM, Skip Cave <[email protected]>
>> > wrote:
>> >
>> > > Raul's getTagsContents function works great on my data. Here's the
>> > > function:
>> > >
>> > > getTagsContents=: 4 :0
>> > >  'n m'=. $tags=. > _2 <\ y
>> > >  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
>> > >  assert. -:&/:&;/ |:locs  NB. tags must be balanced
>> > >  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>> > >  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > >  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
>> > > )
>> > >
>> > >
>> > >
>> > > However, I have a few logs that got garbled, and they fail Raul's
>> > > assert test:
>> > >
>> > > ww1 is a boxed array with 1000 text log files in it, one log file
>> > > per box.
>> > >
>> > >   $ww1
>> > > 1000
>> > >   $ ; ww1
>> > > 32842565
>> > >
>> > > tags4 is a noun containing the four tag pairs that bracket the text
>> > > that I need to extract using Raul's getTagsContents function:
>> > >
>> > >   tags4
>> > > ┌──────┬───┬─────────┬───┬───────────────────────────┬───┬──────────────────┬───┐
>> > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]             =│   │UTTERANCE_FILENAME│   │
>> > > └──────┴───┴─────────┴───┴───────────────────────────┴───┴──────────────────┴───┘
>> > >
>> > > now we test:
>> > >
>> > >   ww1x =: (;ww1) getTagsContents  tags4
>> > > |assertion failure: getTagsContents
>> > > |   -:&/:&;/|:locs
>> > >
>> > >    ww1x =: (;  365 {. ww1) getTagsContents  tags4  NB. This works
>> > >
>> > >   ww1x =: (; 366 { ww1) getTagsContents   tags4
>> > > |assertion failure: getTagsContents
>> > > |   -:&/:&;/|:locs
>> > >
>> > > There's the culprit - box no 366 in ww1
>> > >
>> > > There are also a couple of other garbled logs in ww1 that fail the
>> > > assertion test.
>> > >
>> > > Is there any way to build getTagsContents so that, if a specific boxed
>> > > log fails the assertion, the function will skip that boxed log and go
>> > > to the next one?
>> > >
>> > > Skip
>> > > .
>> > > On Sat, Nov 19, 2011 at 1:01 PM, Skip Cave <[email protected]>
>> > > wrote:
>> > >
>> > > > Raul
>> > > >
>> > > > That works like a charm! It gets all the parameters, and puts them
>> > > > in the right columns. Now I'll try it on a larger data file with real
>> > > > data in it:
>> > > >
>> > > >   $ww
>> > > > 10         NB. ww has ten log files in it, one box per log file.
>> > > >    $;ww
>> > > > 969059  NB. ww unboxed and raveled is a long text string of catenated
>> > > > NB. log files. Each log file has lots of events in it, and each event
>> > > > NB. has lots of parameters.
>> > > >
>> > > >    a. i. crlftb
>> > > > 13 10 9      NB. The noun crlftb has CR, LF, Tab in it.
>> > > >
>> > > > NB. This is the terminator string for all the lines in the log file.
>> > > >
>> > > > NB. I want the parameters on every line starting with STATUS,
>> > > > NB. RESULT[0], and CONFIDENCE[0]
>> > > >
>> > > > tags2 =: 'STATUS'; crlftb ; 'RESULT[0]'; crlftb ; 'CONFIDENCE[0]' ; crlftb
>> > > >    tags2
>> > > > ┌──────┬───┬─────────┬───┬─────────────┬───┐
>> > > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]│   │
>> > > > └──────┴───┴─────────┴───┴─────────────┴───┘
>> > > >
>> > > > NB. Now the acid test:
>> > > >
>> > > >  txt9 =:  (; ww) getTagsContents tags2
>> > > >    $txt9
>> > > > 120 3
>> > > >
>> > > > So there were 120 events in all the log files that had at least one
>> > > > of the three parameter values we wanted in them.
>> > > >
>> > > > Let's take a look:
>> > > >
>> > > >   cleanString1 10 {. 100 }. txt9
>> > > > ┌───────────┬───────────┬─────────────────┐
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│main menu  │75               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│ninety five│64               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_STR] 0│
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │RECOGNITION│yes        │86               │
>> > > > ├───────────┼───────────┼─────────────────┤
>> > > > │           │           │[0][__MRCP_GID] 0│
>> > > > └───────────┴───────────┴─────────────────┘
>> > > >
>> > > > Yes! that's it!
>> > > >
>> > > > Raul, Frasier, Björn, Linda, *thanks to all of you* for helping me
>> > > > on this problem.
>> > > >
>> > > > Now I have to do this same thing to a few thousand log files instead
>> > > > of just 10. Then I need to do all kinds of analysis on the resulting
>> > > > data. I think I know enough J to do the analysis part, but I still may
>> > > > have to ask a question or two, if I get stuck.
>> > > >
>> > > > I'll let you all know how it goes....
>> > > >
>> > > > Skip
>> > > >
>> > > >
>> > > > On Sat, Nov 19, 2011 at 11:24 AM, Raul Miller <[email protected]> wrote:
>> > > >
>> > > >> Note that this isn't really a new function -- it's the same one that
>> > > >> you posted (or would have posted, I think, if you had posted the last
>> > > >> line of it).  Except, mine was from a version that had =: instead of
>> > > >> =. for its intermediate results.  That's bad for production code, but
>> > > >> it does let us see what the bug is:
>> > > >>
>> > > >>   _4 ]\ expand #inv (+/expand){.data
>> > > >>
>> > > >> ┌────────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
>> > > >> │param1              │param2             │param3                   │param5              │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1    =  12345  │param2    =   NONE │param3   =   hello world │                    │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │                    │                   │param1  = 34567          │param3  = hello bob │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param5   - zero one │                   │                         │param5 = two three  │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 6789       │param2 = SOME      │                         │                    │
>> > > >> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1              │param2             │param3                   │param5              │
>> > > >> └────────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
>> > > >>
>> > > >> I am not defining "expand" properly.  Thus, parameters are being
>> > > >> misplaced.
>> > > >>
>> > > >> If I use an alternate definition for expand, it seems to get the
>> > > >> parameters
>> > > >> into the right places:
>> > > >>
>> > > >>   expand=: ;0 1 2 3 e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > > >>   _4 ]\ expand #inv (+/expand){.data
>> > > >>
>> > > >> ┌───────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
>> > > >> │param1             │param2             │param3                   │param5              │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1    =  12345 │param2    =   NONE │param3   =   hello world │                    │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1  = 34567    │                   │param3  = hello bob      │param5   - zero one │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │                   │                   │                         │param5 = two three  │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1 = 6789      │param2 = SOME      │                         │                    │
>> > > >> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
>> > > >> │param1             │param2             │param3                   │param5              │
>> > > >> └───────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
>> > > >>
>> > > >> ...and this also lets me clean up some unneeded stuff (I no longer
>> > > >> need to add the blank tags to the text I am working with, and so I no
>> > > >> longer need to drop those rows from the result... except it blows up
>> > > >> if no tags are present, so I can't get rid of that entirely).
>> > > >>
>> > > >> Anyways, here's how it looks with this definition for expand:
>> > > >>
>> > > >> getTagsContents=: 4 :0
>> > > >>  'n m'=. $tags=. > _2 <\ y
>> > > >>   locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
>> > > >>   assert. -:&/:&;/ |:locs  NB. tags must be balanced
>> > > >>   data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>> > > >>  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>> > > >>  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
>> > > >> )
>> > > >>
>> > > >> --
>> > > >> Raul
>> > > >>
>> > > >>
>> > >
>> > >
>> > > --
>> > > Skip Cave
>> > > Cave Consulting LLC
>> > > Phone: 214-460-4861
>> > > ----------------------------------------------------------------------
>> > > For information about J forums see http://www.jsoftware.com/forums.htm
>> > >
>> >
>>
>>
>>
>>
>
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
