Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Tue, 22 Nov 2011 07:56:42 -0800

This version ignores duplicate tags.

Note that it's not precisely what you asked for: it does not delete the
entire tag sequence; it only skips over the conflicted tags.  If another
tag in the sequence is not conflicted, it will still show up.  This is
because I do not identify the sequences until later.


dups=: ~.@#~ i.@# ~: i.~
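
As a quick check of what dups returns, here is a minimal sketch on a made-up
numeric list (the input is illustrative only, not taken from the log data).
It keeps the items that occur more than once, each listed once.

```j
dups=: ~.@#~ i.@# ~: i.~    NB. same definition as above

NB. i.~ gives the index of each item's first occurrence; comparing that with
NB. i.@# marks every later repeat, and ~. reduces the marked items to a set.
dups 3 1 4 1 5 3 1          NB. the repeated values: 1 3
```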

getTagsContents=: 4 :0
 'n m'=. $tags=. > _2 <\ y
 txt=. ' ',x,;tags
 locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt
 locs=. (#~L:0 (-.@e.L:0 dups@;)@:({:"1)) locs
 assert. -:&/:&;/ |:locs  NB. tags must be balanced
 data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
 expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
 }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)
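
The assert line can be tried in isolation.  The following is a small sketch
with invented positions (the name balanced and the numbers are hypothetical).
Each row of the argument plays the role of a row of |:locs: start positions
on top, end positions below, one box per tag.  The expression compares the
sort order of all start positions with that of all end positions.

```j
balanced=: -:&/:&;/              NB. the assert expression, named for illustration

NB. two tags: the first opens at 1 and 8, the second opens at 4
ok=:  (1 8 ; 4) ,: (3 10 ; 6)    NB. every start has a matching end
bad=: (1 8 ; 4) ,: (3 ; 6)       NB. the second end of the first tag is missing

balanced ok                      NB. 1
balanced bad                     NB. 0
```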

Note that another approach might be to use a different technique to extract
the tag contents.  If I used character indices to extract them, then I
could relax the restriction that tags cannot overlap.
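
For the earlier question about skipping a garbled log rather than stopping,
one possible sketch (not what the version above does) is to call
getTagsContents on each boxed log separately inside try./catch. and record
which logs failed.  The names getTagsEach and badLogs are hypothetical, and
this assumes getTagsContents is already defined and that no event spans two
logs.

```j
NB. x: boxed list of log texts, y: the tag list passed to getTagsContents
getTagsEach=: 4 : 0
 badLogs=: i.0                NB. global: indices of the logs that failed
 res=. 0 $ a:                 NB. one boxed result per log that succeeded
 for_i. i.#x do.
  try.
   res=. res , < (>i{x) getTagsContents y
  catch.
   badLogs=: badLogs , i      NB. remember the failing log, then carry on
  end.
 end.
 res
)
```

Usage would look like ww1x=: ww1 getTagsEach tags4, after which badLogs
lists the boxes that were skipped.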

FYI,

-- 
Raul

On Mon, Nov 21, 2011 at 4:44 PM, Skip Cave <[email protected]> wrote:

> If the program detects an assert failure, it should find the whole tag
> sequence (tag1s, tag1e, tag2s, tag2e, etc.) and skip over that
> entire bad tag sequence. It should then find the next appearance of the first
> start tag (tag1s) and process it as usual.
>
> Right now, when the assert fails, the whole program stops in the middle of
> processing, with no clue where the failure was. In a perfect world, the
> program would also  note the position of the failed text in a global
> variable, so I could inspect the failure later, as well as find out how
> many bad tag sets there were in the run. Generally the problem is a mangled
> log file. I probably won't be able to fix it anyway, so just skipping over
> the bad tag set is the best option.
>
> Skip
>
> On Mon, Nov 21, 2011 at 2:47 PM, Raul Miller <[email protected]>
> wrote:
>
> > That assert is checking for unbalanced tags.  You probably have two start
> > tags followed by one end tag.
> >
> > What do you want the program to do for this kind of thing?
> >
> > --
> > Raul
> >
> > On Mon, Nov 21, 2011 at 3:43 PM, Skip Cave <[email protected]>
> > wrote:
> >
> > > Raul's getTagsContents function works great on my data. Here's the
> > > function:
> > >
> > > getTagsContents=: 4 :0
> > >  'n m'=. $tags=. > _2 <\ y
> > >  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
> > >  assert. -:&/:&;/ |:locs  NB. tags must be balanced
> > >  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> > >  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
> > >  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > > )
> > >
> > >
> > >
> > > However, I have a few logs that got garbled, and they fail Raul's assert
> > > test:
> > >
> > > ww1 is a boxed array with 1000 text log files in it, one log file per box
> > >
> > >   $ww1
> > > 1000
> > >   $ ; ww1
> > > 32842565
> > >
> > > tags4 is a noun containing the four tag pairs that bracket the text that I
> > > need to extract using Raul's getTagsContents function
> > >
> > >   tags4
> > > ┌──────┬───┬─────────┬───┬───────────────────────────┬───┬──────────────────┬───┐
> > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]             =│   │UTTERANCE_FILENAME│   │
> > > └──────┴───┴─────────┴───┴───────────────────────────┴───┴──────────────────┴───┘
> > >
> > > now we test:
> > >
> > >   ww1x =: (;ww1) getTagsContents  tags4
> > > |assertion failure: getTagsContents
> > > |   -:&/:&;/|:locs
> > >
> > >    ww1x =: (;  365 {. ww1) getTagsContents  tags4  NB. This works
> > >
> > >   ww1x =: (; 366 } ww1) getTagsContents   tags4
> > > |assertion failure: getTagsContents
> > > |   -:&/:&;/|:locs
> > >
> > > There's the culprit - box no 366 in ww1
> > >
> > > There are also a couple of other garbled logs in ww1 that fail the
> > > assertion test.
> > >
> > > Is there any way to build getTagsContents so that if a specific boxed log
> > > fails the assertion, the function will skip that boxed log and go to the
> > > next one?
> > >
> > > Skip
> > > .
> > > On Sat, Nov 19, 2011 at 1:01 PM, Skip Cave <[email protected]>
> > > wrote:
> > >
> > > > Raul
> > > >
> > > > That works like a charm! It gets all the parameters, and puts them in the
> > > > right columns. Now I'll try it on a larger data file with real data in it:
> > > >
> > > >   $ww
> > > > 10         NB. ww has ten log files in it, one box per log file.
> > > >    $;ww
> > > > 969059  NB. ww unboxed and raveled is a long text string of catenated log
> > > > files. Each log file has lots of events in it, and each event has lots of
> > > > parameters.
> > > >
> > > >    a. i. crlftb
> > > > 13 10 9      NB. The noun crlftb has CR, LF, Tab in it.
> > > >
> > > > NB. This is the terminator string for all the lines in the log file.
> > > >
> > > > NB. I want the parameters on every line starting with STATUS, RESULT[0],
> > > > and CONFIDENCE[0]
> > > >
> > > > tags2 =: 'STATUS'; crlftb ; 'RESULT[0]'; crlftb ; 'CONFIDENCE[0]' ; crlftb
> > > >    tags2
> > > > ┌──────┬───┬─────────┬───┬─────────────┬───┐
> > > > │STATUS│   │RESULT[0]│   │CONFIDENCE[0]│   │
> > > > └──────┴───┴─────────┴───┴─────────────┴───┘
> > > >
> > > > NB. Now the acid test:
> > > >
> > > >  txt9 =:  (; ww) getTagsContents tags2
> > > >    $txt9
> > > > 120 3
> > > >
> > > > So there were 120 events in all the log files that had at least one of the
> > > > three parameter values we wanted in them.
> > > >
> > > > Let's take a look:
> > > >
> > > >   cleanString1 10 {. 100 }. txt9
> > > > ┌───────────┬───────────┬─────────────────┐
> > > > │           │           │[0][__MRCP_GID] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_STR] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │RECOGNITION│main menu  │75               │
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_GID] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_STR] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │RECOGNITION│ninety five│64               │
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_GID] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_STR] 0│
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │RECOGNITION│yes        │86               │
> > > > ├───────────┼───────────┼─────────────────┤
> > > > │           │           │[0][__MRCP_GID] 0│
> > > > └───────────┴───────────┴─────────────────┘
> > > >
> > > > Yes! that's it!
> > > >
> > > > Raul, Frasier, Björn, Linda, *thanks to all of you* for helping me on
> > > > this problem.
> > > >
> > > > Now I have to do this same thing to a few thousand log files instead of
> > > > just 10. Then I need to do all kinds of analysis on the resulting data. I
> > > > think I know enough J to do the analysis part, but I still may have to ask
> > > > a question or two, if I get stuck.
> > > >
> > > > I'll let you all know how it goes....
> > > >
> > > > Skip
> > > >
> > > >
> > > > On Sat, Nov 19, 2011 at 11:24 AM, Raul Miller <[email protected]> wrote:
> > > >
> > > > >> Note that this isn't really a new function -- it's the same one that you
> > > > >> posted (or would have posted, I think, if you had posted the last line of
> > > > >> it).  Except, mine was from a version that had =: instead of =. for its
> > > > >> intermediate results.  That's bad, for production code, but it does let us
> > > > >> see what the bug is:
> > > >>
> > > >>   _4 ]\ expand #inv (+/expand){.data
> > > >>
> > > >>
> > >
> >
> ┌────────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
> > > >> │param1              │param2             │param3
> > > │param5
> > > >>             │
> > > >>
> > > >>
> > >
> >
> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1    =  12345  │param2    =   NONE │param3   =   hello world │
> > > >>             │
> > > >>
> > > >>
> > >
> >
> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │                    │                   │param1  = 34567
> > >  │param3
> > > >>  = hello bob │
> > > >>
> > > >>
> > >
> >
> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param5   - zero one │                   │
> > > │param5
> > > >> = two three  │
> > > >>
> > > >>
> > >
> >
> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1 = 6789       │param2 = SOME      │                         │
> > > >>             │
> > > >>
> > > >>
> > >
> >
> ├────────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1              │param2             │param3
> > > │param5
> > > >>             │
> > > >>
> > > >>
> > >
> >
> └────────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
> > > >>
> > > >> I am not defining "expand" properly.  Thus, parameters are being
> > > >> misplaced.
> > > >>
> > > >> If I use an alternate definition for expand, it seems to get the
> > > >> parameters
> > > >> into the right places:
> > > >>
> > > > >>   expand=: ;0 1 2 3 e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
> > > >>   _4 ]\ expand #inv (+/expand){.data
> > > >>
> > > >>
> > >
> >
> ┌───────────────────┬───────────────────┬─────────────────────────┬────────────────────┐
> > > >> │param1             │param2             │param3
> > > │param5
> > > >>           │
> > > >>
> > > >>
> > >
> >
> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1    =  12345 │param2    =   NONE │param3   =   hello world │
> > > >>           │
> > > >>
> > > >>
> > >
> >
> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1  = 34567    │                   │param3  = hello bob
> > >  │param5
> > > >> - zero one │
> > > >>
> > > >>
> > >
> >
> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │                   │                   │
> > > │param5
> > > >> =
> > > >> two three  │
> > > >>
> > > >>
> > >
> >
> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1 = 6789      │param2 = SOME      │                         │
> > > >>           │
> > > >>
> > > >>
> > >
> >
> ├───────────────────┼───────────────────┼─────────────────────────┼────────────────────┤
> > > >> │param1             │param2             │param3
> > > │param5
> > > >>           │
> > > >>
> > > >>
> > >
> >
> └───────────────────┴───────────────────┴─────────────────────────┴────────────────────┘
> > > >>
> > > > >> ...and this also lets me clean up some unneeded stuff (I no longer need to
> > > > >> add the blank tags to the text I am working with, and so I no longer need
> > > > >> to drop those rows from the result), except it blows up if no tags are
> > > > >> present, so I can't get rid of that entirely...
> > > >>
> > > >> Anyways, here's how it looks with this definition for expand:
> > > >>
> > > >> getTagsContents=: 4 :0
> > > >>  'n m'=. $tags=. > _2 <\ y
> > > > >>  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags I.@E.L:0 }. txt=. ' ',x,;tags
> > > > >>  assert. -:&/:&;/ |:locs  NB. tags must be balanced
> > > > >>  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> > > > >>  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
> > > >>  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > > >> )
> > > >>
> > > >> --
> > > >> Raul
> > > >>
> > > >>
> > >
> > >
> > > --
> > > Skip Cave
> > > Cave Consulting LLC
> > > Phone: 214-460-4861
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
