Re: [Jprogramming] Finding multiple sequential strings

Skip Cave Wed, 23 Nov 2011 15:44:44 -0800

Both Arie and Raul posted updated functions. I will test each one on the
data I posted at:
https://www.opendrive.com/files?51418263_gn47v


I will try Arie's first:

   ww1A1 =: (ww1t) getFieldsV21 tags6
   $ww1A1
4917 4

   5 {. 500 }. ww1A1
┌───────────────────┬────┬────┬────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ RECOGNITION       │ no │ 73 │ C:\Nuance\V8.5.0\mrcp\logs\2011\10October\29\02-03-23-vx1prn123-7b42060a_00001cd4_4eab5eeb_0022_0000\utt04.wav │
├───────────────────┼────┼────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ NO_SPEECH_TIMEOUT │    │    │                                                                                                                │
├───────────────────┼────┼────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ ABORTED           │    │    │                                                                                                                │
├───────────────────┼────┼────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ NO_SPEECH_TIMEOUT │    │    │                                                                                                                │
├───────────────────┼────┼────┼────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ NO_SPEECH_TIMEOUT │    │    │                                                                                                                │
└───────────────────┴────┴────┴────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Looks good!

Now we try Raul's function:

   tags3
┌──────┬───┬─────────┬───┬───────────────────────────┬───┐
│STATUS│   │RESULT[0]│   │CONFIDENCE[0]             =│   │
└──────┴───┴─────────┴───┴───────────────────────────┴───┘
   ww1R =: cleanString1 (ww1t) getTagsContents tags3
Ignoring overlapped tags on line(s): 6 25 41 158 194 215 258 282 287 299
307 341 381 414 441 443 452 481 484 571 574 610 677 712 748 811 855 1236
1268 1303 1350 1382 1449 1590 1635 1671 1707 1713 1725 1733 1767 1807 1840
1867 1869 1878 1907 1910 1997 2000 ...
|syntax error: getTagsContents
|       smoutput'Ignoring overlapped tags on line(s): ',":1+(I.txt=LF)I.
   $ww1R
$ ww1R


So Raul's function still aborts: it prints the "Ignoring overlapped tags"
warning and then stops with a syntax error in getTagsContents, though I
haven't yet looked at the actual failing data. Raul says it should skip
over the mismatched tags, but it stops as soon as it reports them.
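
For reference, the sentence that fails is the one that turns character
positions into line numbers. In (I. txt=LF) I. overlapped the second I. is
dyadic (interval index), so it needs a right argument. The error display
above stops at that I., which would be consistent with "overlapped" having
ended up on a line of its own when the definition was pasted from a wrapped
email, though that is only a guess from the transcript. A small sketch of
what the sentence computes, using a made-up three-line string and made-up
positions 4 and 7 in place of the real txt and overlapped:

   txt =: 'ab',LF,'cde',LF,'fg'
   I. txt=LF                  NB. positions of the line feeds
2 6
   1 + (I. txt=LF) I. 4 7     NB. 1-based line numbers holding characters 4 and 7
2 3

If the guess is right, the numbers printed in the warning above would be
1 + I. txt=LF, that is, line-feed positions rather than the lines holding
overlapped tags.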

Skip

On Wed, Nov 23, 2011 at 11:21 AM, Raul Miller <[email protected]> wrote:

> Here's a variation which emits a warning when tags overlap:
>
> dups=: ~.@#~ i.@# ~: i.~
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>  txt=. ' ',x,;tags
>  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags [email protected]:0 }. txt
>  overlapped=. dups;{:"1 locs
>  if. #overlapped do.
>   smoutput 'Ignoring overlapped tags on line(s): ',":1+(I.txt=LF) I. overlapped
>    locs=. (#~L:0 ([email protected]:0 dups@;)@:({:"1)) locs
>  end.
>  assert. -:&/:&;/ |:locs  NB. tags must be balanced
>  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
>  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> I should also note that a pair of overlapped tags might span two tag
> sequences.  And I suspect that deleting all damaged sequences (all tag
> sequences which would have contained damaged tags) would just about double
> the complexity of the program -- and I doubt that that's worth doing, given
> that the system already allows damaged tags.
>
> --
> Raul
>
> On Tue, Nov 22, 2011 at 10:54 AM, Raul Miller <[email protected]>
> wrote:
>
> > This version ignores duplicate tags.
> >
> > Note that it's not precisely what you asked for -- it is not deleting the
> > entire tag sequence, it's only skipping over the conflicted tags.  If there
> > is another tag in the sequence which is not conflicted, it will still show
> > up.  This is because I do not identify the sequences until later.
> >
> > dups=: ~.@#~ i.@# ~: i.~
> >
> > getTagsContents=: 4 :0
> >  'n m'=. $tags=. > _2 <\ y
> >  txt=. ' ',x,;tags
> >  locs=. (-@#@[ {. I. {./. ])&.>/\"1 tags [email protected]:0 }. txt
> >  locs=. (#~L:0 ([email protected]:0 dups@;)@:({:"1)) locs
> >  assert. -:&/:&;/ |:locs  NB. tags must be balanced
> >  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> >  expand=. ;(i.n) e.L:0 (<;.1~ 1,2>:/\]) ,I. |:>(e.L:0~ /:~@;) {."1 locs
> >  }: (#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > )
> >
> > Note that another approach might be to use a different technique to
> > extract the tag contents.  If I used character indices to extract them,
> > then I could relax the restriction that tags cannot overlap.
> >
> > FYI,
> >
> > --
> > Raul
> >
> >
> > On Mon, Nov 21, 2011 at 4:44 PM, Skip Cave <[email protected]
> >wrote:
> >
> >> If the program detects an assert failure, it should find the whole tag
> >> sequence (tag1s, tag1e, tag2s, tag2e, etc.) and skip over that entire
> >> bad tag sequence. It should then find the next appearance of the first
> >> start tag (tag1s) and process it as usual.
> >>
> >> Right now, when the assert fails, the whole program stops in the middle
> >> of processing, with no clue where the failure was. In a perfect world,
> >> the program would also note the position of the failed text in a global
> >> variable, so I could inspect the failure later, as well as find out how
> >> many bad tag sets there were in the run. Generally the problem is a
> >> mangled log file. I probably won't be able to fix it anyway, so just
> >> skipping over the bad tag set is the best option.
> >>
> >> Skip
> >>
> >> On Mon, Nov 21, 2011 at 2:47 PM, Raul Miller <[email protected]>
> >> wrote:
> >>
> >> > That assert is checking for unbalanced tags.  You probably have two
> >> > start tags followed by one end tag.
> >> >
> >> > What do you want the program to do for this kind of thing?
> >> >
> >> > --
> >> > Raul
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
