Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Fri, 18 Nov 2011 08:43:54 -0800

Not really.

Ok, true, both are about tagged text.  But xml deals with nested,
balanced tags mixed in with structures other than tags.  Here, the tags
cannot nest or overlap, are not balanced, and no other structures are
supported.


In other words, this is like xml in the way that a sidewalk is like a
pyramid.
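
For the simpler problem as Skip states it below (a unique opening tag,
then everything up to the first closing string after it), a minimal
sketch might look like the following.  This is not the getTagsContents
code quoted below: the names between, start, stop and txt are
hypothetical, the sample text is made up, and I am assuming the text and
both tags are plain character lists.

```j
NB. between: hypothetical helper, not part of the quoted getTagsContents.
NB. x is the text; y is a boxed pair: opening tag ; closing tag.
between=: 4 : 0
  'start stop'=. y
  NB. drop everything through the first opening tag
  NB. (if the tag is absent, this drops the whole text)
  rest=. (((start E. x) i. 1) + #start) }. x
  NB. keep what precedes the first closing tag (all of rest if none)
  ((stop E. rest) i. 1) {. rest
)

NB. made-up sample text; CRLF and TAB come from the standard library
txt=: 'A = 1',CRLF,TAB,'STATUS = OK',CRLF,TAB,'B = 2',CRLF,TAB
txt between 'STATUS';CRLF,TAB
```

On that sample the last line should leave ' = OK'.  Because only the
first closing string after the opening tag is used, any other
occurrences of the closing string are simply ignored, which is why no
balance check is needed here.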

-- 
Raul

2011/11/18 Björn Helgason <[email protected]>

> Sounds like an exercise in early xml
>
> 2011/11/17 Skip Cave <[email protected]>
>
> > I think the problem we are having is that the closing crlftb tag string
> > appears in many other places in the file, besides as a closing tag for the
> > opening tags. There are many more crlftb strings in the text than there are
> > opening tag strings.
> >
> > So the correct statement is that the opening tags are unique, and will
> > always start a required text string. Closing tags are not necessarily
> > unique, and will close the required strings, as well as appear in other
> > places in the file, which can be ignored. Only the first closing tag string
> > that appears in the text following a unique opening tag is valid as the
> > terminating tag for the text to be extracted.
> >
> > The function should find the opening tag, and then capture all of the text
> > up to the first occurrence of the crlftb closing tag. It should ignore all
> > subsequent crlftb tags until after it finds a unique opening tag, then
> > again capture all of the text up to the first tag end string, which in this
> > case will again be the crlftb string.
> >
> > It looks like the problem is in this line of the function:
> >  locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
> >
> > But I haven't gotten my head around all that it is doing as yet.
> >
> > Here's the whole function, with the assert line commented out.
> >
> > getTagsContents=: 4 :0
> >  'n m'=. $tags=. > _2 <\ y
> >  locs=:  tags I.@E.L:0 }. txt=.(' ',;tags),x,;tags
> >  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
> >  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> >  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> >  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > )
> >
> > Skip
> >
> > On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]>
> > wrote:
> >
> > > Yes. As I said in my previous post, the assert statement has been
> > > commented out. It would throw an error, if the assert wasn't commented out.
> > > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> > >
> > >> Did you try just removing the assert?
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Raul
> > >>
> > >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> > >>
> > >>> I stated that wrong.
> > >>>
> > >>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > >>> ┌┬──────────────────────────────────────────────────────────┐
> > >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > >>> └┴──────────────────────────────────────────────────────────┘
> > >>>
> > >>> It doesn't find the first tag pair, and for some reason, it captures
> > >>> *part of* the string *following* the first tag pair.
> > >>>
> > >>> Skip
> > >>>
> > >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
> > >>>
> > >>> > Raul,
> > >>> >
> > >>> > In my application, the tag pairs will never overlap. Also, the leading tag
> > >>> > of a particular string will always be unique. However, it is handy if I can
> > >>> > define just the trailing tags of any tag pair to be all the same string.
> > >>> > This won't always be the case, as sometimes the closing tag may be unique,
> > >>> > so either case should work. Here's an example of some real data in my text
> > >>> > log file. I just pulled a section of the text out of the middle of the log:
> > >>> >
> > >>> >    ww2
> > >>> >
> > >>> >     PROMPT_DURATION           = 1.968
> > >>> >     STATUS                    = RECOGNITION
> > >>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> > >>> >     NUM_RESULTS               = 1
> > >>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> > >>> >     CONFIDENCE[0]
> > >>> >
> > >>> > Let's look at the data:
> > >>> >    $ ww2
> > >>> > 260
> > >>> >    q: 260
> > >>> > 2 2 5 13
> > >>> >
> > >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array
> > >>> >
> > >>> >     13 20  $ a. i. ww2
> > >>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
> > >>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
> > >>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
> > >>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
> > >>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
> > >>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
> > >>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
> > >>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
> > >>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
> > >>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
> > >>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
> > >>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
> > >>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
> > >>> >
> > >>> > You can see that each line is terminated with a 13, 10, 9 character string
> > >>> > (CR, LF, TAB)
> > >>> >
> > >>> > we check:
> > >>> >
> > >>> >    a. i. crlftb
> > >>> > 13 10 9
> > >>> >
> > >>> > So the crlftb noun contains the three characters that terminate each line.
> > >>> >
> > >>> > I want to capture the row starting with 'STATUS' and the row starting with
> > >>> > 'RESULT[0]'.
> > >>> > Both rows terminate with the carriage return, line feed, tab sequence.
> > >>> > I have commented the assert. statement out of your getTagsContents, so I no
> > >>> > longer get the error.
> > >>> >
> > >>> > Now I run the function:
> > >>> >
> > >>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > >>> > ┌┬──────────────────────────────────────────────────────────┐
> > >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > >>> > └┴──────────────────────────────────────────────────────────┘
> > >>> >
> > >>> > Weird. I get the line AFTER the one I want, and it completely misses the
> > >>> > second tag pair.
> > >>> > Any ideas what is going on?
> > >>> >
> > >>> > Skip
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> > >>> >
> > >>> >> P.S. Brian Schott has observed that if you take out the assertion that is
> > >>> >> checking for balanced tags, the code works on the crlftab example.
> > >>> >>
> > >>> >> This is because my code also has an assumption that tags cannot overlap.
> > >>> >>
> > >>> >> I have not thought through what this would mean for other cases that would
> > >>> >> currently be rejected.
> > >>>
> > >>>
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
>
>
>
> --
> Björn Helgason, Verkfræðingur
> Fornustekkum II
> 781 Hornafirði,
> t-póst: [email protected]
> gsm: +3546985532
> twitter: @flugfiskur
> http://groups.google.com/group/J-Programming
>
>
> Tæknikunnátta höndlar hið flókna, sköpunargáfa er meistari einfaldleikans
>
> góður kennari getur stigið á tær án þess að glansinn fari af skónum
>          /|_      .-----------------------------------.
>         ,'  .\  /  | Með léttri lund verður        |
>     ,--'    _,'   | Dagurinn í dag                     |
>    /       /       | Enn betri en gærdagurinn  |
>   (   -.  |        `-----------------------------------'
>   |     ) |         (\_ _/)
>  (`-.  '--.)       (='.'=)   ♖♘♗♕♔♙
>   `. )----'        (")_(") ☃☠
