Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Sat, 19 Nov 2011 05:49:42 -0800

getTagsContents=: 4 :0
  'n m'=. $tags=. > _2 <\ y
  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
  locs=: (-@#@[ {. I. {./. ])&.>/\"1 locs
  assert. -:&/:&;/ |:locs  NB. tags must be balanced
  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)


tags1 =: ('param1'; 'crlftb' ; 'param2'; 'crlftb' ; 'param3' ; 'crlftb'
;'param5' ; 'crlftb' )

(and textfile3 from Skip's message, below):

   textfile3 getTagsContents tags1
┌──────────────┬─────────────┬───────────────────┬──────────────┐
│    =  12345  │    =   NONE │   =   hello world │              │
├──────────────┼─────────────┼───────────────────┼──────────────┤
│              │             │  = 34567          │  = hello bob │
├──────────────┼─────────────┼───────────────────┼──────────────┤
│   - zero one │             │                   │ = two three  │
├──────────────┼─────────────┼───────────────────┼──────────────┤
│ = 6789       │ = SOME      │                   │              │
└──────────────┴─────────────┴───────────────────┴──────────────┘
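
The boxes above hold everything between each start tag and its end tag, so the '=' (or '-') and the padding blanks are still included. If only the values are wanted, as in the table Skip expected, one possible clean-up step is sketched here. The name clean is made up for this illustration, and the sketch assumes the standard library verb dltb (delete leading and trailing blanks) is loaded, as it is in a default session.

```j
   NB. clean is a hypothetical helper: drop leading blanks, '=' and '-', then trim
   clean=: dltb@(}.~ e.&' =-' i. 0:)
   clean '    =  12345 '
12345
   clean '   - zero one '
zero one
   # clean ''
0
```

It would be applied to a boxed result with clean&.> and it has an obvious limitation: a value that itself starts with '-' or '=' loses that character.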

Note that I wrote it so that the flat text is the left argument and the
tags are the right argument.
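
As a small illustration of that right argument, this is how the first line of the verb pairs the tags up. It is only a sketch: demo is a made-up, shortened tag list, and nothing here depends on the rest of the definition.

```j
   demo=: 'param1';'crlftb';'param2';'crlftb'
   $ > _2 <\ demo        NB. one row per (start tag, end tag) pair
2 2
   {."1 > _2 <\ demo     NB. left column: the start tags
┌──────┬──────┐
│param1│param2│
└──────┴──────┘
   ; > _2 <\ demo        NB. the run of tags that is glued onto both ends of x
param1crlftbparam2crlftb
```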

FYI,

-- 
Raul



On Sat, Nov 19, 2011 at 1:15 AM, Skip Cave <[email protected]> wrote:

> OK, here's Raul's function:
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>  locs=.  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
>  locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
>  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
>  data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
>  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
> Here's my data:
>
>  textfile3=: 0 : 0
> Start of event 1 crlftb
> Some text crlftb
> Some more text crlftb
> param1    =  12345 crlftb
> param2    =   NONE crlftb
> some comments crlftb
> param3   =   hello world crlftb
> more comments crlftb
> param4  =  120.45 crlftb
> param6 = Test y crlftb
> End of stuff crlftb
> Start of event 2 crlftb
> Some text crlftb
> Some different text crlftb
> param1  = 34567 crlftb
> param3  = hello bob crlftb
> param4  = 32.89 crlftb
> comments and more comments crlftb
> param5   - zero one crlftb
> param 6  = Test z crlftb
> Second end crlftb
> Start of event 3 crlftb
> param5 = two three crlftb
> Start of event 4 crlftb
> stuff crlftb
> param1 = 6789 crlftb
> param2 = SOME crlftb
> end crlftb
> )
>
>   $ textfile3
> 642
>
> I put the crlftb text on the end of each line, to show the equivalent of
> the invisible characters that are actually in the real data.
>
> Here is the string of tag pairs:
>
> tags1 =: ('param1'; 'crlftb' ; 'param2'; 'crlftb' ; 'param3' ; 'crlftb' ;
> 'param5' ; 'crlftb' )
>   tags1
> ┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
> │param1│crlftb│param2│crlftb│param3│crlftb│param5│crlftb│
> └──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
>
> So now we try Raul's function:
>
>  txt5 =:  tags1 getTagsContents   textfile3
> |domain error: getTagsContents
> |   locs=.tags [email protected]:0}.txt=.(' ',;tags),x    ,;tags
>
> Nope. What we should get is:
>
> ┌─────┬────┬───────────┬─────────┐
> │12345│NONE│hello world│         │
> ├─────┼────┼───────────┼─────────┤
> │34567│    │hello bob  │zero one │
> ├─────┼────┼───────────┼─────────┤
> │     │    │           │two three│
> ├─────┼────┼───────────┼─────────┤
> │6789 │SOME│           │         │
> └─────┴────┴───────────┴─────────┘
>
> I should get a 4 x 4 boxed array with 12345, NONE, hello world, and an
> empty box in the first row of boxes.
> The second row will have boxes containing 34567, an empty box, hello bob,
> and zero one.
> The third row will have three empty boxes, and the fourth box will have
> 'two three' in it.
> The fourth row will have a box with 6789, a box with SOME, and two empty
> boxes.
>
> Skip
>
>
> On Fri, Nov 18, 2011 at 12:10 PM, Raul Miller <[email protected]>
> wrote:
>
> > By the way:
> >
> >  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> >
> > This is defining two things:
> >   txt
> >   locs
> >
> > txt is the text supplied in x, but preceded and followed by an empty set
> > of tags, and also ensuring that the very first character is not the
> > start of a tag.
> >
> > The space may be unnecessary -- the only reason I am adding the space on
> > the front is so that I know whether or not the first block of text
> > delimited by tags is part of a tag or not.  (But I would have to re-think
> > the rest of the code to see if it could work based on the idea that the
> > start of the text is always a tag.)
> >
> > Meanwhile, I prepend and append an empty set of tags to avoid issues with
> > how I am identifying "sequences of tags".  One of your requirements was
> > that tags always appear in order, and if any are missing from that order
> > we get a blank result in the corresponding result position.  And I am
> > using J's diagonal function to find the subsequences.  And to ensure that
> > found subsequences are always complete, I start and end the text with
> > complete subsequences (otherwise, with certain data, I might have
> > incomplete diagonals...).
> >
> > Anyways, once I have that, locs becomes the locations of the beginnings
> > of your start tags (left column) in txt and of the beginnings of your end
> > tags (right column) in txt.
> >
> > Except, of course, end tags are not just defined by their text, but also
> > by their location.  (Which makes me wonder if perhaps this code should
> > instead be structured to just have a single "end" delimiter instead of
> > trying to pretend that it's reasonable to have different ending tags for
> > different starting tags.)
> >
> > Finally, I should have replaced =: with =. in the function definition
> > before releasing it.  (=: makes some kinds of debugging easier but can
> > cause long-run problems).
> >
> >
> > getTagsContents=: 4 :0
> >  'n m'=. $tags=. > _2 <\ y
> >   locs=.  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> >  locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
> >   NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
> >   data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> >  expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> >   }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > )
> >
> >
> > --
> > Raul
> >
> >
> >
> >
> > On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]>
> > wrote:
> >
> > > > I think the problem we are having is that the closing crlftb tag string
> > > > appears in many other places in the file, besides as a closing tag for
> > > > the opening tags. There are many more crlftb strings in the text than
> > > > there are opening tag strings.
> > >
> > > > So the correct statement is that the opening tags are unique, and will
> > > > always start a required text string. Closing tags are not necessarily
> > > > unique, and will close the required strings, as well as appear in other
> > > > places in the file, which can be ignored. Only the first closing tag
> > > > string that appears in the text following a unique opening tag is valid
> > > > as the terminating tag for the text to be extracted.
> > >
> > > > The function should find the opening tag, and then capture all of the
> > > > text up to the first occurrence of the crlftb closing tag. It should
> > > > ignore all subsequent crlftb tags until after it finds a unique opening
> > > > tag, then again capture all of the text up to the first tag end string,
> > > > which in this case will again be the crlftb string.
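
A minimal sketch of that rule for a single tag pair, written as a plain explicit verb. This is not how getTagsContents works internally; firstBetween and the sample string are hypothetical, introduced only to make the "first closing tag after the opening tag" requirement concrete.

```j
NB. firstBetween is a hypothetical helper, not part of getTagsContents.
NB. x is the text, y is (start tag ; end tag).
firstBetween=: 4 : 0
  'st en'=. y
  i=. (st E. x) i. 1              NB. position of the first start tag, or #x if absent
  if. i = #x do. '' return. end.  NB. a missing start tag gives an empty result
  rest=. (i + #st) }. x           NB. the text after the start tag
  ((en E. rest) i. 1) {. rest     NB. keep everything before the first end tag
)

   < 'xx crlftb param1 = 12345 crlftb yy crlftb' firstBetween 'param1';'crlftb'
┌─────────┐
│ = 12345 │
└─────────┘
   # 'no such tag here crlftb' firstBetween 'param1';'crlftb'
0
```

The crlftb before param1 and the one after yy are both ignored, which is the behaviour described above. Building the full table with blanks for missing tags still needs the ordering logic on top of this.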
> > >
> > > It looks like the problem is in this line of the function:
> > >  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> > >
> > > But I haven't gotten my head around all that it is doing as yet.
> > >
> > > Here's the whole function, with the assert line commented out.
> > >
> > > getTagsContents=: 4 :0
> > >  'n m'=. $tags=. > _2 <\ y
> > >  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
> > >  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
> > >  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> > >  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> > >  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > > )
> > >
> > > Skip
> > >
> > > On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]>
> > > wrote:
> > >
> > > > > Yes. As I said in my previous post, the assert statement has been
> > > > > commented out. It would throw an error if the assert wasn't
> > > > > commented out.
> > > > > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> > > >
> > > >> Did you try just removing the assert?
> > > >>
> > > >> Thanks,
> > > >>
> > > >> --
> > > >> Raul
> > > >>
> > > > >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> > > >>
> > > >>> I stated that wrong.
> > > >>>
> > > >>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > > >>> ┌┬──────────────────────────────────────────────────────────┐
> > > >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > > >>> └┴──────────────────────────────────────────────────────────┘
> > > >>>
> > > > >>> It doesn't find the first tag pair, and for some reason, it
> > > > >>> captures *part of* the string *following* the first tag pair.
> > > >>>
> > > >>> Skip
> > > >>>
> > > > >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]>
> > > > >>> wrote:
> > > >>>
> > > >>> > Raul,
> > > >>> >
> > > > >>> > In my application, the tag pairs will never overlap. Also, the
> > > > >>> > leading tag of a particular string will always be unique.
> > > > >>> > However, it is handy if I can define just the trailing tags of
> > > > >>> > any tag pair to be all the same string. This won't always be the
> > > > >>> > case, as sometimes the closing tag may be unique, so either case
> > > > >>> > should work. Here's an example of some real data in my text log
> > > > >>> > file. I just pulled a section of the text out of the middle of
> > > > >>> > the log:
> > > >>> >
> > > >>> >    ww2
> > > >>> >
> > > >>> >     PROMPT_DURATION           = 1.968
> > > >>> >     STATUS                    = RECOGNITION
> > > > >>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> > > > >>> >     NUM_RESULTS               = 1
> > > > >>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> > > >>> >     CONFIDENCE[0]
> > > >>> >
> > > >>> > Let's look at the data:
> > > >>> >    $ ww2
> > > >>> > 260
> > > >>> >    q: 260
> > > >>> > 2 2 5 13
> > > >>> >
> > > > >>> > So 2*2*5 = 20 and ww2 will fit in a 13 x 20 array (13 rows of 20)
> > > >>> >
> > > >>> >     13 20  $ a. i. ww2
> > > >>> >  10   9  80  82  79  77 80  84  95  68  85  82  65  84  73  79
>  78
> > >  32
> > > >>> > 32  32
> > > >>> >  32  32  32  32  32  32 32  32  61  32  49  46  57  54  56  13
>  10
> > > 9
> > > >>> > 83  84
> > > >>> >  65  84  85  83  32  32 32  32  32  32  32  32  32  32  32  32
>  32
> > >  32
> > > >>> > 32  32
> > > >>> >  32  32  32  32  61  32 82  69  67  79  71  78  73  84  73  79
>  78
> > >  13
> > > >>> > 10   9
> > > >>> >  83  69  82  86  69  82 95  72  79  83  84  78  65  77  69  32
>  32
> > >  32
> > > >>> > 32  32
> > > >>> >  32  32  32  32  32  32 61  32  85 110 107 110 111 119 110  32
>  60
> > >  99
> > > >>> 111
> > > >>> > 110
> > > >>> > 110 101  99 116 101 100 32 118 105  97  32 114 101 115 111 117
> 114
> > >  99
> > > >>> > 101  32
> > > >>> > 109 103 114  62  13  10  9  78  85  77  95  82  69  83  85  76
>  84
> > >  83
> > > >>> > 32  32
> > > >>> >  32  32  32  32  32  32 32  32  32  32  32  32  32  61  32  49
>  13
> > >  10
> > > >>> > 9  82
> > > >>> >  69  83  85  76  84  91 48  93  32  32  32  32  32  32  32  32
>  32
> > >  32
> > > >>> > 32  32
> > > >>> >  32  32  32  32  32  61 32 100 116 109 102  45  57  32 100 116
> 109
> > > 102
> > > >>> > 45  48
> > > >>> >  32 100 116 109 102  45 52  32 100 116 109 102  45  55  32 100
> 116
> > > 109
> > > >>> > 102  45
> > > >>> >  50  13  10   9  67  79 78  70  73  68  69  78  67  69  91  48
>  93
> > >  32
> > > >>> > 32  32
> > > >>> >
> > > > >>> > You can see that each line is terminated with a 13, 10, 9
> > > > >>> > character string (CR, LF, TAB).
> > > >>> >
> > > >>> > we check:
> > > >>> >
> > > >>> >    a. i. crlftb
> > > >>> > 13 10 9
> > > >>> >
> > > > >>> > So the crlftb noun contains the three characters that terminate
> > > > >>> > each line.
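
A quick sketch of how such a delimiter noun can be built and checked. The definition of crlftb is not shown anywhere in this thread, so the first line is an assumption; CRLF and TAB are the standard library's names for those characters.

```j
   crlftb=: 13 10 9 { a.     NB. assumed definition: CR, LF, TAB
   a. i. crlftb
13 10 9
   crlftb -: CRLF,TAB
1
   # crlftb
3
```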
> > > >>> >
> > > > >>> > I want to capture the row starting with 'STATUS' and the row
> > > > >>> > starting with 'RESULT[0]'.
> > > > >>> > Both rows terminate with the carriage return, line feed, tab
> > > > >>> > sequence.
> > > > >>> > I have commented the assert. statement out of your
> > > > >>> > getTagsContents, so I no longer get the error.
> > > >>> >
> > > >>> > Now I run the function:
> > > >>> >
> > > >>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > > >>> > ┌┬──────────────────────────────────────────────────────────┐
> > > >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > > >>> > └┴──────────────────────────────────────────────────────────┘
> > > >>> >
> > > > >>> > Weird. I get the line AFTER the one I want, and it completely
> > > > >>> > misses the second tag pair.
> > > >>> > Any ideas what is going on?
> > > >>> >
> > > >>> > Skip
> > > >>> >
> > > >>> >
> > > >>> >
> > > > >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]>
> > > > >>> > wrote:
> > > >>> >
> > > > >>> >> P.S. Brian Schott has observed that if you take out the
> > > > >>> >> assertion that is checking for balanced tags, the code works on
> > > > >>> >> the crlftab example.
> > > > >>> >>
> > > > >>> >> This is because my code also has an assumption that tags cannot
> > > > >>> >> overlap.
> > > > >>> >>
> > > > >>> >> I have not thought through what this would mean for other cases
> > > > >>> >> that would currently be rejected.
> > > >>>
> > > >>>
> > > ----------------------------------------------------------------------
> > > For information about J forums see http://www.jsoftware.com/forums.htm
> > >
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm
> >
>
>
>
> --
> Skip Cave
> Cave Consulting LLC
> Phone: 214-460-4861