Re: [Jprogramming] Finding multiple sequential strings

Skip Cave Fri, 18 Nov 2011 22:16:48 -0800

OK, here's Raul's function:

getTagsContents=: 4 :0
 'n m'=. $tags=. > _2 <\ y
 locs=.  tags I.@E.S:0 }. txt=.(' ',;tags),x,;tags
 locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
 NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
 data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
 expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
 }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)

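For anyone following along, the locs line is the part that finds where
each tag starts in the text. A minimal sketch of that kind of search on a
made-up string (not the real log), using E. to mark where a pattern starts
and I. to turn the marks into indices:

   'crlftb' E. 'ab crlftb cd crlftb'
0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
   'crlftb' I.@E. 'ab crlftb cd crlftb'
3 13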

Here's my data:

  textfile3=: 0 : 0
Start of event 1 crlftb
Some text crlftb
Some more text crlftb
param1    =  12345 crlftb
param2    =   NONE crlftb
some comments crlftb
param3   =   hello world crlftb
more comments crlftb
param4  =  120.45 crlftb
param6 = Test y crlftb
End of stuff crlftb
Start of event 2 crlftb
Some text crlftb
Some different text crlftb
param1  = 34567 crlftb
param3  = hello bob crlftb
param4  = 32.89 crlftb
comments and more comments crlftb
param5   - zero one crlftb
param 6  = Test z crlftb
Second end crlftb
Start of event 3 crlftb
param5 = two three crlftb
Start of event 4 crlftb
stuff crlftb
param1 = 6789 crlftb
param2 = SOME crlftb
end crlftb
)

   $ textfile3
642

I put the crlftb text on the end of each line, to show the equivalent of
the invisible characters that are actually in the real data.
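
In the real log those three invisible characters are CR, LF and TAB. A
sketch of how such a noun could be defined, assuming it is built from the
character codes 13 10 9 (the definition itself is not shown in this thread,
only the result of a. i. crlftb further down):

   crlftb =: 13 10 9 { a.
   a. i. crlftb
13 10 9
   # crlftb
3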

Here is the string of tag pairs:

   tags1 =: ('param1'; 'crlftb' ; 'param2'; 'crlftb' ; 'param3' ; 'crlftb' ; 'param5' ; 'crlftb' )
   tags1
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│param1│crlftb│param2│crlftb│param3│crlftb│param5│crlftb│
└──────┴──────┴──────┴──────┴──────┴──────┴──────┴──────┘
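
The first line of the function, 'n m'=. $tags=. > _2 <\ y, groups a flat
list like this into start/end pairs. A quick sketch of what that expression
does to tags1 as defined above:

   > _2 <\ tags1
┌──────┬──────┐
│param1│crlftb│
├──────┼──────┤
│param2│crlftb│
├──────┼──────┤
│param3│crlftb│
├──────┼──────┤
│param5│crlftb│
└──────┴──────┘
   $ > _2 <\ tags1
4 2

So for this list n would be 4 (the number of tag pairs) and m would be 2.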

So now we try Raul's function:

  txt5 =:  tags1 getTagsContents   textfile3
|domain error: getTagsContents
|   locs=.tags I.@E.S:0}.txt=.(' ',;tags),x    ,;tags

Nope. What we should get is:

┌─────┬────┬───────────┬─────────┐
│12345│NONE│hello world│         │
├─────┼────┼───────────┼─────────┤
│34567│    │hello bob  │zero one │
├─────┼────┼───────────┼─────────┤
│     │    │           │two three│
├─────┼────┼───────────┼─────────┤
│6789 │SOME│           │         │
└─────┴────┴───────────┴─────────┘

I should get a 4 x 4 boxed array:
The first row has boxes containing 12345, NONE, hello world, and an empty
box.
The second row has boxes containing 34567, an empty box, hello bob, and
zero one.
The third row has three empty boxes, and a fourth box containing
'two three'.
The fourth row has a box with 6789, a box with SOME, and two empty boxes.
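
One thing that may be worth checking against the earlier messages quoted
below: there the text was the left argument and the tag list was the right
argument (ww2 getTagsContents 'STATUS';crlf;...), and the function body
takes the tags from y and the text from x. The call above has them the
other way round, which would be consistent with a domain error at the point
where the boxed x is appended to character data. A sketch of the call in
the earlier argument order, untested here, so no result is shown:

   txt5 =: textfile3 getTagsContents tags1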

Skip


On Fri, Nov 18, 2011 at 12:10 PM, Raul Miller <[email protected]> wrote:

> By the way:
>
>  locs=:  tags I.@E.S:0 }. txt=.(' ',;tags),x,;tags
>
> This is defining two things:
>   txt
>   locs
>
> txt is the text supplied in x, but preceded and followed by an empty set of
> tags, and also ensuring that the very first character is not the start of a
> tag.
>
> The space may be unnecessary -- the only reason I am adding the space on
> the front is so that I know whether or not the first block of text
> delimited by tags is part of a tag or not.  (But I would have to re-think
> the rest of the code to see if it could work based on the idea that the
> start of the text is always a tag.)
>
> Meanwhile, I prepend and append an empty set of tags to avoid issues with
> how I am identifying "sequences of tags".  One of your requirements was
> that tags always appear in order, and if any are missing from that order we
> get a blank result in the corresponding result position.  And I am using
> J's diagonal function to find the subsequences.  And to ensure that found
> subsequences are always complete, I start and end the text with complete
> subsequences (otherwise, with certain data, I might
> have incomplete diagonals...).
>
> Anyways, once I have that, locs becomes the locations of the beginnings of
> your start tags (left column) in txt and of the beginnings of your end tags
> (right column) in txt.
>
> Except, of course, end tags are not just defined by their text, but also by
> their location.  (Which makes me wonder if perhaps this code should instead
> be structured to just have a single "end" delimiter instead of trying to
> pretend that it's reasonable to have different ending tags for different
> starting tags.)
>
> Finally, I should have replaced =: with =. in the function definition before
> releasing it.  (=: makes some kinds of debugging easier but can cause
> long-run problems).
>
>
> getTagsContents=: 4 :0
>  'n m'=. $tags=. > _2 <\ y
>   locs=.  tags I.@E.S:0 }. txt=.(' ',;tags),x,;tags
>  locs=. (-@#@[ {. I. {./. ])&.>/\"1 locs
>   NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
>   data=. _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
>  expand=. ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
>   }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> )
>
>
> --
> Raul
>
>
>
>
> On Thu, Nov 17, 2011 at 6:54 PM, Skip Cave <[email protected]>
> wrote:
>
> > I think the problem we are having is that the closing crlftb tag string
> > appears in many other places in the file, besides as a closing tag for the
> > opening tags. There are many more crlftb strings in the text than there are
> > opening tag strings.
> >
> > So the correct statement is that the opening tags are unique, and will
> > always start a required text string. Closing tags are not necessarily
> > unique, and will close the required strings, as well as appear in other
> > places in the file, which can be ignored. Only the first closing tag string
> > that appears in the text following a unique opening tag is valid as the
> > terminating tag for the text to be extracted.
> >
> > The function should find the opening tag, and then capture all of the text
> > up to the first occurrence of the crlftb closing tag. It should ignore all
> > subsequent crlftb tags until after it finds a unique opening tag, then
> > again capture all of the text up to the first tag end string, which in this
> > case will again be the crlftb string.
> >
> > It looks like the problem is in this line of the function:
> >  locs=:  tags I.@E.S:0 }. txt=.(' ',;tags),x,;tags
> >
> > But I haven't gotten my head around all that it is doing as yet.
> >
> > Here's the whole function, with the assert line commented out.
> >
> > getTagsContents=: 4 :0
> >  'n m'=. $tags=. > _2 <\ y
> >  locs=:  tags I.@E.S:0 }. txt=.(' ',;tags),x,;tags
> >  NB. assert. -:&/:&;/ |:locs  NB. tags must be balanced
> >  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
> >  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
> >  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
> > )
> >
> > Skip
> >
> > On Thu, Nov 17, 2011 at 1:33 PM, Skip Cave <[email protected]>
> > wrote:
> >
> > > Yes. As I said in my previous post, the assert statement has been
> > > commented out. It would throw an error if the assert weren't commented out.
> > > On Nov 17, 2011 12:00 PM, "Raul Miller" <[email protected]> wrote:
> > >
> > >> Did you try just removing the assert?
> > >>
> > >> Thanks,
> > >>
> > >> --
> > >> Raul
> > >>
> > >> On Thu, Nov 17, 2011 at 11:24 AM, Skip Cave <[email protected]> wrote:
> > >>
> > >>> I stated that wrong.
> > >>>
> > >>>    (  ww2) getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > >>> ┌┬──────────────────────────────────────────────────────────┐
> > >>> ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > >>> └┴──────────────────────────────────────────────────────────┘
> > >>>
> > >>> It doesn't find the first tag pair, and for some reason, it captures
> > >>> *part of* the string *following* the first tag pair.
> > >>>
> > >>> Skip
> > >>>
> > >>> On Thu, Nov 17, 2011 at 10:08 AM, Skip Cave <[email protected]> wrote:
> > >>>
> > >>> > Raul,
> > >>> >
> > >>> > In my application, the tag pairs will never overlap. Also, the leading tag
> > >>> > of a particular string will always be unique. However, it is handy if I can
> > >>> > define just the trailing tags of any tag pair to be all the same string.
> > >>> > This won't always be the case, as sometimes the closing tag may be unique,
> > >>> > so either case should work. Here's an example of some real data in my text
> > >>> > log file. I just pulled a section of the text out of the middle of the log:
> > >>> >
> > >>> >    ww2
> > >>> >
> > >>> >     PROMPT_DURATION           = 1.968
> > >>> >     STATUS                    = RECOGNITION
> > >>> >     SERVER_HOSTNAME           = Unknown <connected via resource mgr>
> > >>> >     NUM_RESULTS               = 1
> > >>> >     RESULT[0]                 = dtmf-9 dtmf-0 dtmf-4 dtmf-7 dtmf-2
> > >>> >     CONFIDENCE[0]
> > >>> >
> > >>> > Let's look at the data:
> > >>> >    $ ww2
> > >>> > 260
> > >>> >    q: 260
> > >>> > 2 2 5 13
> > >>> >
> > >>> > So 2*2*5 = 20 and 20*13 = 260, so ww2 will fit in a 13 x 20 array
> > >>> >
> > >>> >     13 20  $ a. i. ww2
> > >>> >  10   9  80  82  79  77  80  84  95  68  85  82  65  84  73  79  78  32  32  32
> > >>> >  32  32  32  32  32  32  32  32  61  32  49  46  57  54  56  13  10   9  83  84
> > >>> >  65  84  85  83  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32  32
> > >>> >  32  32  32  32  61  32  82  69  67  79  71  78  73  84  73  79  78  13  10   9
> > >>> >  83  69  82  86  69  82  95  72  79  83  84  78  65  77  69  32  32  32  32  32
> > >>> >  32  32  32  32  32  32  61  32  85 110 107 110 111 119 110  32  60  99 111 110
> > >>> > 110 101  99 116 101 100  32 118 105  97  32 114 101 115 111 117 114  99 101  32
> > >>> > 109 103 114  62  13  10   9  78  85  77  95  82  69  83  85  76  84  83  32  32
> > >>> >  32  32  32  32  32  32  32  32  32  32  32  32  32  61  32  49  13  10   9  82
> > >>> >  69  83  85  76  84  91  48  93  32  32  32  32  32  32  32  32  32  32  32  32
> > >>> >  32  32  32  32  32  61  32 100 116 109 102  45  57  32 100 116 109 102  45  48
> > >>> >  32 100 116 109 102  45  52  32 100 116 109 102  45  55  32 100 116 109 102  45
> > >>> >  50  13  10   9  67  79  78  70  73  68  69  78  67  69  91  48  93  32  32  32
> > >>> >
> > >>> > You can see that each line is terminated with a 13, 10, 9 character string
> > >>> > (CR, LF, TAB).
> > >>> >
> > >>> > We check:
> > >>> >
> > >>> >    a. i. crlftb
> > >>> > 13 10 9
> > >>> >
> > >>> > So the crlftb noun contains the three characters that terminate each
> > >>> > line.
> > >>> >
> > >>> > I want to capture the row starting with 'STATUS' and the row starting with
> > >>> > 'RESULT[0]'.
> > >>> > Both rows terminate with the carriage return, line feed, tab sequence.
> > >>> > I have commented the assert. statement out of your getTagsContents, so I no
> > >>> > longer get the error.
> > >>> >
> > >>> > Now I run the function:
> > >>> >
> > >>> >     ww2 getTagsContents 'STATUS';crlf;'RESULT[0]';crlf
> > >>> > ┌┬──────────────────────────────────────────────────────────┐
> > >>> > ││_HOSTNAME           = Unknown <connected via resource mgr>│
> > >>> > └┴──────────────────────────────────────────────────────────┘
> > >>> >
> > >>> > Weird. I get the line AFTER the one I want, and it completely misses the
> > >>> > second tag pair.
> > >>> > Any ideas what is going on?
> > >>> >
> > >>> > Skip
> > >>> >
> > >>> >
> > >>> >
> > >>> > On Thu, Nov 17, 2011 at 8:17 AM, Raul Miller <[email protected]> wrote:
> > >>> >
> > >>> >> P.S. Brian Schott has observed that if you take out the assertion that is
> > >>> >> checking for balanced tags, that works on the crlftab example.
> > >>> >>
> > >>> >> This is because my code also has an assumption that tags cannot overlap.
> > >>> >>
> > >>> >> I have not thought through what this would mean for other cases that would
> > >>> >> currently be rejected.
> > >>>
> > >>>
> > ----------------------------------------------------------------------
> > For information about J forums see http://www.jsoftware.com/forums.htm



-- 
Skip Cave
Cave Consulting LLC
Phone: 214-460-4861
