Re: [Jprogramming] Finding multiple sequential strings

Raul Miller Mon, 14 Nov 2011 13:47:50 -0800

Ok... I hope I am not overlooking anything here:

getTagsContents=: 4 :0
  'n m'=. $tags=. > _2 <\ y
  locs=:  tags [email protected]:0 }. txt=.(' ',;tags),x,;tags
  assert. -:&/:&;/ |:locs  NB. tags must be balanced
  data=: _2 {:\  ((/:~ ; locs) I. i.#txt) </.  txt
  expand=: ;(#~ 1&e.S:0) <@|./. |.> (e.L:0~ /:~@;) {."1 locs
  }: }.(#@>{."1 tags) }.&.>"1 (-n) ]\ expand #inv (+/expand){.data
)


If you can guarantee that tags are always balanced, you can get rid of
the assert statement.

How this works:

1. tagged data blocks are extracted from the text (in their original order)
2. expand is defined to be the compression vector on the ravel of the
desired result, to get those blocks
3. expand #inv data gets the blocks we need
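
Steps 2 and 3 can be sketched on Skip's textfile2 example. This is only an illustration, not the tacit code above: maskFor is a hypothetical helper, and its right argument (the tag-pair index of each block, in text order) is written out by hand here rather than derived from the text.

```j
NB. Hypothetical helper, not part of getTagsContents.
NB. x: number of tag pairs (columns of the result)
NB. y: tag-pair index of each extracted block, in text order
NB. Result: boolean mask over the ravel of the result table.
maskFor=: 4 : 0
  pos=. i.0
  prev=. _1
  for_t. y do.
    NB. next free cell after prev whose column is t
    prev=. (prev+1) + x | t - prev+1
    pos=. pos , prev
  end.
  NB. pad the mask out to a whole number of rows
  1 pos} (x * >. x %~ >: prev) $ 0
)

NB. Blocks of textfile2 in their original order (assumed already extracted).
data=: ;: 'stuff2 stuff4 stuff6 stuff8 stuff10 stuff12'

NB. tag1 tag2 tag1 tag1 tag2 tag2, as they occur in textfile2
expand=: 2 maskFor 0 1 0 0 1 1

NB. #inv is expand: each 0 in the mask becomes an empty box.
ftxt4=: _2 ]\ expand #inv data
```

Under these assumptions ftxt4 should have the 4 by 2 layout of Skip's mockup, with empty boxes where a tag pair is missing.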

Everything else is busywork to convert between data formats and
representations.

To make my work easier, I make sure that:

a. There is always text to be discarded before the first tag
b. The full set of tags appears at the beginning of the text I am working
with
c. The full set of tags appears at the end of the text I am working with

(these boundary guards are discarded from the final result).

Note: I hope that this is readable -- for some reason gmail has recently
taken to mutilating line-ends on plain text messages, so I do not know how
I can send plain text code.

FYI,

-- 
Raul

On Mon, Nov 14, 2011 at 3:04 PM, Skip Cave <[email protected]> wrote:
>  Raul Miller <[email protected]> wrote:
> Sorry about that, I should stop posting so carelessly:
>
>  getTagsContents=: getTagContents~S:0 1    <\~&_2
>
> <<>>
>
> Skip replies:
>
> OK, that is getting better.
>
>  getTagsContents=: getTagContents~S:0 1    <\~&_2
>
>    ftxt2 =: textfile1 getTagsContents 'tag1s';'tag1e';'tag2s';'tag2e'
>   $ ftxt2
> 2 3
>   ftxt2
> ┌────────────────────────────────┬──────────────────────────┬───────────────┐
> │ good stuff that I want to keep │ even more stuff I want   │ stuff to keep │
> ├────────────────────────────────┼──────────────────────────┼───────────────┤
> │ more good stuff that I need    │ really really good stuff │               │
> └────────────────────────────────┴──────────────────────────┴───────────────┘
>
> So the middle row of untagged text is gone, which is a big step in the
> right direction. We have captured all the tag pairs, and there is an empty
> box telling me that there is a missing tag pair. Now the transpose of ftxt2
> is getting really close:
>
>  |: ftxt2
> ┌────────────────────────────────┬─────────────────────────────┐
> │ good stuff that I want to keep │ more good stuff that I need │
> ├────────────────────────────────┼─────────────────────────────┤
> │ even more stuff I want         │ really really good stuff    │
> ├────────────────────────────────┼─────────────────────────────┤
> │ stuff to keep                  │                             │
> └────────────────────────────────┴─────────────────────────────┘
>
> Each column represents one tag pair. However, the sequence of the extracted
> strings is still out of order.
>
> As I stated in my previous post, the ravel order of the boxes in the
> result of getTagsContents must be in the order that the strings were in the
> original text.
>
> so
>
>    ,ftxt2
>
> ┌────────────────────────────────┬────────────────────────┬───────────────┬─────────────────────────────┬──────────────────────────┬┐
> │ good stuff that I want to keep │ even more stuff I want │ stuff to keep │ more good stuff that I need │ really really good stuff ││
> └────────────────────────────────┴────────────────────────┴───────────────┴─────────────────────────────┴──────────────────────────┴┘
>  This is way out of order from the original text.
>
> we try
>
>    , |: ftxt2
>
> ┌────────────────────────────────┬─────────────────────────────┬────────────────────────┬──────────────────────────┬───────────────┬┐
> │ good stuff that I want to keep │ more good stuff that I need │ even more stuff I want │ really really good stuff │ stuff to keep ││
> └────────────────────────────────┴─────────────────────────────┴────────────────────────┴──────────────────────────┴───────────────┴┘
>
> This is much better, but the empty box should be the third-to-last box in
> the raveled list, not the last box, to keep the strings in the same order
> as they were in the original text. The missing tag2 pair comes just before
> the last tag1 and tag2 pairs.
>
> I tried several schemes to get the transpose in the getTagsContents
> function, but none worked:
>
> getTagsContents=:  |: getTagContents~S:0 1    <\~&_2
>     ftxt3 =: textfile1 getTagsContents 'tag1s';'tag1e';'tag2s';'tag2e'
> |domain error: getTagsContents
> |   ftxt3=:textfile1     getTagsContents'tag1s';'tag1e';'tag2s';'tag2e'
>
>
> In any case, the approach you are using doesn't keep the string sequence,
> including the missing strings, in the same order as the original text. It
> looks like your function finds all the tag1 strings first, and puts them in
> a boxed array. Then it finds all the tag2 strings, and adds them to the
> array, filling out mismatched quantities with empty boxes. Your method
> finds all the tagged strings, but it loses the sequence order of the
> strings.
>
> In my text files, every tag1 string should be followed by a tag2 string, in
> the order 1, 2, 1, 2, etc. If either a tag1 string or a tag2 string is
> missing from that sequence, that missing string should be represented by an
> empty box, placed in the exact position where it would have occurred in the
> original text. This type of function is common when parsing log files
> (which is what I am doing), as missing lines in the log are important to
> know, but their sequential position in relation to the other tagged
> strings is just as important.
>
> Here's a simpler, but more thorough, test example:
>
>      textfile2=: 0 : 0
> stuff1
> tag1s stuff2 tag1e
> stuff3
> tag2s stuff4 tag2e
> stuff5
> tag1s stuff6 tag1e
> stuff7
> tag1s stuff8 tag1e
> stuff9
> tag2s stuff10 tag2e
> stuff11
> tag2s stuff12 tag2e
> )
>
> Here is an example of how the function should work (this is just a mockup,
> as I haven't gotten the function working as yet):
>
>    ftxt4 =: textfile2 NewgetTagsContents 'tag1s';'tag1e';'tag2s';'tag2e'
>
>   $ ftxt4
> 4 2
>
>   ftxt4
> ┌──────┬───────┐
> │stuff2│stuff4 │
> ├──────┼───────┤
> │stuff6│       │
> ├──────┼───────┤
> │stuff8│stuff10│
> ├──────┼───────┤
> │      │stuff12│
> └──────┴───────┘
>
> The first column represents all the tag1 pairs, and the second column
> represents all the tag2 pairs. The empty box in row two indicates that
> there was no tag2 pair following the second tag1 pair. Also, the empty box
> in the first column of the last row indicates that there was a missing tag1
> pair before the last tag2 pair.
>
> The ravel of ftxt4 shows all the strings in their original order,
> 1,2,1,2,1,2, with the empty boxes showing the missing pairs in sequence:
>
>   , ftxt4
> ┌──────┬──────┬──────┬┬──────┬───────┬┬───────┐
> │stuff2│stuff4│stuff6││stuff8│stuff10││stuff12│
> └──────┴──────┴──────┴┴──────┴───────┴┴───────┘
>
> This is the function I'm trying to develop. Ideally the function would
> scale to N tag pairs, creating an N by 2  boxed array, with the tag pairs
> always in the same sequence.
>
> Skip
>
> On Mon, Nov 14, 2011 at 11:56 AM, Raul Miller <[email protected]> wrote:
>
>> Sorry about that, I should stop posting so carelessly:
>>
>>  getTagsContents=: getTagContents~S:0 1    <\~&_2
>>
>> --
>> Raul
>>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
