This approach has some unmentioned issues that may eventually trip you
up, for example if you try to apply these routines to the text of
Finnegans Wake.

To hint at those issues, here's an approach that takes you directly to
the final result:

   ex1=: <'This is Skip''s test. Testing one, two, three. Count 3, 2, 1.'

DELIM=:'.?!'
toss=:a.#~1-(a.e.DELIM,":i.10)+.(tolower~:toupper) a.
separateclean=:3 :0
  a:-.~(e.&DELIM <@deb;._2 tolower) '.',~(;y) -. toss
)

   separateclean ex1
┌──────────────────┬─────────────────────┬───────────┐
│this is skips test│testing one two three│count 3 2 1│
└──────────────────┴─────────────────────┴───────────┘
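
As a quick sanity check on what toss removes, here is a small sketch.
The two test strings are made up for illustration, and DELIM and toss
are repeated from above only so the snippet runs on its own. I'd expect
the first line to print all zeros (these characters are kept) and the
second all ones (these are tossed):

```j
NB. Definitions repeated from above so this runs standalone.
DELIM=: '.?!'
toss=: a.#~1-(a.e.DELIM,":i.10)+.(tolower~:toupper) a.

NB. Delimiters, space, letters and digits should NOT be in toss.
echo '.?! aZ09' e. toss

NB. Other punctuation (comma, semicolon, colon, quote, dash, double quote)
NB. should be in toss.
echo ',;:''-"' e. toss
```

Note that the space survives only because ":i.10 formats the digits
with spaces between them.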


And here's a longer approach which gets you there in two steps, where
the result of the first step has the same number of boxes (sentences)
as the result of the second step:

separatedirty=:3 :0
  (;:'.')-.~(e.&DELIM <@deb;.2 ]) '.',~;y
)
clean=: tolower@-.&(toss,DELIM) L:0

   separatedirty ex1
┌────────────────────┬────────────────────────┬──────────────┐
│This is Skip's test.│Testing one, two, three.│Count 3, 2, 1.│
└────────────────────┴────────────────────────┴──────────────┘
   clean separatedirty ex1
┌──────────────────┬─────────────────────┬───────────┐
│this is skips test│testing one two three│count 3 2 1│
└──────────────────┴─────────────────────┴───────────┘


But with ill-conditioned text (Finnegans Wake being an example of
that), I expect there will be cases where separateclean gives a
different result from clean@separatedirty.
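
A minimal sketch of one such case, assuming a made-up input (ex4 is
hypothetical, not from the thread) whose middle "sentence" is nothing
but punctuation. The definitions are repeated so the snippet runs on
its own. As I read the code, separateclean strips the dashes first, so
the middle piece collapses to empty and is dropped, while
separatedirty keeps '-- .' as a sentence and clean then reduces it to
a lone space:

```j
NB. Definitions repeated from above so this runs standalone.
DELIM=: '.?!'
toss=: a.#~1-(a.e.DELIM,":i.10)+.(tolower~:toupper) a.
separateclean=: 3 :0
  a:-.~(e.&DELIM <@deb;._2 tolower) '.',~(;y) -. toss
)
separatedirty=: 3 :0
  (;:'.')-.~(e.&DELIM <@deb;.2 ]) '.',~;y
)
clean=: tolower@-.&(toss,DELIM) L:0

NB. Hypothetical input: the middle "sentence" is only punctuation.
ex4=: <'Hello. -- . Bye.'

NB. Box counts from the one-step and two-step routes; I expect 2 3.
echo (# separateclean ex4) , # clean separatedirty ex4
```

Which of the two answers is "right" depends on whether a run of
punctuation should count as a sentence at all.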

But that's what makes text fun...

-- 
Raul


On Wed, Apr 4, 2018 at 12:02 PM, Skip Cave <s...@caveconsulting.com> wrote:
> I have the following boxed data:
>
> ex1=. <'This is Skip''s test. Testing one, two, three. Count 3, 2, 1.'
>
>
> ex1
>
> ┌────────────────────────────────────────────────────────────┐
> │This is Skip's test. Testing one, two, three. Count 3, 2, 1.│
> └────────────────────────────────────────────────────────────┘
>
> I want to build a verb that will separate this boxed text data into
> sentences.
>
>
> ex2=. (<'This is Skip''s test.'),(<'Testing one, two, three.'),(<'Count 3, 2, 1.')
>
> ex2
>
> ┌────────────────────┬────────────────────────┬──────────────┐
> │This is Skip's test.│Testing one, two, three.│Count 3, 2, 1.│
> └────────────────────┴────────────────────────┴──────────────┘
>
> I also want to get rid of all punctuation and caps:
>
> ex3=. (<'this is skips test'),(<'testing one two three'),(<'count 3 2 1')
>
> ex3
>
> ┌──────────────────┬─────────────────────┬───────────┐
> │this is skips test│testing one two three│count 3 2 1│
> └──────────────────┴─────────────────────┴───────────┘
>
> What is a reasonable J verb to do this separation and cleanup?
>
> Skip
>
> Cave Consulting LLC
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm