Part 1. O(n)? really?

I am dubious about claims of the form "this problem can be solved in
linear time". I feel that big-O calculations are often stated with
insufficient rigor.

For example, consider some code which loops over a string, counting
how many times a particular letter appears in that string. It's common
to claim that this can be done in linear time. But typical
interpretations of the meaning of this problem make it O(n*log n) or
O(1).

The count produced by this algorithm requires log n bits to
represent, and there are up to n operations on the counter which
accumulates it. So, in the worst case, the time needed to perform the
calculations is proportional to n*log n.
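To make that reading concrete, here is the loop in question as a
sketch (Python, purely for illustration; the names are mine):

```python
# Count occurrences of one character in a string.  Python integers are
# arbitrary-precision, so in principle 'count' grows to about log2(n)
# bits, and each of the up to n increments can touch that many bits --
# which is the n*log n reading of the cost.
def count_char(s, ch):
    count = 0
    for c in s:
        if c == ch:
            count += 1
    return count

print(count_char('bonbon', 'n'))  # 2
```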

On the other hand, if we use fixed-width numbers to represent the
count produced by the algorithm, there's a constant bound on the time
needed to perform the calculations, so it's O(1). And something
similar happens with the index into the string which we use to select
the character we are inspecting.

But if we define our hardware "properly" we can do this in linear
time. To do this, we define both the input and the output as "tape".
The "counting" operation consists of advancing the output tape. Now we
have linear time to scan the string, and linear time to build the
count. But we also need linear time to consume the count - there's no
motivation here for this algorithm to exist independently of its
consumer.
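A sketch of that tape model (Python again, purely for illustration;
the "tape" is just a list):

```python
# Tape-model sketch: keep the count in unary, one output cell per hit.
# "Counting" is then just advancing the output tape, a constant-time
# step, and the whole scan is linear -- but the consumer also needs
# linear time to read the unary result back off the tape.
def count_unary(s, ch):
    tape = []                  # output tape; its length is the count
    for c in s:                # one linear scan of the input tape
        if c == ch:
            tape.append(1)     # advance the output tape one cell
    return tape

print(len(count_unary('bonbon', 'n')))  # 2
```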

Anyway, if this is what is meant by "linear time", it's not
particularly relevant to modern machines. (Though the "tape model"
still does have some relevance - in some sense it approximates the
time needed for random access given the size of the data set. We have
architectures with registers, cache lines, L1 and L2 cache, main
memory, mass storage, and cloud storage - the bigger the data set the
longer it takes to randomly access data, and "tape" in some sense
approximates this concept.)

Recap: I am dubious about the rigor of this "linear time" claim - I
think a rigorous exposition of it would have to spell out assumptions
which imply a machine with time characteristics different from those
the typical programmer would expect. (But I would still be interested
in reading about the algorithm.)


Part 2. Counting Squares

Problems like this, where we are simultaneously identifying unknown
subsequences and the sequential relationships between them, are
outside my normal realm of activity. So there are probably better
ways of doing this.

For example, everything in the for loop, where I'm appending to r,
could conceptually be a function F, with the for loop replaced by
  ; ({."1 lumps) <@F/. lumps

Also, "lump" and "lumps" are particularly non-descriptive names.
There are probably names in the literature on this subject which
would be more meaningful. I should also probably be defining helper
functions here, rather than bundling everything into one big pile of
code.

Still, this does find the squares, and I've already spent more time
today thinking about this issue than I had planned on...

squarest=:3 :0
  lumps=. /:~ ; y <@(}: ,.~ 1 -~ 2 -~/\ ])/. i.#y
  r=. i.0 0
  for_box.   ({."1 lumps) </. lumps do.
    lump=. >box
    len=. (<0 0) { lump
    keep=. (len#1) E. 0,~2 -~/\ 1 {"1 lump
    r=. r,keep#lump
  end.
  ~.({:"1 r) {&y@(+ i.)&.> 2*1+{."1 r
)
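For cross-checking squarest on small cases, a naive brute force is
straightforward (Python, purely for illustration - roughly O(n^3),
and not the approach above):

```python
# Collect every distinct square uu occurring as a substring of s,
# by trying every half-length and every starting position.
def distinct_squares(s):
    found = set()
    n = len(s)
    for half in range(1, n // 2 + 1):
        for i in range(n - 2 * half + 1):
            if s[i:i + half] == s[i + half:i + 2 * half]:
                found.add(s[i:i + 2 * half])
    return found

print(sorted(distinct_squares('aababaabb')))
# ['aa', 'abab', 'baba', 'bb'] -- the 4 distinct squares in Mike's example
```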

FYI,

-- 
Raul


On Tue, Jun 4, 2013 at 5:52 AM, Mike Müller <[email protected]> wrote:
> I asked about this problem in #jsoftware on Freenode, where it was mentioned 
> to me that this might be of broader interest,
> so I will repeat my question here and hope to get some others interested in
> this topic.
>
> I am interested in finding squares in strings. A square is a string of the 
> form uu, for some string u.
> For instance, 'bonbon' is a square. My first problem would be to check if a 
> string is square-free,
> which means that no factor (substring) of the string is a square.
> For example, 'square' is square-free, but 'square-free' is not (as the square 
> 'ee' is a factor thereof)
>
> As I'm just learning J, I would be interested in "nice" ways to do this in J, 
> not necessarily the fastest ones.
> This problem can be solved in linear time, see for instance [1], but I'm more 
> interested in how to do it "the J way".
> As a starter, I can check if a string s is a square like that: 3 <: +/ s E. 
> s,s
>
> Another problem would be to calculate the number of distinct squares in a 
> string.
> I think this is also best explained by an example. Let s =. 'aababaabb'.
> Then the distinct squares appearing in s are 'aa, 'bb', 'abab', and 'baba'.
> So the number of distinct squares is 4. Note that the word _distinct_ is 
> important here, as 'aa' occurs twice,
> but we only count it once.
>
> There's also some theoretical motivation for this second problem, as it is 
> shown that this number is at most 2n,
> for any string of length n [2]. It is further conjectured that it is in fact 
> at most n, but this is an open problem,
> and the best bound known so far is 2n - log(n) [3].
>
> I would be interested to see how these problems could be solved nicely in J,
> and hope to learn more of this fascinating language. Thank you!
>
> Mike
>
>
> [1] M. G. Main and R. J. Lorentz. Linear time recognition of squarefree 
> strings. In Combinatorial
> Algorithms on Words, volume F12 of NATO ASI Series, pages 271–278. Springer, 
> 1985.
>
> [2] Fraenkel, A. S., & Simpson, J. (1998). How many squares can a string
> contain?.
> Journal of Combinatorial Theory, Series A, 82(1), 112-120.
>
> [3] Ilie, L. (2007). A note on the number of squares in a word.
> Theoretical Computer Science, 380(3), 373-376.
>
>
>
>
>
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm