It's clear that Dan Sumption is not interested in collaborating
with me on finding structure in files, because he thinks files
should never be large enough to *have* internal structure, so
this is my last reply to him in this thread.  I wondered about
just ignoring the message, but maybe someone has some empirical evidence.

On 29/04/16 10:30 PM, Dan Sumption wrote:

        To me, a 1000-line module is a God Class. A 3000-line module
        is a complete disaster.

        Accepted best practice is that a file too big to view on your
        screen is too long. Optimum file size is probably under 30 lines.

    Really?  I've heard that said about *function* size, but not about
    *module* size.


    Really.  *File* size.  One should be able to view, and make sense
    of, an entire file.  On one screen.

Suppose a method is 4.5 lines (a typical Smalltalk average) and a screen
is 45 lines long (which I can get on my screen).  That is 45/4.5 = 10
methods at most.  Allow some lines for declaring the class and its
variables, and you are saying that no class should ever contain more
than 9 methods.

If we allow one extra line per method for a comment, and an extra line
per class for a comment, that is 5.5 lines per method, and you are
saying that no *documented* class should ever contain more than 7 methods.

I've already pointed out that the Python, Erlang, and SML implementations
do not adhere to the tiny-files rule.  So now I'm going to pick an arbitrary
class from C# and count methods.  The class had better be a *non-trivial*
one, something that wouldn't be one line of Haskell, for example.
Let's try System.Drawing.Drawing2D.Matrix.  Excluding inherited things,
    1 constructor
    5 properties
   15 METHODS
for a total of 21 members all up.  Oh wait, 8 of the methods are
overloaded, so there are actually at least 23 methods and 29 members.

Sounds like the C# class library does not follow this rule either.

Well, let's try something that's not a language implementation.  I often
use R to plot graphs.  Maybe I should use Java.  Let's take a look at
JFreeChart.  Oh my gosh, my head hurts.  Object orientation seems to
drive people mad with an urge to reify everything.  One thing you MUST
come to terms with if you are going to plot graphs is axes.
org.jfree.chart.axis.Axis
   15 visible static variables
    1 constructor
   72 methods

Now let's try something I am working on, an SML implementation of Dijkstra's
arrays, as used in "A Discipline of Programming".  To *be* an implementation
of Dijkstra arrays, we need
 - a constructor
 - an indexed getter
 - an indexed setter
 - 3 properties lob, hib, dom
 - 2 properties top, bot
 - 2 adders hiext, loext
 - 4 removers hirem, hipop, lorem, lopop
 - 1 origin shifter
 - 1 element swapper
for a total of 16 functions/methods.
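
As an interface sketch (the operation names are Dijkstra's; the SML
types, and the names I chose for the getter and setter, are my guesses),
that comes out something like this:

    signature DARRAY =
    sig
        type 't darray
        val fromList : int -> 't list -> 't darray   (* constructor *)
        val sub      : 't darray * int -> 't         (* indexed getter *)
        val update   : 't darray * int * 't -> unit  (* indexed setter *)
        val lob      : 't darray -> int              (* lowest index *)
        val hib      : 't darray -> int              (* highest index *)
        val dom      : 't darray -> int              (* number of elements *)
        val bot      : 't darray -> 't               (* element at lob *)
        val top      : 't darray -> 't               (* element at hib *)
        val hiext    : 't darray * 't -> unit        (* add at high end *)
        val loext    : 't darray * 't -> unit        (* add at low end *)
        val hirem    : 't darray -> unit             (* drop at high end *)
        val hipop    : 't darray -> 't               (* drop and return *)
        val lorem    : 't darray -> unit             (* drop at low end *)
        val lopop    : 't darray -> 't               (* drop and return *)
        val shift    : 't darray * int -> unit       (* origin shifter *)
        val swap     : 't darray * int * int -> unit (* element swapper *)
    end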

To make a Dijkstra array act like other collections in the SML Basis Library,
I need at least 23 more functions.  With a header comment, the *interface*
file is already 49 lines, and you are telling me not to write a file that long.
Oops, I was missing one.  50 lines, *interface*.

So far the implementation is 219 lines, and there are 5 functions to go.
Many of the functions *are* one line.  Here's a typical one that isn't:

    val fromList : int -> 't list -> 't darray (* in the interface *)

    fun fromList lob xs =       (* in the implementation *)
        let val dom = length xs
         in DA {
               lob = ref lob,
               dom = ref dom,
               b   = ref 0,    (* no empty part at left *)
               e   = ref dom,  (* no empty part at right *)
               arr = ref (Array.fromList xs)
            }
        end

    Functions should be even smaller, IMO no more than five lines.
    Ideally one.
This function has to return a record with five fields.
I suppose I could write

    fun fromList lob xs = DA { lob = ref lob,
        dom = ref (length xs), b = ref 0, e = ref (length xs),
        arr = ref (Array.fromList xs) }

but I don't think that is more readable.  The idea that squeezing this
very simple function into "no more than five lines, ideally one" would
make it *better* is very hard for me to believe.
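
For what it's worth, even a compromise that binds dom just once lands
exactly on the five-line limit, and only by throwing away the comments:

    fun fromList lob xs =
        let val dom = length xs
         in DA {lob = ref lob, dom = ref dom, b = ref 0,
                e = ref dom, arr = ref (Array.fromList xs)}
        end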

I would be astonished if you could implement a good quality math
library with five-line functions.

    Quoting from my bible for software development, Clean Code -
    http://amzn.to/1VWmzEV

        The first rule of classes is that they should be small.  The
        second rule of classes is that they should be smaller than
        that.  No, we're not going to repeat exactly the same text from
        the Functions chapter.  But as with functions, smaller is the
        primary rule when it comes to designing classes.  As with
        functions, our immediate question is always "How small?"

Sigh.  Is there any empirical evidence that 30-line FILES are a good idea?

    When you work on code, work on a single piece of functionality
    at a time.

This presupposes that "one piece of functionality" is a well defined
concept, and that a piece of *functionality* is never ever ever spread
across two chunks.  Working on one *method* often requires me to work
on many *classes* at a time; that's what polymorphism is all about.

It also seems to presuppose that code has no CONTEXT.
For example, minimal documentation for darray.sml looks like this:

    - what's the file name?
    - when was it last revised?
    - who is responsible for it?
    - This file implements single-index extensible arrays as
    - defined by E.W.Dijkstra in "A Discipline of Programming".
    - It supports all the array operations used in that book.
    - It also supports as much of the Array structure in the
    - Standard ML Basis Library as could be adapted.
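
Written out as an actual header comment (the date and maintainer are
placeholders, everything else is real), that is 8 lines before a single
line of code:

    (* File:    darray.sml
       Revised: <date>
       Author:  <maintainer>
       This file implements single-index extensible arrays as
       defined by E.W.Dijkstra in "A Discipline of Programming".
       It supports all the array operations used in that book.
       It also supports as much of the Array structure in the
       Standard ML Basis Library as could be adapted. *)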

A module requires a minimum of 3 lines:

    structure Darray : DARRAY =
       struct
          ...
       end;

That's 11 lines out of 45, leaving me just 34 lines for 39 functions.
Even with just the core 16, that's barely 2 lines each, which is NOT
going to work.

    The open file is your workbench.  It should mesh well with your
    working memory.

Yes, but "meshing well with my working memory" means for me,
"something I can use INSTEAD of working memory."  That is, for example,
why I need the definition of DA on screen at the same time as the code
that is creating one: so that it's ***NOT*** in my working memory.

We really need some evidence about what is a good way to support
your working memory:  is replicating it better, or is supplementing
it better?  How would you tell?

    Admittedly this creates another type of complexity: the complexity
    of many files.

It sounds like something that needs empirical research about which is
worse.  My *personal* feeling is that "vast collections of teeny-tiny
files" is worse, because with a medium-size single-topic file, at least
I know where to look for stuff.

    I really hate that phrase "non-trivial".

That is an interesting fact about you.

We could make it concrete: a class that provides at most one behaviour
other than getters, setters, and toString comes pretty close to what
I had in mind.

    It came up a lot recently in relation to the NPM left-pad fiasco,
    along with statements like "have we all forgotten how to program?"

Having looked at the left-pad *code*, it seemed that even the author
of it had forgotten a fair bit.  It wasn't only easy to write left pad,
it was easy to write it *better*.  Heck, it took me 5 minutes, most of
which was looking stuff up because I don't do much JavaScript.
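
For the record, here is roughly that exercise redone in SML, since that
is the language used elsewhere in this message (the function name is
mine; the SML Basis Library already ships this as StringCvt.padLeft):

    (* pad s on the left with copies of c until it is n characters
       wide; if s is already at least n wide, return it unchanged *)
    fun leftPad (c : char) (n : int) (s : string) =
        if size s >= n then s
        else CharVector.tabulate (n - size s, fn _ => c) ^ s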

Here's another definition of trivial: code where it's less effort to
write your own than to find it.

It's not clear what "padding on the left to width n" means
if a string contains format effector characters or will be
displayed in a variable width font, and I agree that dealing
with or even documenting those issues *would* have made left pad
non-trivial.  But it did neither.

    Small does not mean trivial.

In that particular case, it did.

    Small, single responsibility classes are perhaps the most useful
    and the most reusable.

A single responsibility is not the same as a single method.
As an example, the left-pad fiasco occurred for a number of reasons,
one of them being that this commonly desired operation was not
already in the JavaScript string interface.

        Even then, I struggle to conceive of a case where a 1,000 line
        file could be broken down into 7 clear, comprehensible concepts.

    You seem to be talking about a major rewrite, which I'm not.


    I'm talking about *Single Responsibility Principle*.
    https://en.wikipedia.org/wiki/Single_responsibility_principle
What that page says is
    The single responsibility principle states that every module or class
    should have responsibility over a single part of the functionality
    provided by the software, and that responsibility should be entirely
    encapsulated by the class.

A *single responsibility* is not the same thing as a single
*function* or as a tiny amount of code.  You really can have a
thousand line file with a single responsibility.  Some things
are just algorithmically challenging.

You can't get much more "single responsibility" than
 "given a character stream, return the next token from it".

The last tokeniser I wrote, for a rather simple but real programming
language, took 80 lines of Lex (which really couldn't have been any
shorter) and 32 lines of C.  This thing doesn't even convert numbers
from string form to numeric form, nor does it do anything with
string literals other than recognise them.  (No escape translation.)
It does one thing and one thing only: read the next token.

That is a single responsibility.

You cannot even fit a list of what the tokens ARE into 45 lines,
as there are 51 of them (including the automatic end-of-file).
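
To make that concrete, here is the shape such a list takes in SML
(these constructors are hypothetical, not the actual 51 tokens of the
language in question):

    datatype token
       = TkIdent of string    (* identifier, uninterpreted *)
       | TkNumber of string   (* number, still in string form *)
       | TkString of string   (* string literal, escapes not decoded *)
       | TkLParen | TkRParen | TkComma | TkSemicolon
       | TkAssign | TkPlus | TkMinus | TkStar | TkSlash
       | TkEOF                (* the automatic end-of-file token *)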

    If you are modifying a file, that's because its responsibility
    has changed.

(a) I was talking about READING files, not just modifying them.
(b) No, the responsibility of a file may be exactly the same, but
    the world may have changed.

For example, the language that I mentioned the tokeniser for was originally
designed for a 6-bit character set, then adapted to ASCII, and the tokeniser
I wrote handles ISO Latin 1.  But the world has moved on to Unicode.  Since
this is a fun reconstruction of a dead programming language (with a living
but incompatible successor), I don't actually care all that much about
Unicode.  But a large number of compilers for other languages have had to
change their lexical analyser and indeed aspects of their symbol tables
IN ORDER TO KEEP ON DOING THE SAME THING, for a very important sense of "same".

The Martin diktat "A class should have only one reason to change"
demonstrably fails for tokenisers:
 1. the language to be tokenised might change.  (Martin diktat.)
 2. the system's character set might change.  (Red Queen reality.)
 3. a library the tokeniser depends on might change. (Other people's code.)

You could say "oh, the responsibility was read-next-token-from-ASCII-stream
and now it's read-next-token-from-Unicode-stream", but that's a revisionist
view: when the original tokenisers were written, that's not how they were
thought of.  Nobody thought of "ASCII instead of Unicode" because Unicode
did not then exist.

As it happens, the tokeniser in question also had to be changed
for reason 3.  The lex library on one system turned out to have an
undocumented feature/quirk/bug.  My code had to change in order to
do the same thing.  (This wasn't even an OS difference.)

Let's take another example.  There was a golden era in Objective C's history
when Apple provided a version of Objective C on MacOS X that supported
garbage collection.  Then they changed their minds, and reverted to
semi-automatic reference counting.  Working code written during the brief
golden age had to be modified IN ORDER TO KEEP ON DOING THE SAME THING, not
because *its* responsibilities had changed in any way, but because Apple
had decided to break things.  (This is far from the only change that Apple
have made that has broken things.)

But I repeat, there are many other reasons to read other people's code than
an intention to modify it.

    to understand the impact of your changes to that file, you need
    to have clear in your mind everything that the file does.

On the one hand, in my experience that claim is simply false.
We'd never get anything done if it were true.
For example, I once fixed a bug in the UNIX V7 PDP-11 C compiler
without knowing much about what most of the file I changed did.
All I had to know was that "bad code is generated for this
construction" and "this is the only part of the file that's involved
in that construction" and "that part isn't involved in any other
construction."

What's more, if it *were* true, then breaking a file up into
lots of smaller pieces could not actually help, because the
logic of "you can't change anything without understanding
everything" applies just as much when a responsibility is spread
over dozens of files as when it's in a single file.

    If it's a 10 line file, that's relatively easy.

Well, no.  A 10-line file isn't going to have much *private* code that
can be changed safely because it's hidden behind an interface.  You're
going to have to hunt down every place the thing exported by that file
is *used* to make sure the change is safe.

To use your own example, left pad fits the "10-line file" model pretty well.
And the code could be significantly more efficient.  Hunt down its uses?
Good luck with that!

    If it's a 1,000 line file, good luck!

With a 1000 line file, a lot of the code is or should be
private, and so a higher proportion of the code will be safer
to change.
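
In SML the mechanism that makes this safe is opaque signature
ascription.  A minimal sketch (the signature is cut down to two
operations, and sub is my guess at the obvious implementation):

    signature DARRAY_CORE =
    sig
        type 't darray
        val fromList : int -> 't list -> 't darray
        val sub      : 't darray * int -> 't
    end

    (* ":>" hides everything the signature does not mention *)
    structure Darray :> DARRAY_CORE =
    struct
        datatype 't darray = DA of {
            lob : int ref, dom : int ref,
            b   : int ref, e   : int ref,
            arr : 't Array.array ref
        }
        fun fromList lob xs =
            let val dom = length xs
             in DA {lob = ref lob, dom = ref dom, b = ref 0,
                    e = ref dom, arr = ref (Array.fromList xs)}
            end
        (* clients cannot even name DA, so this layout can change
           without any caller being able to tell *)
        fun sub (DA {lob, b, arr, ...}, i) =
            Array.sub (!arr, !b + (i - !lob))
    end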

For what it's worth, I *have* maintained thousand-line files
I didn't write, and you know what?  It was pretty easy, if
there were decent comments.  (And I don't mean JavaDoc.)

If I change my complaint about large modules looking like
the same kind of stuff over and over with few to no clues
about the structure, to one about large "subsystems",
nothing of importance to me changes, except that subsystems
made of lots of files are even worse *for me* to deal with.

