Hello Collin,
thanks for the explanation.

>> Note that '$' behaves as expected here:
>>
>> $ for i in 1 2 3; do head -c8192 /dev/zero | tr \\0 $i; done |
>>   tac -rs '$' | od -Ax -tx1z
>> 0000 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31  >1111111111111111<
>> *
>> 2000 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32  >2222222222222222<
>> *
>> 4000 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33 33  >3333333333333333<
>> *
>> 6000
> 
> Yep, 'tac' reads 8192 bytes at a time. In this case, "^" matches the
> beginning of the string. Therefore, the start of each 8192 byte buffer.

I see.
Then I wonder why "$" doesn't have the same effect, since it should 
likewise match the end of each 8192 byte buffer causing tac to print them 
in reverse order.

> It seems like you thought that "^" and "$" operate on lines of text
> instead of buffers, is that correct? I think there was some attempt to
> address that here [1]:

I thought that "^" and "$" operate on the input as a whole.

  [1] "^" matches start-of-string or after-newline, and
  [2] "$" matches end-of-string or before-newline

They do match after-newline and before-newline respectively, as expected.

And they do also match start-of-string and end-of-string respectively, 
except that the notion of "string" was not what I expected: I assumed it 
was the whole input. Instead it is the buffered portion of it.
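
As an illustration only, here is a minimal Python sketch of the difference 
between the two notions of "string". It is a model, not tac's code: the 
8192 byte chunk size is taken from the explanation above, the names are 
made up, and Python's re module stands in for the regex library tac uses. 
It only models "^"; it does not try to explain the "$" case.

```python
import re

CHUNK = 8192  # buffer size mentioned above; assumed here, not read from tac

# Same input as the shell example: 8192 '1's, 8192 '2's, 8192 '3's.
data = b"".join(bytes([d]) * CHUNK for d in b"123")
anchor = re.compile(rb"^", re.MULTILINE)

# "String" = whole input: "^" matches only at offset 0 (no newlines present).
whole_input_matches = [m.start() for m in anchor.finditer(data)]

# "String" = each buffer searched on its own: "^" matches at every buffer start.
per_chunk_matches = []
for off in range(0, len(data), CHUNK):
    chunk = data[off:off + CHUNK]
    per_chunk_matches += [off + m.start() for m in anchor.finditer(chunk)]

print(whole_input_matches)  # [0]
print(per_chunk_matches)    # [0, 8192, 16384]
```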

It was clear to me that the search is not done line by line, since lines 
may be irrelevant to the record structure we are trying to define.
Of course I can imagine that buffering is involved and that tac works on a 
portion of the input at a time, but unless documented, I would expect 
buffering to be just an implementation detail. That is, "^" should anchor 
only to the very start of input or after a newline, and "$" only to the 
very end of input or before a newline.

I realize that such anchors probably have only a niche use in tac, but I 
think you can be sure that whenever "^" matches anything other than the 
very start of input or after a newline, it is not what the user intended, 
because the buffer's boundaries are not meaningful to the user.

Perhaps these spurious matches could be avoided by setting the not_bol (not 
at beginning of line) and the not_eol fields [1] correctly for each search.

    If the not_bol field is set in the pattern buffer, then "^" fails 
    to match at the beginning of the string. This lets you match 
    against pieces of a line, as you would need to if, say, searching 
    for repeated instances of a given pattern in a line; it would work 
    correctly for patterns both with and without 
    match-beginning-of-line operators.
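
A rough analogy in Python, as a sketch only: Python's re has no not_bol 
field, but searching a window of the full string via the pos/endpos 
arguments, instead of searching a slice, has a similar effect for "^", 
because "^" then refers to the real start of the string. The chunk size, 
data and function names below are invented for the illustration and are 
not taken from tac.

```python
import re

ANCHOR = re.compile(rb"^", re.MULTILINE)
DATA = b"aaaaaaaa" b"bbb\nbbbb" b"cccccccc"  # 24 bytes, newline at offset 11

def sliced_line_starts(data, size):
    """Search each chunk as its own string: "^" matches at every chunk start."""
    found = []
    for off in range(0, len(data), size):
        chunk = data[off:off + size]
        found += [off + m.start() for m in ANCHOR.finditer(chunk)]
    return found

def windowed_line_starts(data, size):
    """Search windows of the full string: "^" ignores the window start."""
    found = []
    for off in range(0, len(data), size):
        end = min(off + size, len(data))
        found += [m.start() for m in ANCHOR.finditer(data, off, end)]
    return found

print(sliced_line_starts(DATA, 8))    # [0, 8, 12, 16]
print(windowed_line_starts(DATA, 8))  # [0, 12]
```

Only offsets 0 (start of input) and 12 (after the newline) survive in the 
second case, which is the behaviour I would expect from the anchors.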

By avoiding such matches, it would become possible, for example, to do 
something like this:

$ tac -rs  '\`REGEXP'  # move REGEXP from START to END of file
$ tac -brs "REGEXP\'"  # move REGEXP from END to START of file

I also assumed that regex separators were searched going forward and 
according to the "leftmost longest" rule [3]. Instead they are searched 
going backward and according to the "rightmost longest" rule. This makes a 
difference, for example:

$ printf '<one><<two>><<<three>>>' | tac -rs '>+'
>><<<three>><<two><one>

I.e. the records look like this (backward rightmost longest search):
"<one>" "<<two>" ">" "<<<three>" ">" ">"

But I wouldn't be surprised if most users expect them to look like this 
(forward leftmost longest search):
"<one>" "<<two>>" "<<<three>>>"
reverse: "<<<three>>><<two>><one>"
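
The two splittings can be reproduced with a small Python model. This is a 
sketch inferred from the observed output above, not taken from tac's 
source: the function names are made up, and Python's re is a backtracking 
engine rather than a POSIX one, although for '>+' its greedy match is the 
same as the longest match. The backward search assumes that a match may 
not extend past the start of the previously found separator.

```python
import re

def split_forward(data, pattern):
    """Forward search: each record ends with the separator that follows it."""
    records, start = [], 0
    for m in re.finditer(pattern, data):
        if m.end() == m.start():
            continue  # ignore empty matches
        records.append(data[start:m.end()])
        start = m.end()
    if start < len(data):
        records.append(data[start:])
    return records

def split_backward(data, pattern):
    """Backward search: try each start position from the end of the input."""
    rx = re.compile(pattern)
    reversed_records = []
    record_end = len(data)  # end of the record currently being delimited
    limit = len(data)       # a match may not extend past this offset
    pos = len(data) - 1
    while pos >= 0:
        m = rx.match(data, pos, limit)
        if m and m.end() > m.start():
            if m.end() < record_end:
                reversed_records.append(data[m.end():record_end])
            record_end, limit = m.end(), m.start()
        pos -= 1
    if record_end > 0:
        reversed_records.append(data[:record_end])
    return reversed_records[::-1]

text = "<one><<two>><<<three>>>"
backward = split_backward(text, ">+")
forward = split_forward(text, ">+")
print(backward)                     # ['<one>', '<<two>', '>', '<<<three>', '>', '>']
print("".join(reversed(backward)))  # >><<<three>><<two><one>
print(forward)                      # ['<one>', '<<two>>', '<<<three>>>']
print("".join(reversed(forward)))   # <<<three>>><<two>><one>
```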

>     Records are separated by instances of a string (newline by default).
>     By default, this separator string is attached to the end of the record
>     that it follows in the file.
> 
> Perhaps a note is needed under the description of '--regex' that "^" and
> "$" operate on records instead of lines. What do you think of that idea?

It doesn't seem correct to say they operate on records when they are part 
of the expression used to define what a record is. And they do match 
after-newline and before-newline respectively.
Perhaps you mean that, when a separator is removed from the buffer, the end 
of the buffer coincides with the end of the non-separator part (which may 
be empty), therefore "$" matches that boundary. So the notion of "$" is 
somewhat tied to the previous match. But I wouldn't describe that as 
"operating on records", and I would probably be even more confused by such 
a statement.

Perhaps it would be sufficient to warn that anchors may match arbitrary 
buffer boundaries, due to internal buffer handling.

I would also recommend documenting:

- That regex separators are searched going backward and according to the 
"rightmost longest" rule.

- Which flavor of regex tac takes. I assumed it was either BRE or ERE as in 
GNU grep/sed, but it is neither (+ is an operator but | is not). It doesn't 
need to be a full syntax description.

[1] https://www.gnu.org/software/gnulib/manual/html_node/Match_002dbeginning_002dof_002dline-Operator.html
[2] https://www.gnu.org/software/gnulib/manual/html_node/Match_002dend_002dof_002dline-Operator.html
[3] https://www.gnu.org/software/gnulib/manual/html_node/What-Gets-Matched_003f.html
