bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-16 Thread Roy Smith
Yup, this does depend on the locale.  In my original example, I had 
LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:

> $ LANG=C.UTF-8 uniq -c x
>   1 "ⁿᵘˡˡ"
>   1 "ܥܝܪܐܩ"


But, that doesn't fully explain what's going on.  I find it difficult to 
believe that there's any collation sequence in the world where those two 
strings should compare the same.  I've been playing around with the ICU string 
compare demo and can't reproduce this there.  Possibly I just haven't hit 
upon the right combination of options to set, but I think it's far-fetched 
that there's any combination for which those two strings legitimately 
compare equal.
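For what it's worth, the same locale dependence shows up with 'sort', which
uses the same strcoll-style comparison.  A sketch (the file name 'x' and the
availability of en_US.UTF-8 locale data are assumptions):

```shell
# Write the two strings from the report to a file.
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > x

# Byte-wise (strcmp-style) comparison: the two lines are distinct.
LC_ALL=C sort -u x | wc -l          # 2

# Locale-dependent (strcoll-style) comparison: on affected systems
# with en_US.UTF-8 installed, the two lines may compare equal,
# in which case 'sort -u' keeps only one of them.
LC_ALL=en_US.UTF-8 sort -u x | wc -l
```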



bug#38621: gdu showing different sizes

2019-12-16 Thread Bernhard Voelker
On 2019-12-16 20:43, TJ Luoma wrote:
> AHA! Ok, now I understand a little better. I have seen the difference
> between "size" and "size on disk" and did not realize that applied
> here.

Thanks for confirming.

> I'm still not 100% clear on _why_ two "identical" files would have
> different results for "size on disk" (it _seems_ like those should be
> identical) but I suspect that the answer is probably of a technical
> nature that would be "over my head" so to speak, and truthfully, all I
> really need to know is "sometimes that happens" rather than
> understanding the technical details of why.

Actually the difference is a matter of choice, i.e., how the user wants to
save the file (obviously, most programs come with a certain default preference).

Suppose one writes a file with an "A" at the beginning, then e.g. 1,000,000 NUL
characters, and then a "B".

Then the storing algorithm may decide either to explicitly write all NULs
to disk; e.g. 'cp --sparse=never' would do so:

  - write "A",
  - write 1,000,000 times a NUL,
  - write "B".

or to try to save some disk space by writing it as a "sparse" file;
e.g. 'cp --sparse=always' would (try to) do so:

  - write an "A",
  - then tell the filesystem that there are 1,000,000 NULs
    (which takes just a few bytes physically),
  - write a "B".

The latter method needs support from both the tool and the file system
where the file is stored.
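A sketch of the two strategies; the file names are made up for illustration,
and whether the sparse copy actually occupies less disk depends on the
filesystem:

```shell
# Build the example file: "A", a hole of 1,000,000 NUL bytes, then "B".
printf 'A' > f
truncate -s +1000000 f         # extend with 1,000,000 NULs (a hole)
printf 'B' >> f

# Strategy 1: explicitly write every NUL block to disk.
cp --sparse=never f f.full

# Strategy 2: represent runs of all-zero blocks as holes.
cp --sparse=always f f.sparse

ls -l f.full f.sparse          # identical content size: 1000002 bytes each
du -k f.full f.sparse          # f.full occupies ~980 KiB; f.sparse typically far less
```

Any program reading the two files sees identical content; only the on-disk
representation differs.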

Or in your words: "sometimes that happens". ;-)

> I appreciate you taking the time to educate me further about this.

No worries.  If one user got confused, chances are that others will
run into the same issue.  So if you think we could improve something,
e.g. a clarifying word in the documentation, that would help us all.

Thanks & have a nice day,
Berny





bug#38621: gdu showing different sizes

2019-12-16 Thread Bob Proulx
TJ Luoma wrote:
> AHA! Ok, now I understand a little better. I have seen the difference
> between "size" and "size on disk" and did not realize that applied
> here.
>
> I'm still not 100% clear on _why_ two "identical" files would have
> different results for "size on disk" (it _seems_ like those should be
> identical) but I suspect that the answer is probably of a technical
> nature that would be "over my head" so to speak, and truthfully, all I
> really need to know is "sometimes that happens" rather than
> understanding the technical details of why.

I think the confusion began right at the start, because the two
commands are named after the different things they were intended to
show:

  'du' is named for showing disk usage

  'ls' is named for listing files

And those are rather different things!  Let's dig into the details.

The long format for information says:

  ‘-l’
  ‘--format=long’
  ‘--format=verbose’
   In addition to the name of each file, print the file type, file
   mode bits, number of hard links, owner name, group name, size, and
   timestamp (*note Formatting file timestamps::), normally the
   modification timestamp (the mtime, *note File timestamps::).  Print
   question marks for information that cannot be determined.

So we know that ls lists the size of the file.  But let me say
specifically that this size is a property of the *file*: it is
file-centric.  There is also the -s option.

  ‘-s’
  ‘--size’
   Print the disk allocation of each file to the left of the file
   name.  This is the amount of disk space used by the file, which is
   usually a bit more than the file’s size, but it can be less if the
   file has holes.

This displays how much disk space the file consumes instead of the
size of the file.  The two are different things.

And then the 'du' documentation says:

  ‘du’ reports the amount of disk space used by the set of specified files

And so du is the disk used by the file.  But as we know the amount of
disk used is dependent upon the file system holding the file.
Different file systems will have different storage methods and the
amount of disk space being consumed by a file will be different and
somewhat unrelated to the size of the file.  Disk space consumed to
hold the file could be larger or smaller than the file size.

In particular, if the file is sparse then there are "holes" in the
middle that are all zero data and do not need to be stored, which
saves that space; in that case disk usage will be smaller than the
file size.  Or, since files are stored in blocks, the final block will
have some fragment of space past the end of the file that is too small
to be used for other files; in that case disk usage will be larger.
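Both directions can be seen side by side.  A small sketch, with made-up
file names, assuming a filesystem with 4 KiB blocks and sparse-file
support:

```shell
printf 'hi' > tiny       # 2 bytes of content
truncate -s 10M hole     # 10 MiB apparent size, but no data written

ls -l tiny hole          # content sizes: 2 and 10485760
du -h tiny hole          # disk usage: one block (e.g. 4.0K) vs. (often) 0
```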

Therefore it is not surprising that the numbers displayed for disk
usage are not the same as the file content size.  They would only line
up exactly if the file content size were a multiple of the file system
storage block size and every block were fully represented on disk.
Otherwise they will always differ at least somewhat.

As long as I am here I should mention 'df', which shows disk free space
information.  One sometimes expects the file content sizes to add up
to the du disk usage, but they don't.  And one sometimes expects all
of the du disk usage to add up to what df reports, but it doesn't
either.  That is for a similar reason.  File systems reserve a
min-free amount of space for superuser-level processes to ensure
continued operation even if the disk is filling up with data from
non-privileged processes.  Also, file system efficiency and
performance drop dramatically as the file system fills up.  Therefore
the file system reports space with the min-free reserved space in
mind.  And once again this is different on different file systems.
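You can see this at the filesystem level with GNU df.  On many
filesystems the reported size does not equal used plus available,
precisely because of that reserved space:

```shell
# Show filesystem totals for the current directory's filesystem.
df -k --output=size,used,avail,target .
# On ext4 with default settings, size - (used + avail) roughly reflects
# the ~5% of blocks reserved for the superuser (tunable with tune2fs -m).
```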

But let me return to your first bit of information.  The ls long
listing of the files.  Your version of ls gave an indication that
something was different about the second file.

> % command ls -l *pkg
> -rw-r--r--  1 tjluoma  staff  5047 Dec 15 00:00 StreamDeck-4.4.2.12189.pkg
> -rw-r--r--@ 1 tjluoma  staff  5047 Dec 15 00:02 
> Stream_Deck_4.4.2.12189.pkg

See that '@' in that position?  The GNU coreutils 8.30 ls
documentation I am looking at says:

 Following the file mode bits is a single character that specifies
 whether an alternate access method such as an access control list
 applies to the file.  When the character following the file mode
 bits is a space, there is no alternate access method.  When it is a
 printing character, then there is such a method.

 GNU ‘ls’ uses a ‘.’ character to indicate a file with a security
 context, but no other alternate access method.

 A file with any other combination of alternate access methods is
 marked with a ‘+’ character.

I did not see anywhere that documented what an '@' means.  Therefore
it is likely something applied in a downstream 

bug#38621: gdu showing different sizes

2019-12-16 Thread TJ Luoma
AHA! Ok, now I understand a little better. I have seen the difference
between "size" and "size on disk" and did not realize that applied
here.

I'm still not 100% clear on _why_ two "identical" files would have
different results for "size on disk" (it _seems_ like those should be
identical) but I suspect that the answer is probably of a technical
nature that would be "over my head" so to speak, and truthfully, all I
really need to know is "sometimes that happens" rather than
understanding the technical details of why.

I appreciate you taking the time to educate me further about this.

Cheers

Tj



On Mon, Dec 16, 2019 at 2:47 AM Bernhard Voelker wrote:
>
> On 2019-12-16 07:25, TJ Luoma wrote:
> > I sort of followed most of the technical part of that but I still don’t
> > understand why it’s not a bug to show different information about two
> > identical files.
> >
> > Which may indicate that I didn’t understand the technical part very well.
> >
> > As an end user, it’s hard to understand how that inconsistency isn’t both
> > undesirable and a bug.
> >
> > I could maybe see if they were two files with the same byte-count but
> > different composition that made the calculations off by 1, but this is an
> > identical file and it’s showing up with two different sizes, in a tool
> > meant to report sizes.
> >
> > That just seems “obviously” wrong even if it’s somehow technically
> > explainable.
>
> Thanks for following up on this for further clarifications.
>
> I think the problem is the word "size":
> while 'ls' and 'du --apparent-size' show the length of the content of
> a file, 'du' (without '--apparent-size') reports the space the file
> needs on disk.
>
>   $ du --help | sed 3q
>   Usage: du [OPTION]... [FILE]...
> or:  du [OPTION]... --files0-from=F
>   Summarize disk usage of the set of FILEs, recursively for directories.
> ^^
>
> One reason for those sizes to differ are "holes".  As an extreme case,
> one can create a 4 Terabyte file (just NULs) on a filesystem which is
> much smaller than that:
>
>   # Filesystem size.
>   $ df -h --out=size,target .
>Size Mounted on
>591G /mnt
>
>   # Create a NUL-only file of size 4 Terabyte.
>   $ truncate -s4T f2
>
>   # 'ls' shows the 4T of file size.
>   $ ls -logh f2
>   -rw-r--r-- 1 4.0T Dec 16 08:36 f2
>
>   # 'du' shows that the file does not even require any disk usage.
>   $ du -h f2
>   0 f2
>
>   # ... but with '--apparent-size' reports the real (content) size.
>   $ du -h --apparent-size f2
>   4.0T  f2
>
>   # Any program will see the 4T content transparently.
>   $ wc -c < f2
>   4398046511104
>
> In your case, the file was a mixture of regular data and holes,
> and 'cp' (without --sparse=always) tried to automatically determine
> if the target file should have holes or not (see 'man cp').
> Therefore, your 2 files had a different disk usage, but the net length
> of the content is identical, of course.
>
> Have a nice day,
> Berny





bug#38627: uniq -c gets wrong count with non-ascii strings

2019-12-16 Thread Paul Eggert
On 12/15/19 11:40 AM, Roy Smith wrote:
> With the following input:
> 
>> $ cat x
>> "ⁿᵘˡˡ"
>> "ܥܝܪܐܩ"
> 
> 
> Running "uniq -c" says there's two copies of the same line!
> 
>> $ uniq -c x
>>   2 "ⁿᵘˡˡ"

Thanks for the bug report. I expect this is because GNU 'uniq' uses the
equivalent of strcoll (locale-dependent comparison) to compare lines, whereas
macOS 'uniq' uses the equivalent of strcmp (byte comparison). Since the two
lines compare equal in your locale, GNU 'uniq' says there's just one line.

The GNU 'uniq' behavior appears to be a consequence of this commit:

commit 545c2323d493c7ed9c770d9b8e45a15db6f615bc
Author: Jim Meyering 
Date:   Fri Aug 2 14:42:37 2002 +

with a change noted this way in NEWS:

* uniq now obeys the LC_COLLATE locale, as per POSIX 1003.1-2001 TC1.

However, the 2016 edition of POSIX removed mention of LC_COLLATE from 'uniq',
and I expect this means that the 2002 commit should be reverted so that GNU
'uniq' behaves like macOS 'uniq' (a behavior that I think makes more sense 
anyway).

I'll CC: this email to Jim Meyering to see whether he has an opinion about this.

In the meantime you can work around the problem by using 'LC_ALL=C uniq' instead
of plain 'uniq' in your shell script.
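A sketch of the workaround applied to the input from the report (the
file name 'x' is from the original example):

```shell
# Recreate the input file from the report.
printf '%s\n' '"ⁿᵘˡˡ"' '"ܥܝܪܐܩ"' > x

# Force byte-wise (strcmp-style) comparison with the C locale:
# each line is now counted separately.
LC_ALL=C uniq -c x
```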