On 15 Feb 2022 21:17, Jacob Bachmeyer wrote:
> Mike Frysinger wrote:
> > context: https://bugs.gnu.org/53340
> >   
> Looking at the highlighted line in the context:

thanks for getting into the weeds with me

> > >   echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
> It seems that the problem is that am__base_list expects ListOf/File (and 
> produces ChunkedListOf/File) but am__pep3147_tweak emits ListOf/Glob.  
> This works in the usual case because the shell implicitly converts Glob 
> -> ListOf/File and implicitly flattens argument lists, but results in 
> the overall command line being longer than expected if the globs expand 
> to more filenames than expected, as described there.
> 
> It seems that the proper solution to the problem at hand is to have 
> am__pep3147_tweak expand globs itself somehow and thus provide 
> ListOf/File as am__base_list expects.
> 
> Do I misunderstand?  Is there some other use for xargs?

if i did not care about double expansion, that might work.  the step quoted
here handles the arguments correctly (other than whitespace splitting on the
initial input, but that's a much bigger task) before passing them on to the
rest of the pipeline.  so the full context:

  echo "$$py_files" | $(am__pep3147_tweak) | $(am__base_list) | \
  while read files; do \
    $(am__uninstall_files_from_dir) || st=$$?; \
  done || exit $$?; \
...
am__uninstall_files_from_dir = { \
  test -z "$$files" \
    || { test ! -d "$$dir" && test ! -f "$$dir" && test ! -r "$$dir"; } \
    || { echo " ( cd '$$dir' && rm -f" $$files ")"; \
         $(am__cd) "$$dir" && rm -f $$files; }; \
  }

leveraging xargs would allow me to maintain a single shell expansion.
the pathological situation being:
  bar.py
  __pycache__/
    bar.pyc
    bar*.pyc
    bar**.pyc

py_files="bar.py", which the pipeline turns into "__pycache__/bar*.pyc", and
then am__uninstall_files_from_dir expands it when calling `rm -f`.

if the pipeline expanded the glob, it would be:
  __pycache__/bar.pyc __pycache__/bar*.pyc __pycache__/bar**.pyc
and then when calling rm, those would expand a 2nd time.
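
to make that concrete, a minimal repro of that double expansion at a plain
shell prompt, using the listing above:

  $ mkdir __pycache__
  $ touch __pycache__/bar.pyc '__pycache__/bar*.pyc' '__pycache__/bar**.pyc'
  $ files='__pycache__/bar*.pyc'
  $ echo rm -f $files
  rm -f __pycache__/bar*.pyc __pycache__/bar**.pyc __pycache__/bar.pyc

the one glob the pipeline produced matched itself plus two more files,
exactly the way the unquoted $$files in am__uninstall_files_from_dir does.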

i would have to change how the pipeline outputs the list of files such that
the final subshell could safely consume & expand.  since this is portable
shell, i don't have access to arrays & fancy things like readarray.  if the
pipeline switched to newline delimiting, and i dropped $(am__base_list), i
could use positionals to construct an array and safely expand that.  but i
strongly suspect that it's not going to be as performant, and i might as
well just run `rm` once per file :x.

  echo "$$py_files" | $(am__pep3147_tweak) | \
  ( set --
    while read file; do
      set -- "$@" "$file"
      if test $# -ge 40; then
        rm -f "$@"
        set --
      fi
    done
    if test $# -gt 0; then
      rm -f "$@"
    fi
  )

at which point i've just reimplemented `xargs -n40`, only slower :p.
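
for comparison, the xargs spelling of the same idea -- assuming the pipeline
were taught to expand its globs and emit one file per line, and glossing over
xargs's own quote/backslash handling of its input -- is roughly:

  echo "$$py_files" | $(am__pep3147_tweak) | xargs -n40 rm -f

and since xargs exec()s rm directly, nothing would get a second shell
expansion.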

> > automake jumps through some hoops to try and limit the length of generated
> > command lines, like deleting output objects in a non-recursive build.  it's
> > not perfect -- it breaks arguments up into 40 at a time (akin to xargs -n40)
> > and assumes that it won't have 40 paths with long enough names to exceed the
> > command line length.  it also has some logic where it's deleting paths by
> > globs, but the process to partition the file list into groups of 40 happens
> > before the glob is expanded, so there are cases where it's 40 globs that can
> > expand into many many more files and then exceed the command line length.
> 
> First, I thought that GNU-ish systems were not supposed to have such 
> arbitrary limits,

one person's "arbitrary limits" is another person's "too small limit" :).
i'm most familiar with Linux, so i'll focus on that.

xargs --show-limits on my Linux-5.15 system says:
  Your environment variables take up 5934 bytes
  POSIX upper limit on argument length (this system): 2089170
  POSIX smallest allowable upper limit on argument length (all systems): 4096
  Maximum length of command we could actually use: 2083236

2MB ain't too shabby.  but if we consult execve(2), it has more details:
https://man7.org/linux/man-pages/man2/execve.2.html
       On Linux prior to kernel 2.6.23, the memory used to store the
       environment and argument strings was limited to 32 pages (defined
       by the kernel constant MAX_ARG_PAGES).  On architectures with a
       4-kB page size, this yields a maximum size of 128 kB.

i've definitely seen "Argument list too long" errors in Gentoo from a variety
of packages due to this 128 kB limit (which includes the environ strings).
users are very hostile to packages :p.

       On kernel 2.6.23 and later, most architectures support a size
       limit derived from the soft RLIMIT_STACK resource limit (see
       getrlimit(2)) that is in force at the time of the execve() call.
       (Architectures with no memory management unit are excepted: they
       maintain the limit that was in effect before kernel 2.6.23.)
       This change allows programs to have a much larger argument and/or
       environment list.  For these architectures, the total size is
       limited to 1/4 of the allowed stack size.  (Imposing the
       1/4-limit ensures that the new program always has some stack
       space.)  Additionally, the total size is limited to 3/4 of the
       value of the kernel constant _STK_LIM (8 MiB).  Since Linux
       2.6.25, the kernel also places a floor of 32 pages on this size
       limit, so that, even when RLIMIT_STACK is set very low,
       applications are guaranteed to have at least as much argument and
       environment space as was provided by Linux 2.6.22 and earlier.
       (This guarantee was not provided in Linux 2.6.23 and 2.6.24.)
       Additionally, the limit per string is 32 pages (the kernel
       constant MAX_ARG_STRLEN), and the maximum number of strings is
       0x7FFFFFFF.

so things are much better now, but we're still subject to whatever the
rlimit stack is set to.
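
that dependency is easy to see on such a system (the numbers below assume
the common 8 MiB default stack rlimit; getconf ARG_MAX reports the 1/4
figure the man page describes):

  $ ulimit -s
  8192
  $ getconf ARG_MAX
  2097152
  $ ( ulimit -s 1024; getconf ARG_MAX )
  262144

8192 KiB / 4 = 2 MiB, and lowering the rlimit drags the ceiling right down
with it.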

i also vaguely recall older versions of GNU findutils's xargs (and i see
some reports in Gentoo) where even `find ... | xargs ...` fails with
"xargs: ...: Argument list too long".  so i'd be inclined to still use
an -n limit with xargs rather than just let it go forth.

plus, backing up, Automake can't assume Linux.  so i think we have to
proceed as if there is a command line limit we need to respect.

> and this issue (the context) originated from Gentoo 
> GNU/Linux.  Is this a more fundamental bug in Gentoo or still an issue 
> because Automake build scripts are supposed to be portable to foreign 
> systems that do have those limits?

to be clear, what's failing is an Automake test.  it replaces `rm` with a
wrapper that enforces an artificially low command line limit.  in
t/instmany-python.sh it does:
limit=4500
sed "s|@limit@|$limit|g" >x-bin/'rm' <<'END'
#! /bin/sh
limit=@limit@
PATH=$oPATH; export PATH
RM='rm -f'
len=`expr "$RM $*" : ".*" 2>/dev/null || echo $limit`
if test $len -ge $limit; then
  echo "$0: safe command line limit of $limit characters exceeded" >&2
  exit 1
fi
exec $RM "$@"
exit 1
END

Gentoo happened to find this error before Automake because Gentoo also found
and fixed a Python 3.5+ problem -- https://bugs.gnu.org/38043.  once we fix
that in Automake too, we see this same problem.  i'll remove "Gentoo" from
the bug title to avoid further confusion.

> I note that the current version of standards.texi also allows configure 
> and make rules to use awk(1); could that be useful here instead? (see below)
> ...
> Second, counting files in the list, as you note, does not necessarily 
> actually conform to the system limits, while Awk can track both number 
> of elements in the list and the length of the list as a string, allowing 
> to break the list to meet both command tail length limits (on Windows or 
> total size of block to transfer with execve on POSIX) and argument count 
> limits (length of argv acceptable to execve on POSIX).
> 
> POSIX Awk should be fairly widely available, although at least Solaris 
> 10 has a non-POSIX awk in /usr/bin and a POSIX awk in /usr/xpg4/bin; I 
> found this while working on DejaGnu.  I ended up using this test to 
> ensure that "awk" is suitable:
> 
> 8<------
> # The non-POSIX awk in /usr/bin on Solaris 10 fails this test
> if echo | "$awkbin" '1 && 1 {exit 0}' > /dev/null 2>&1 ; then
>     have_awk=true
> else
>     have_awk=false
> fi
> 8<------
> 
> 
> Another "gotcha" with Solaris 10 /usr/bin/awk is that it will accept 
> "--version" as a valid Awk program, so if you use that to test whether 
> "awk" is GNU Awk, you must redirect input from /dev/null or it will hang.
> 
> Automake may want to do more extensive testing to find a suitable Awk; 
> the above went into a script that remains generic when installed and so 
> must run its tests every time the user invokes it, so "quick" was a high 
> priority.

i noticed that autoconf uses awk.  i haven't dug deeper though to see what
language restrictions apply there.  GNU awk is obviously out, and POSIX awk
isn't so bad, but do autotools target something lower?  it doesn't quite
solve the problem anyway, as the biggest issue is interacting with the
filesystem via globs and quoting.  awk doesn't have a glob(); it has
system(), which just runs arbitrary shell code, which is what i already
have :(.
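
to be fair, the counting half of the suggestion is easy in POSIX awk.  a
sketch, with make escaping omitted and the 40/2000 limits picked arbitrarily
to mirror the existing ones:

  echo "$py_files" | tr ' ' '\n' |
  awk 'n && (n >= 40 || len + length($0) + 1 > 2000) {
         print line; line = ""; n = 0; len = 0
       }
       { line = line (n ? " " : "") $0; n++; len += length($0) + 1 }
       END { if (n) print line }' |
  while read files; do rm -f $files; done

but the expansion and quoting still land back in the shell at the end, which
is the part that actually hurts.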

if we're at the point where we have to probe the functionality of tools, i
think probing for xargs (or find) is simpler.  we can leverage it if
available, and otherwise fall back to doing one `rm` per file.  that should
perform well on the vast majority of systems while not breaking anyone
anywhere.
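
something along these lines, where list_files is a hypothetical stand-in for
the glob-expanded, newline-delimited pipeline above, and the probe would
really happen once at configure time rather than inline:

  # crude probe: do we have a usable xargs at all?
  if echo x | xargs echo >/dev/null 2>&1; then
    # xargs exec()s rm itself, so each batch is expanded exactly once
    list_files | xargs -n40 rm -f
  else
    # portable fallback: one rm per file, slow but unbreakable
    list_files | while read f; do rm -f "$f"; done
  fi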
-mike
