Re: echo -e produces no outputs

2009-07-17 Thread Bo Borgerson

Bauke Jan Douma wrote:

Eric Blake wrote on 07/17/2009 09:09 PM:

Instead of using echo (which POSIX itself admits is fraught with
portability problems), use printf:

printf -- '-e\n'



or:
echo -n - ; echo e


or:
echo -e \0055e

;)
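
For reference, this is roughly what the above produce with a GNU coreutils
echo and printf (a shell builtin echo may behave differently):

$ echo -e        # -e is consumed as an option, so only a newline is printed

$ printf -- '-e\n'
-e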

Bo




Re: [PATCH] tests: refactor to use the new getlimits utility

2008-12-12 Thread Bo Borgerson
Pádraig Brady wrote:
 I'd especially like a review of the perl bits


Hi Pádraig,

I'm not sure that this function will behave quite as you intended it to:

+sub getlimits()
+{
+  my $NV;
+  open NV, "getlimits |" or die "Error running getlimits\n";
+  my %limits = map {split /=|\n/} <NV>;
+  return \%limits;
+}
+

I think that filehandle is opened using a broader scope than you might
be expecting.  It's not using your subroutine-scoped lexical $NV, but
rather a package-scoped NV symbol (a bareword filehandle).  If you open
$NV and read using <$NV> it will use your subroutine-scoped lexical
variable, as I think you intended.

Please see the attached demonstration of this:

$ ./filehandles.pl
limits: 'foo'
caller: ''

$ ./filehandles.pl safe
limits: 'foo'
caller: 'bar'

Thanks,

Bo
#!/usr/bin/perl

my ($safe) = @ARGV;

system ("echo -n foo > test_foo");
system ("echo -n bar > test_bar");

sub getlimits {
  my $NV;

  # This clobbers caller's NV
  open NV, "<", "test_foo" or die "Failed to open test_foo: $!";
  <NV>;
}

sub getlimits_safe {
  my $NV;

  # This uses subroutine-scoped $NV
  open $NV, "<", "test_foo" or die "Failed to open test_foo: $!";
  <$NV>;
}

open NV, "<", "test_bar" or die "Failed to open test_bar: $!";

my $limits = $safe ? getlimits_safe : getlimits;
my $caller = <NV>;

print "limits: '$limits'\n";
print "caller: '$caller'\n";


Re: [bug #24974] Document that comm's option -1, -2 and -3 can be combined

2008-12-01 Thread Bo Borgerson
Pádraig Brady wrote:
 
 p.s. Those new --check-order --nocheck-order options confuse me.
 When they were added I only took a quick look at the implementation
 rather than the interface (which Bo Borgerson kindly sped up for us).
 Perhaps something like this would be clearer:
 
   --check-order={none,mismatch,unsorted}
   By default --check-order=mismatch is enabled.
 
 I suppose it's too late to change now.
 

Hi Pádraig,

If I remember correctly the three possibilities are effectively severity
levels for an 'out-of-order' exception:

--nocheck-order = SILENT (don't actually check)
  [DEFAULT] = WARNING
--check-order   = FATAL

For me "mismatch" and "unsorted" aren't obvious keywords, but I can see
how an argument to the --check-order option could be clearer than the
current interface.

Would an _optional_ argument using a scheme like the one you suggested
above be worth providing?  I suspect it might actually add to confusion
due to the need for continued support for the current scheme as well,
but it should be possible to allow both:

--nocheck-order = --check-order=none
  [DEFAULT] = --check-order=warning
--check-order   = --check-order=fatal
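
Roughly, with a deliberately out-of-order first input (the file names here
are only for illustration):

$ printf '2\n1\n' > unordered; printf '1\n2\n' > ordered
$ comm --nocheck-order unordered ordered   # SILENT: no order checking at all
$ comm unordered ordered                   # WARNING: diagnose the disorder, keep going
$ comm --check-order unordered ordered     # FATAL: diagnose the disorder and fail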

Thanks,

Bo




Re: [RFC] wc: add new option, --progress

2008-11-03 Thread Bo Borgerson
Pádraig Brady wrote:
 I'm not sure this is generally that useful.
 It reminds me of the more general pv tool that
 I have found useful in the past:
 http://www.ivarch.com/programs/quickref/pv.shtml
 

Thanks Pádraig, this is exactly what I was looking for :)

Bo




[RFC] wc: add new option, --progress

2008-10-31 Thread Bo Borgerson
Hi,

I've recently found myself wishing for an option in 'wc' that shows
progress during an invocation.  I modified my local copy with the
changes in the attached patch to accept a '--progress' option.

This patch is also available at git://repo.or.cz/coreutils/bo.git

An example of behavior can be observed with the attached 'slowrite' program:

$ ./slowrite 10 | src/wc --progress
 10  20  80

$ src/wc --progress <(./slowrite 10 3) <(./slowrite 100 4)
 10  20  80 /dev/fd/63
100 200 800 /dev/fd/62
110 220 880 total

Of course these examples don't show any difference from an ordinary wc
invocation once they're complete. ;)

Here's a view when stdout isn't attached to a terminal:

$ ./slowrite 10 | src/wc --progress > log &
[1] 19880
$ tail -f log
   8704   17408   69632
  16896   33792  135168
  25088   50176  200704
  33792   67584  270336
  41984   83968  335872
  50176  100352  401408
  58368  116736  466944
  67072  134144  536576
  75264  150528  602112
  83456  166912  667648
  92160  184320  737280
 10  20  80

Would this be useful to anyone else?

Thanks,

Bo
From 80089c8b1d616ec3c3a88b4f58131506a5aa43f3 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 29 Oct 2008 11:00:06 -0400
Subject: [PATCH] wc: add new option, --progress

* src/wc.c (set_do_progress): ALRM handler; sets the do_progress flag that
triggers output.
(write_counts): Unset the do_progress flag.  New argument, is_final, set to
true for the final invocation for each input.  If stdout is connected to a
terminal, write all counts for a single input on a single line.
(wc): Check the do_progress flag in the read loop and call write_counts if set.
(main): Set up an ALRM sigaction handler to call set_do_progress.
---
 src/wc.c |   89 ++---
 1 files changed, 78 insertions(+), 11 deletions(-)

diff --git a/src/wc.c b/src/wc.c
index 0bb1929..771d6f1 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -21,6 +21,7 @@
 
 #include <stdio.h>
 #include <getopt.h>
+#include <signal.h>
 #include <sys/types.h>
 #include <wchar.h>
 #include <wctype.h>
@@ -66,6 +67,9 @@ static int number_width;
 /* True if we have ever read the standard input. */
 static bool have_read_stdin;
 
+/* Set by sig ALRM handler, triggers output at convenience.  Unset at output. */
+static bool do_progress = false;
+
 /* The result of calling fstat or stat on a file descriptor or file.  */
 struct fstatus
 {
@@ -81,7 +85,8 @@ struct fstatus
non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
 enum
 {
-  FILES0_FROM_OPTION = CHAR_MAX + 1
+  FILES0_FROM_OPTION = CHAR_MAX + 1,
+  PROGRESS_OPTION
 };
 
 static struct option const longopts[] =
@@ -92,6 +97,7 @@ static struct option const longopts[] =
   {"words", no_argument, NULL, 'w'},
   {"files0-from", required_argument, NULL, FILES0_FROM_OPTION},
   {"max-line-length", no_argument, NULL, 'L'},
+  {"progress", no_argument, NULL, PROGRESS_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -122,6 +128,7 @@ read standard input.\n\
       --files0-from=F    read input from the files specified by\n\
                            NUL-terminated names in file F\n\
   -L, --max-line-length  print the length of the longest line\n\
+      --progress         print counts every second until complete\n\
   -w, --words            print the word counts\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -131,6 +138,13 @@ read standard input.\n\
   exit (status);
 }
 
+static void
+set_do_progress (int sig)
+{
+  do_progress = true;
+  alarm (1);
+}
+
 /* FILE is the name of the file (or NULL for standard input)
associated with the specified counters.  */
 static void
@@ -139,39 +153,58 @@ write_counts (uintmax_t lines,
 	  uintmax_t chars,
 	  uintmax_t bytes,
 	  uintmax_t linelength,
-	  const char *file)
+	  const char *file,
+	  bool is_final)
 {
   static char const format_sp_int[] = " %*s";
   char const *format_int = format_sp_int + 1;
   char buf[INT_BUFSIZE_BOUND (uintmax_t)];
+  static size_t plen = 0;
+
+  if (0 < plen && isatty (STDOUT_FILENO))
+    {
+      size_t i;
+      for (i = 0; i < plen; i++)
+        putchar (0x08); /* Backspace. */
+      plen = 0;
+    }
 
   if (print_lines)
 {
-  printf (format_int, number_width, umaxtostr (lines, buf));
+  plen += printf (format_int, number_width, umaxtostr (lines, buf));
   format_int = format_sp_int;
 }
   if (print_words)
 {
-  printf (format_int, number_width, umaxtostr (words, buf));
+  plen += printf (format_int, number_width, umaxtostr (words, buf));
   format_int = format_sp_int;
 }
   if (print_chars)
 {
-  printf (format_int, number_width, umaxtostr (chars, buf));
+  plen += printf (format_int, number_width, umaxtostr (chars, buf));
   format_int = format_sp_int;
 }
   if (print_bytes)
 {
-  printf (format_int

Re: RFC: wc --max-line-length vs. TABs [Re: Bug in wc]

2008-08-22 Thread Bo Borgerson
Jim Meyering wrote:
 
 I'm tempted to make the change, but it seems too drastic, after 11 years.
 Do any of you rely on the current TAB-counting behavior of GNU wc?
 

Hi,

It looks like TAB characters aren't alone in being counted by printed
width rather than by character count:

$ echo '好' | wc -L
2
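
For comparison, TAB is also counted by display position rather than as a
single character (assuming the default 8-column tab stops):

$ printf 'a\tb\n' | wc -L
9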

Does it make sense to change the behavior for TAB, but not for wide
characters?

Bo
diff --git a/src/wc.c b/src/wc.c
index 0bb1929..b3f1ab2 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -378,7 +378,7 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
 		{
 		  int width = wcwidth (wide_char);
 		  if (width > 0)
-			linepos += width;
+			linepos ++;
 		  if (iswspace (wide_char))
 			goto mb_word_separator;
 		  in_word = true;


Re: sort -m does too much work

2008-08-05 Thread Bo Borgerson
David Muir Sharnoff wrote:
 I've got 200 1GB pre-sorted files.   If I try to merge
 them with sort -m, it is obvioulsy trying to do too much
 work: after running for a couple minutes, it has not
 produced any output but it has made a 5 GB temporary
 file.
 
 When the input is pre-sorted, no temporary file should
 be required.
 
 Output should begin immediately.


Hi David,

The reason you're not seeing output immediately is that sort
internally limits the number of files it will read at once.  By default
this limit is set to 16.  When more files are to be merged, sort uses
temporary files.

Starting in release 7.0 this limit will be adjustable on the
command line using the --batch-size=N option.  With 200 files you'll
still need to balance your desire for immediate output against the
performance implications of reading from so many files at once, but the
choice will be yours.
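
For example, once that option is available, a single merge pass over all 200
inputs would look roughly like this (file names are only illustrative):

$ sort -m --batch-size=200 part-*.sorted > merged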

Thanks,

Bo




Re: Multi-threading in sort(or core-utils)

2008-06-28 Thread Bo Borgerson
James Youngman wrote:
 On Thu, Jun 26, 2008 at 1:22 AM, Bo Borgerson [EMAIL PROTECTED] wrote:
 If all inputs are regular files then SORTERS read directly rather than
 being fed by an extra process.
 
 Does that work with multi-byte character sets?


Hi James,

Each sorter's portion of input is delineated along line boundaries as
detected by the main buffer-filling routine.  I don't think any
multi-byte character set problems should have been introduced.

What type of issue specifically concerns you?  I'm going to start
putting together some tests soon, and I'd like to include a test case
that would exercise the type of bug you have in mind.
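
For example, one cheap sanity check (the locale, data, and file names are
only illustrative) would be to compare concurrent output against a plain
sort on multi-byte input:

$ printf 'b好\na好\n' > mb.in
$ LC_ALL=en_US.UTF-8 ~/sort --concurrency=2 mb.in > out.concurrent
$ LC_ALL=en_US.UTF-8 sort mb.in | cmp - out.concurrent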

Thanks,

Bo




Re: Multi-threading in sort(or core-utils)

2008-06-25 Thread Bo Borgerson
Bo Borgerson wrote:
 Cons:
 - It's limited by the feeder.  If the sorters were able to read at their
 own pace I think this would scale better.
 - It uses N+2 processes.  When sorters are run in parallel there are two
 helper processes, one feeding input and one merging output.

Hello again.

In light of the drawbacks mentioned for my previous parallel sort patch,
I've made the attached modifications.

If all inputs are regular files then SORTERS read directly rather than
being fed by an extra process.

This exhibits better performance with a concurrency of 2, but still does
not realize the full benefit of greater concurrency that I was expecting:

-

$ for i in 0 1 2 3; do cat /dev/urandom | base64 | head -200 | cut
-da -f1 > p$i; done

$ time ~/sort p0 p1 p2 p3 > /dev/null

real    0m31.444s

$ time ~/sort --concurrency=2 p0 p1 p2 p3 > /dev/null

real    0m16.908s

$ time ~/sort --concurrency=4 p0 p1 p2 p3 > /dev/null

real    0m15.353s

$ time ~/sort -m <(~/sort p0 p1) <(~/sort p2 p3) > /dev/null

real    0m17.066s -- similar to --concurrency=2

$ time ~/sort -m <(~/sort p0) <(~/sort p1) <(~/sort p2) <(~/sort p3) >
/dev/null

real    0m10.832s -- _this_ is what I want from --concurrency=4!

-

Jim pointed out a mistake in my performance testing script that was
causing me to use a smaller sample size for each data point than I
intended.  I've attached a version with his patch.

Thanks,

Bo
inline: bench_output_server_2.png

From 298d6871aee1a1a506015681fb88ce5ba2b24644 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Tue, 24 Jun 2008 14:02:22 -0400
Subject: [PATCH] o sort: Don't use a feeder for regular file concurrency.

* src/sort.c (xlseek): Try to lseek, complain on failure.  Stolen from src/tail.c.
(fillbuf): If SORTER_BYTES_LEFT is non-negative, treat it as a limit.
(sort): When running multiple sorters concurrently, if all inputs are regular
files set sorters up to read directly rather than spawning a feeder to distribute
work among them.
---
 src/sort.c |  220 ---
 1 files changed, 194 insertions(+), 26 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 18a8882..4aa7609 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -307,6 +307,11 @@ static unsigned int nmerge = NMERGE_DEFAULT;
their output through a pipe to the parent who will merge. */
 static int sorter_output_fd = -1;
 
+/* If multiple sorters are each reading their own input rather
+   than being fed by a single process then they'll have a cap
+   on how much they can read. */
+static size_t sorter_bytes_left = -1;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -831,6 +836,45 @@ xfclose (FILE *fp, char const *file)
 }
 }
 
+/* Call lseek with the specified arguments, where file descriptor FD
+   corresponds to the file, FILENAME.
+   Give a diagnostic and exit nonzero if lseek fails.
+   Otherwise, return the resulting offset.
+
+   This is stolen from src/tail.c */
+
+static off_t
+xlseek (int fd, off_t offset, int whence, char const *filename)
+{
+  off_t new_offset = lseek (fd, offset, whence);
+  char buf[INT_BUFSIZE_BOUND (off_t)];
+  char *s;
+
+  if (0 <= new_offset)
+return new_offset;
+
+  s = offtostr (offset, buf);
+  switch (whence)
+{
+case SEEK_SET:
+  error (0, errno, _("%s: cannot seek to offset %s"),
+	 filename, s);
+  break;
+case SEEK_CUR:
+  error (0, errno, _("%s: cannot seek to relative offset %s"),
+	 filename, s);
+  break;
+case SEEK_END:
+  error (0, errno, _("%s: cannot seek to end-relative offset %s"),
+	 filename, s);
+  break;
+default:
+  abort ();
+}
+
+  exit (EXIT_FAILURE);
+}
+
 static void
 dup2_or_die (int oldfd, int newfd)
 {
@@ -1556,7 +1600,8 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
   size_t line_bytes = buf->line_bytes;
   size_t mergesize = merge_buffer_size - MIN_MERGE_BUFFER_SIZE;
 
-  if (buf->eof)
+
+  if (buf->eof || 0 == sorter_bytes_left)
 return false;
 
   if (buf->used != buf->left)
@@ -1582,9 +1627,23 @@ fillbuf (struct buffer *buf, FILE *fp, char const *file)
 	 rest of the input file consists entirely of newlines,
 	 except that the last byte is not a newline.  */
 	  size_t readsize = (avail - 1) / (line_bytes + 1);
-	  size_t bytes_read = fread (ptr, 1, readsize, fp);
-	  char *ptrlim = ptr + bytes_read;
+	  size_t bytes_read;
+	  char *ptrlim;
 	  char *p;
+
+	  if (0 < sorter_bytes_left)
+	    readsize = MIN (readsize, sorter_bytes_left);
+
+	  bytes_read = fread (ptr, 1, readsize, fp);
+
+	  if (0 < sorter_bytes_left)
+	    sorter_bytes_left -= bytes_read;
+
+	  /* This is a fake end-of-file for this sorter child. */
+	  if (0 == sorter_bytes_left)
+	    buf->eof = true;
+
+	  ptrlim = ptr + bytes_read;
 	  avail -= bytes_read;
 
 	  if (bytes_read != readsize)
@@ -2627,6 +2686,9 @@ sort (char * const *incoming_files, size_t nfiles

Re: who(1) exit status

2008-06-23 Thread Bo Borgerson
Andreas Schwab wrote:
 Eric Blake [EMAIL PROTECTED] writes:
 
 According to Shal-Linux-Ind on 6/23/2008 4:05 AM:
 | Hi,
 |
 | who(1) exit status is always 0.
 |
 | $ who --v
 | who (coreutils) 5.2.1

 Thanks for the report.  Consider upgrading - that is several years old,
 and the latest stable version is 6.12.  But I have confirmed that the
 issue still exists in git beyond 6.12:

 $ who /nosuch/file; echo $?
 0
 
 See the comment in read_utmp:
 
   /* Ignore the return value for now.
  Solaris' utmpname returns 1 upon success -- which is contrary
  to what the GNU libc version does.  In addition, older GNU libc
  versions are actually void.   */
   UTMP_NAME_FUNCTION (file);
 
 When using the utmpname/setutent/getutmp family of functions there
 really is no way to check for errors reading the file, since utmpname
 does not actually try to open it, and setutent has no return value.

Hi,

So it sounds like there's no portable way to distinguish between:

1. an error trying to look up information
2. no information to be found

Would it make sense, though, to return a nonzero exit code when output
is empty in either case?
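
In the meantime a script can approximate that itself, along these lines:

$ who /nosuch/file | grep -q . || echo 'no entries (or unreadable file)'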

Thanks,

Bo




Re: Multi-threading in sort(or core-utils)

2008-06-22 Thread Bo Borgerson
Paul Eggert wrote:
 Bo Borgerson [EMAIL PROTECTED] writes:
 
 Does this sound like a step in the right direction for sort?  If I were
 to clean this up and submit it would you be willing to assess its
 viability as a portable improvement?
 
 Yes, and yes.  And thanks!

Hi Paul,

When I isolated my parallel merge enhancement I discovered that the
improvement I was seeing was mostly the result of my not having properly
divided resources (particularly NMERGE) among children.  Aside from some
benefit to unique merges with many duplicates I wasn't able to produce
satisfactory results using this approach.

So I started from scratch on a parallel bulk sort enhancement.  Here I
was able to see some modest but reliable improvement.

The approach I took was to divide the main work among a number of
children (sorters) by using an additional child (the feeder) to read
inputs and distribute data among them.  The parent then merges sorter
output.

This approach has some pros and cons in my view:

Pros:
- It's simple.  Sorters don't need to worry about what their siblings
are doing.  They just process the data they're fed.
- It doesn't require a known amount of data.  Work is distributed among
sorters by the feeder in small interleaved chunks.
- It doesn't require regular files.  Data coming through FIFOs or from
sub-pipelines via process substitution is no problem.
- It limits increased resource consumption.  The feeder is the only
process reading from disk until/unless the sorters need temporary files.
 NMERGE and SORT_BUFFER_SIZE are divided among sorters.

Cons:
- It's limited by the feeder.  If the sorters were able to read at their
own pace I think this would scale better.
- It uses N+2 processes.  When sorters are run in parallel there are two
helper processes, one feeding input and one merging output.

I've attached the results of some performance testing I did both on my
laptop (dual core) and on a server (4x dual core, hyper-threaded == 16
processors visible).  I included two graphs of the server results
which I thought were interesting.  Each line represents a level of
concurrency.  One graph shows time in seconds on the Y axis while the
other shows percentage of single-process time.  In both cases lower
lines indicate better performance.  As you can see, even on the 16
processor machine performance peaks at a low concurrency.

I included the script I used for testing in case anyone else is
interested and has a machine they're willing to run hot for a few hours.

The attached patch is also available for fetch from
git://repo.or.cz/coreutils/bo.git as branch 'sort'.

I haven't included any tests or documentation in the patch yet.  I was
hoping to first get a sense of whether you and other more experienced
coreutils developers consider this alternate approach to be worth pursuing.

Thanks,

Bo
inline: bench_output_server_seconds.png
inline: bench_output_server_percentage.png

From e31f3f11a2d06079182ae7892e3af280dc4044cc Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 18 Jun 2008 09:59:46 -0400
Subject: [PATCH] sort: Add new option --concurrency=N.

* src/sort.c (xfopen): Take an additional argument, FD. If FILE is NULL
then fdopen FD instead.  If FD is -1, use STDOUT as before.
(specify_concurrency): Process the --concurrency=N option value.
(check): Use new xfopen calling convention.  Pass -1 for FD.
(mergefps): Use new xfopen calling convention.  Pass the FILE's FD for input
and SORTER_OUTPUT_FD for output.
(sort): If MAX_CONCURRENCY allows, try to fork off SORTER children.
Fork off a final child (the FEEDER) to read inputs and distribute among
SORTER children.  Merge SORTER output in the parent.
---
 src/sort.c |  288 ++-
 1 files changed, 263 insertions(+), 25 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 2039dab..18a8882 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -101,6 +101,11 @@ enum
 
 enum
   {
+/* The number of times we should try to fork a child to help with
+   a large sort.  We can always sort everything ourselves if need
+   be so this number can be small. */
+MAX_FORK_TRIES_SORT = 2,
+
 /* The number of times we should try to fork a compression process
(we retry if the fork call fails).  We don't _need_ to compress
temp files, this is just to reduce disk access, so this number
@@ -223,6 +228,10 @@ static struct month monthtab[] =
   {SEP, 9}
 };
 
+/* How much data a parallel sort feeder will give to each sorter
+   at a time. */
+#define FEEDER_BUF_SIZE 65536
+
 /* During the merge phase, the number of files to merge at once. */
 #define NMERGE_DEFAULT 16
 
@@ -285,10 +294,19 @@ static struct keyfield *keylist;
 /* Program used to (de)compress temp files.  Must accept -d.  */
 static char const *compress_program;
 
+/* Maximum number of sorters that may be run in parallel.
+   This can be modified on the command line with the --concurrency
+   option. */
+static unsigned

sort --batch-size non-merge bug

2008-06-19 Thread Bo Borgerson
Hi,

I'm embarrassed to say that I've discovered a bug in the recently added
--batch-size option of sort.

If --batch-size is used with a non-merge sort (to govern the merge of
temp files), and there is no --buffer-size set in conjunction, then the
minimum SORT_SIZE will be enforced resulting in severe performance
degradation.
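
A rough way to see the effect (the input name and size are only illustrative):

$ time sort big.txt > /dev/null
$ time sort --batch-size=32 big.txt > /dev/null   # much slower before the fix:
                                                  # the buffer falls to the minimum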

I've attached a fix for this bug, including a test that exercises it.
I've also pushed to repo.or.cz.

Sorry for introducing this.

Thanks,

Bo
From 91aa3fb5a2636dc918bafa67f3a097d646cac075 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Thu, 19 Jun 2008 15:37:21 -0400
Subject: [PATCH] sort: Fix bug where --batch-size option shrank SORT_SIZE.

* src/sort.c (specify_nmerge, main): Only adjust SORT_SIZE if it's already set.
* tests/misc/sort-merge: Test bug fix.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 src/sort.c|   14 ++
 tests/misc/sort-merge |7 +++
 2 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/src/sort.c b/src/sort.c
index 1393521..2039dab 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -1105,14 +1105,7 @@ specify_nmerge (int oi, char c, char const *s)
 	  e = LONGINT_OVERFLOW;
 	}
 	  else
-	{
-	  /* Need to re-check that we meet the minimum
-		 requirement for memory usage with the new,
-		 potentially larger, nmerge. */
-	  sort_size = MAX (sort_size, MIN_SORT_SIZE);
-
-	  return;
-	}
+	return;
 	}
 }
 
@@ -3320,6 +3313,11 @@ main (int argc, char **argv)
   files = minus;
 }
 
+  /* Need to re-check that we meet the minimum requirement for memory
+ usage with the final value for NMERGE. */
+  if (0 < sort_size)
+    sort_size = MAX (sort_size, MIN_SORT_SIZE);
+
   if (checkonly)
 {
   if (nfiles > 1)
diff --git a/tests/misc/sort-merge b/tests/misc/sort-merge
index a2524c4..fb7c63c 100755
--- a/tests/misc/sort-merge
+++ b/tests/misc/sort-merge
@@ -27,6 +27,8 @@ my $prog = 'sort';
 # three empty files and one that says 'foo'
 my @inputs = (+(map{{IN=> {"empty$_"=> ''}}}1..3), {IN=> {foo=> "foo\n"}});
 
+my $big_input = "aaa\n" x 1024;
+
 # don't need to check for existence, since we're running in a temp dir
 my $badtmp = 'does/not/exist';
 
@@ -66,6 +68,11 @@ my @Tests =
 ['nmerge-no', "-m --batch-size=2 -T$badtmp", @inputs,
     {ERR_SUBST=>"s|: $badtmp/sort.+||"},
     {ERR=>"$prog: cannot create temporary file\n"}, {EXIT=>2}],
+
+ # This used to fail because setting batch-size without also setting
+ # buffer size would cause the buffer size to be set to the minimum.
+ ['batch-size', "--batch-size=16 -T$badtmp", {IN=> {big=> $big_input}},
+	{OUT=>$big_input}],
 );
 
 my $save_temps = $ENV{DEBUG};
-- 
1.5.4.3



Re: Feature request - base64 Filename Safe Alphabet

2008-06-18 Thread Bo Borgerson
Simon Josefsson wrote:
 Christopher Kerr [EMAIL PROTECTED] writes:
 
 After being burned by using `head -c6 /dev/urandom | base64` as part of a 
 directory name, I realised that it would be useful if base64 had an option 
 to 
 generate URL and Filename safe encodings, as specified in RFC 3548 section 4.

 This would make
 cat FILE | base64 --filename-safe
 equivalent to
 cat FILE | base64 | tr '+/' '-_'
 using the current coreutils tools.
 
 I think --filename-safe is a good idea.  The documentation should
 discuss the potential for generating files starting with '-' or '--'.
 Patching gnulib's base64.c to support an arbitrary alphabet seems messy.
 Patches welcome though.

Hi Simon,

I thought I'd take a stab at this and see where it goes.

What I've done is exposed an additional set of functions, *_a, which
take an arbitrary alphabet as an extra parameter.  Each historical
function now calls one of these with the 'main' alphabet.  I then added
a parallel set of functions, *_filesafe, which call the *_a functions
with the alphabet described above.

It is a little messy, I think, because the large hand-initialized
data-structures are duplicated.  The messiness could be reduced by
having base64 just expose the *_a interface for using an arbitrary
alphabet, and adding a second module (base64_filesafe?) that provided
that specific alternate (with all its attendant bulk).

In any case, as with my previous patches I've tried not to alter the
behavior of any already existing functions.

I've also attached a small patch against coreutils' base64 utility that
provides the desired behavior.  There are no documentation/tests/etc
yet.  It's only for demonstration purposes.

How does this look to you?

Thanks,

Bo
From fcb70d9fdd1c7979f0e3ee499a48248fd771 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 18 Jun 2008 19:16:01 -0400
Subject: [PATCH] base64: Provide an interface for alphabet configuration and a filesafe alphabet.

* lib/base64.c (base64_encode_a): Was base64_encode.  Takes an alphabet.
(base64_encode_alloc_a): Was base64_encode_alloc. Takes an alphabet.
(isbase64_a): Was isbase64.  Takes an alphabet.
(isbase64, isbase64_filesafe): Call isbase64_a with the appropriate alphabet.
(decode_4): Takes an alphabet.
(base64_decode_ctx_a): Was base64_decode_ctx. Takes an alphabet.
(base64_decode_alloc_ctx_a): Was base64_decode_alloc_ctx. Takes an alphabet.
* lib/base64.h (base64_encode): Now a wrapper around base64_encode_a.
(base64_encode_filesafe): Likewise.
(base64_encode_alloc): Now a wrapper around base64_encode_alloc_a.
(base64_encode_alloc_filesafe): Likewise.
(base64_decode_ctx): Now a wrapper around base64_decode_ctx_a.
(base64_decode_ctx_filesafe): Likewise.
(base64_decode): Likewise.
(base64_decode_alloc_ctx): Now a wrapper around base64_decode_alloc_ctx_a.
(base64_decode_alloc_ctx_filesafe): Likewise.
(base64_decode_alloc): Likewise.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 lib/base64.c |  327 ++---
 lib/base64.h |   54 --
 2 files changed, 287 insertions(+), 94 deletions(-)

diff --git a/lib/base64.c b/lib/base64.c
index 8aff430..01baa62 100644
--- a/lib/base64.c
+++ b/lib/base64.c
@@ -61,17 +61,22 @@ to_uchar (char ch)
   return ch;
 }
 
+const char b64str_main[64] =
+  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
+
+const char b64str_filesafe[64] =
+  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";
+
+
 /* Base64 encode IN array of size INLEN into OUT array of size OUTLEN.
If OUTLEN is less than BASE64_LENGTH(INLEN), write as many bytes as
possible.  If OUTLEN is larger than BASE64_LENGTH(INLEN), also zero
terminate the output buffer. */
 void
-base64_encode (const char *restrict in, size_t inlen,
-	   char *restrict out, size_t outlen)
+base64_encode_a (const char *restrict in, size_t inlen,
+		 char *restrict out, size_t outlen,
+		 const char *b64str)
 {
-  static const char b64str[64] =
-    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
-
   while (inlen  outlen)
 {
   *out++ = b64str[(to_uchar (in[0]) >> 2) & 0x3f];
@@ -113,7 +118,8 @@ base64_encode (const char *restrict in, size_t inlen,
indicates length of the requested memory block, i.e.,
BASE64_LENGTH(inlen) + 1. */
 size_t
-base64_encode_alloc (const char *in, size_t inlen, char **out)
+base64_encode_alloc_a (const char *in, size_t inlen, char **out,
+		   const char *b64str)
 {
   size_t outlen = 1 + BASE64_LENGTH (inlen);
 
@@ -153,7 +159,7 @@ base64_encode_alloc (const char *in, size_t inlen, char **out)
 
IBM C V6 for AIX mishandles #define B64(x) ...'x'..., so use _
as the formal parameter rather than x.  */
-#define B64(_)	\
+#define B64M(_)	\
   ((_) == 'A' ? 0\
: (_) == 'B' ? 1\
: (_) == 'C' ? 2\
@@ -220,71 +226,206 @@ base64_encode_alloc (const char *in, size_t inlen, char **out)
: (_) == '/' ? 63

Re: rebased patches?

2008-06-16 Thread Bo Borgerson
Jim Meyering wrote:
 This brings up another (as yet unwritten) guideline:
   Don't change translatable strings if you can avoid it.
   If you must rearrange lines, extract and create new strings, rather than
   extracting and moving into existing blocks.  This avoids making unnecessary
   work for translators.


Hi Jim,

I've pushed a version of the sort branch that contains the following
updates:

1. Try to minimize changes to translatable strings
2. Improve diagnostic messages for files0-from edge-cases
3. Use the new standardized files0-from test script format
4. Avoid use of the '' operator
5. Follow the log message summary template in HACKING


 
 Yes, I'll add this to HACKING RSN ;-)


If it helps I've pushed a branch called HACKING that adds a new section
for translatability tips and includes your guideline above verbatim. ;)

Thanks,

Bo




Re: Multi-threading in sort(or core-utils)

2008-06-16 Thread Bo Borgerson
Paul Eggert wrote:
 [EMAIL PROTECTED] wrote:
 I think it is good idea to make option(or by default) for sorting
 in threads to increase performance on systems that might execute
 more than one thread in parallel.
Klimentov Konstantin.
 
 I agree.  That's been on my to-do list for years.  (It shouldn't be
 that hard, if you ignore portability hassles.  :-)
 

Hi Paul,

I've modified my local sort to parallelize large merges by dividing
inputs among a number of children whose outputs are merged by the parent.

This only benefits large bulk sorts indirectly by parallelizing the
merge of temp files, but it can still provide a performance improvement.

I suspect my implementation does ignore some portability hassles, but
only because I haven't encountered them yet. :)

Does this sound like a step in the right direction for sort?  If I were
to clean this up and submit it would you be willing to assess its
viability as a portable improvement?

Thanks,

Bo




Re: Multi-threading in sort(or core-utils)

2008-06-13 Thread Bo Borgerson
[EMAIL PROTECTED] wrote:
 Hello
 Few minutes ago i used sort -u for sorting big file(236 Mb). I have 2 core 
 cpu(core 2 duo), but i found that sort use only one cpu(work in one thread). 
 I think it is good idea to make option(or by default) for sorting in threads 
 to increase performance on systems that might execute more than one thread in 
 parallel.
Klimentov Konstantin.

Hi,

If you're using a shell that supports process substitution you could try
splitting your file in half and putting the bulk sorts of each half as
inputs to a merge:

So if you were doing:

$ sort bigfile

You could do:

$ sort -m <(sort bigfile.firsthalf) <(sort bigfile.secondhalf)
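
If the file isn't already split, something like this produces two roughly
equal halves to feed the merge (split picks the suffixes, typically .aa
and .ab):

$ split -l $(( ($(wc -l < bigfile) + 1) / 2 )) bigfile bigfile.
$ sort -m <(sort bigfile.aa) <(sort bigfile.ab)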

Bo




Re: rebased patches?

2008-06-12 Thread Bo Borgerson
Jim Meyering wrote:
 Also, I made some syntactic changes to fit with my policy
 preferences (no  operators, and adjusted const placement):

Thanks Jim.

BTW - Do you have these policy preferences collected somewhere?  I don't
remember seeing some of them in the general GNU standards document.  If
they are coreutils-specific would it make sense to have a section in
HACKING (or maybe a dedicated standards.txt supplement) to explain them?

I think that could be a useful educational tool for relatively
inexperienced contributors (like me), and could help reduce the noise in
patches that are submitted to the list.

Thanks,

Bo




Re: [OT] Is od broken?

2008-06-12 Thread Bo Borgerson
Eric Blake wrote:
 OK, I'll keep them as separate commits.  Bo inspired me, and I finally 
 figured 
 out how to use repo.or.cz.  Now you can do:
 git fetch git://repo.or.cz/coreutils/ericb.git refs/heads/od
 
 to see my patch series.


Awesome!

That was actually Jim's suggestion, but I'm glad to see it get more use.  :)

BTW - Another nice way to fetch your branch might be:

$ git fetch git://repo.or.cz/coreutils/ericb.git od:od

Which creates a local `od' branch with your patch series.

If I already had an `od' branch of my own, I could specify an alternate
local branch name:

$ git fetch git://repo.or.cz/coreutils/ericb.git od:od-ericb


Bo




Re: du v5.93: traverses subdirectories although --separate-dirs and --summarize are set?

2008-05-29 Thread Bo Borgerson
Volker Badziong wrote:
 Hello,
 
 I am running du (GNU coreutils) 5.93. When executing e.g.
 
  du --separate-dirs --summarize  /etc/
 
 you are only interested in the total space consumed by stuff in /etc/, not 
 within any subfolders. But nevertheless du traverses all subdirectories, 
 regardless if --summarize is set or not. This has no effect on the produced 
 output / numeric result for /etc/.
 
 Is there a reason the traversal still happens? This causes a lot of (in my 
 humble opinion) unnecessary IO.
 
 Here is a sample output of running with and without --summarize. Numbers for 
 /etc/ are identical, but IO happens in both cases the same.
 
 somehost:/ # du  --separate-dirs --block-size=1   /etc/
 94208   /etc/udev/rules.d
 ...
 3076096 /etc/
 
 somehost:/ # du  --separate-dirs --block-size=1  --summarize  /etc/
 3076096 /etc/


I think du actually has to traverse the whole tree in both cases.  The
difference as I understand it with `--summarize' is that information is
only _printed_ for the top level.

Bo





Re: horrible utf-8 performace in wc

2008-05-08 Thread Bo Borgerson
Bruno Haible wrote:
 If you want wc to count characters after canonicalization, then you can
 invent a new wc command-line option for it. But I would find it more useful
 to have a filter program that reads from standard input and writes the
 canonicalized output to standard output; that would be applicable in many
 more situations.


I like the sound of that!

I suppose the not-yet-implemented gnulib Unicode normalization library
you mentioned in another post would be a prerequisite for such a tool.

I'm definitely interested in helping out here, but I think someone with
a more thorough understanding of Unicode would probably be more useful
(Pádraig?)

Bo




Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Pádraig Brady wrote:
 canonically équivalent
 canonically équivalent
 
 Pádraig.
 
 p.s. I Notice that gnome-terminal still doesn't handle
 combining characters correctly, and my mail client thunderbird
 is putting the accent on the q rather than the e, sigh.

They both render correctly here (Thunderbird 2.0.0.12).

Is there a good library for combining-character canonicalization
available?  That seems like something that would be useful to have in a
lot of text-processing tools.  Also, for Unicode, something to shuffle
between the normalization forms might be helpful for comparisons.

I may be misinterpreting your patch, but it seems to me that
decrementing count for zero-width characters could potentially lead to
confusion.  Not all zero-width characters are combining characters, right?

Bo




Re: horrible utf-8 performace in wc

2008-05-07 Thread Bo Borgerson
Jim Meyering wrote:
 Bo Borgerson [EMAIL PROTECTED] wrote:
 I may be misinterpreting your patch, but it seems to me that
 decrementing count for zero-width characters could potentially lead to
 confusion.  Not all zero-width characters are combining characters, right?
 
 It looks ok to me, since there's an unconditional increment
 
 chars++;
 
 about 25 lines above, so the decrement would just undo that.


Right, I guess my question is more about the semantics of `wc -m'.
Should stand-alone zero-width characters such as the zero-width space be
counted?

The attached (UTF-8) file contains 3 characters according to HEAD, but
only two with the patch.

Bo
a​b


Re: Feature request - base64 Filename Safe Alphabet

2008-05-05 Thread Bo Borgerson
Jim Meyering wrote:
 I found strict_newlines to be a little unclear.
 If you use something like ignore_newlines instead, that's not
 only clearer to me, but with its reversed semantics it also lets
 you avoid three negations.


Thanks, that's much nicer.

The attached patch contains this change and is rebased against the
current HEAD.

I've also made this available via:

$ git fetch git://repo.or.cz/coreutils/bo.git base64-merge:base64-merge


Thanks,

Bo
From 9131d82c32e00b606eb79d083ef8309178460ac5 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 30 Apr 2008 17:40:38 -0400
Subject: [PATCH] An upstream compatible base64

* gl/lib/base64.c (base64_decode_ctx): If no context structure was passed in,
treat newlines as garbage (this is the historical behavior).  Formerly
base64_decode.
(base64_decode_alloc_ctx): Formerly base64_decode_alloc.
* gl/lib/base64.h (base64_decode): Macro for four-argument calls.
(base64_decode_alloc): Likewise.
* src/base64.c (do_decode): Call base64_decode_ctx instead of base64_decode.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 gl/lib/base64.c |   45 +++--
 gl/lib/base64.h |   19 +--
 src/base64.c|2 +-
 3 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/gl/lib/base64.c b/gl/lib/base64.c
index 43f12c6..a33f102 100644
--- a/gl/lib/base64.c
+++ b/gl/lib/base64.c
@@ -449,20 +449,32 @@ decode_4 (char const *restrict in, size_t inlen,
Initially, CTX must have been initialized via base64_decode_ctx_init.
Subsequent calls to this function must reuse whatever state is recorded
in that buffer.  It is necessary for when a quadruple of base64 input
-   bytes spans two input buffers.  */
+   bytes spans two input buffers.
+
+   If CTX is NULL then newlines are treated as garbage and the input
+   buffer is processed as a unit.  */
 
 bool
-base64_decode (struct base64_decode_context *ctx,
-	   const char *restrict in, size_t inlen,
-	   char *restrict out, size_t *outlen)
+base64_decode_ctx (struct base64_decode_context *ctx,
+		   const char *restrict in, size_t inlen,
+		   char *restrict out, size_t *outlen)
 {
   size_t outleft = *outlen;
-  bool flush_ctx = inlen == 0;
+  bool ignore_newlines = ctx != NULL;
+  bool flush_ctx = false;
+  unsigned int ctx_i = 0;
+
+  if (ignore_newlines)
+    {
+      ctx_i = ctx->i;
+      flush_ctx = inlen == 0;
+    }
+
 
   while (true)
 {
   size_t outleft_save = outleft;
-  if (ctx->i == 0 && !flush_ctx)
+  if (ctx_i == 0 && !flush_ctx)
 	{
 	  while (true)
 	{
@@ -482,7 +494,7 @@ base64_decode (struct base64_decode_context *ctx,
 
   /* Handle the common case of 72-byte wrapped lines.
 	 This also handles any other multiple-of-4-byte wrapping.  */
-  if (inlen && *in == '\n')
+  if (inlen && *in == '\n' && ignore_newlines)
 	{
 	  ++in;
 	  --inlen;
@@ -495,12 +507,17 @@ base64_decode (struct base64_decode_context *ctx,
 
   {
 	char const *in_end = in + inlen;
-	char const *non_nl = get_4 (ctx, &in, in_end, &inlen);
+	char const *non_nl;
+
+	if (ignore_newlines)
+	  non_nl = get_4 (ctx, &in, in_end, &inlen);
+	else
+	  non_nl = in;  /* Might have nl in this case. */
 
 	/* If the input is empty or consists solely of newlines (0 non-newlines),
 	   then we're done.  Likewise if there are fewer than 4 bytes when not
-	   flushing context.  */
-	if (inlen == 0 || (inlen < 4 && !flush_ctx))
+	if (inlen == 0 || (inlen < 4 && !flush_ctx && ignore_newlines))
+	if (inlen == 0 || (inlen  4  !flush_ctx  ignore_newlines))
 	  {
 	inlen = 0;
 	break;
@@ -529,9 +546,9 @@ base64_decode (struct base64_decode_context *ctx,
input was invalid, in which case *OUT is NULL and *OUTLEN is
undefined. */
 bool
-base64_decode_alloc (struct base64_decode_context *ctx,
-		 const char *in, size_t inlen, char **out,
-		 size_t *outlen)
+base64_decode_alloc_ctx (struct base64_decode_context *ctx,
+			 const char *in, size_t inlen, char **out,
+			 size_t *outlen)
 {
   /* This may allocate a few bytes too many, depending on input,
  but it's not worth the extra CPU time to compute the exact size.
@@ -544,7 +561,7 @@ base64_decode_alloc (struct base64_decode_context *ctx,
   if (!*out)
 return true;
 
-  if (!base64_decode (ctx, in, inlen, *out, &needlen))
+  if (!base64_decode_ctx (ctx, in, inlen, *out, &needlen))
 {
 {
   free (*out);
   *out = NULL;
diff --git a/gl/lib/base64.h b/gl/lib/base64.h
index ba436e0..fa242c8 100644
--- a/gl/lib/base64.h
+++ b/gl/lib/base64.h
@@ -42,12 +42,19 @@ extern void base64_encode (const char *restrict in, size_t inlen,
 extern size_t base64_encode_alloc (const char *in, size_t inlen, char **out);
 
 extern void base64_decode_ctx_init (struct base64_decode_context *ctx);
-extern bool base64_decode (struct base64_decode_context *ctx,
-			   const char *restrict in, size_t inlen,
-			   char *restrict out, size_t *outlen);
 
-extern bool base64_decode_alloc

Re: Feature request - base64 Filename Safe Alphabet

2008-05-05 Thread Bo Borgerson
Simon Josefsson wrote:
 Your patch is rather difficult to read for me, since I'm not that
 familiar with the coreutils changes, and more importantly: to be applied
 to gnulib, I need a patch against gnulib.


Hi Simon,

Thanks for looking at this.


 Would you mind creating a patchset that applies to the gnulib git
 repository?


Not at all.

It wasn't very easy to read as a single revision, so I did it in two
steps.  The first step is pure addition: New functions and a definition
of the decode context structure.  The second step is still not the most
legible diff, but it should be a little easier to get your bearings in.


 I suspect your patch do things the way I suggested in the post to the
 gnulib list some time ago, which is nice.


Yes, I think so, at least in terms of interface.


Thanks again,

Bo
From 3a9bdc6228eba0645bb482f88502bdf19aff609f Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 5 May 2008 10:54:31 -0400
Subject: [PATCH] A coreutils compatible base64 - part 1

* lib/base64.c (get_4): Get four non-newline characters from the input buffer.
Use the context structure's buffer to create a contiguous block if necessary.
Currently unused.
(decode_4): Helper function to be used by base64_decode_ctx.  Currently unused.
(base64_decode_ctx_init): Initialize a decode context structure.
* lib/base64.h (struct base64_decode_context): To be used by base64_decode_ctx.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 lib/base64.c |  135 ++
 lib/base64.h |8 +++
 2 files changed, 143 insertions(+), 0 deletions(-)

diff --git a/lib/base64.c b/lib/base64.c
index f237cd6..40ae640 100644
--- a/lib/base64.c
+++ b/lib/base64.c
@@ -300,6 +300,141 @@ isbase64 (char ch)
   return uchar_in_range (to_uchar (ch)) && 0 <= b64[to_uchar (ch)];
 }
 
+/* Initialize decode-context buffer, CTX.  */
+void
+base64_decode_ctx_init (struct base64_decode_context *ctx)
+{
+  ctx->i = 0;
+}
+
+/* If CTX->i is 0 or 4, there are four or more bytes in [*IN..IN_END), and
+   none of those four is a newline, then return *IN.  Otherwise, copy up to
+   4 - CTX->i non-newline bytes from that range into CTX->buf, starting at
+   index CTX->i and setting CTX->i to reflect the number of bytes copied,
+   and return CTX->buf.  In either case, advance *IN to point to the byte
+   after the last one processed, and set *N_NON_NEWLINE to the number of
+   verified non-newline bytes accessible through the returned pointer.  */
+static inline char *
+get_4 (struct base64_decode_context *ctx,
+   char const *restrict *in, char const *restrict in_end,
+   size_t *n_non_newline)
+{
+  if (ctx->i == 4)
+    ctx->i = 0;
+
+  if (ctx->i == 0)
+    {
+      char const *t = *in;
+      if (4 <= in_end - *in && memchr (t, '\n', 4) == NULL)
+	{
+	  /* This is the common case: no newline.  */
+	  *in += 4;
+	  *n_non_newline = 4;
+	  return (char *) t;
+	}
+}
+
+  {
+/* Copy non-newline bytes into BUF.  */
+char const *p = *in;
+    while (p < in_end)
+  {
+	char c = *p++;
+	if (c != '\n')
+	  {
+	    ctx->buf[ctx->i++] = c;
+	    if (ctx->i == 4)
+	  break;
+	  }
+  }
+
+*in = p;
+*n_non_newline = ctx-i;
+return ctx-buf;
+  }
+}
+
+#define return_false\
+  do		\
+{		\
+  *outp = out;\
+  return false;\
+}		\
+  while (false)
+
+/* Decode up to four bytes of base64-encoded data, IN, of length INLEN
+   into the output buffer, *OUT, of size *OUTLEN bytes.  Return true if
+   decoding is successful, false otherwise.  If *OUTLEN is too small,
+   as many bytes as possible are written to *OUT.  On return, advance
+   *OUT to point to the byte after the last one written, and decrement
+   *OUTLEN to reflect the number of bytes remaining in *OUT.  */
+static inline bool
+decode_4 (char const *restrict in, size_t inlen,
+	  char *restrict *outp, size_t *outleft)
+{
+  char *out = *outp;
+  if (inlen < 2)
+return false;
+
+  if (!isbase64 (in[0]) || !isbase64 (in[1]))
+return false;
+
+  if (*outleft)
+{
+      *out++ = ((b64[to_uchar (in[0])] << 2)
+		| (b64[to_uchar (in[1])] >> 4));
+  --*outleft;
+}
+
+  if (inlen == 2)
+return_false;
+
+  if (in[2] == '=')
+{
+  if (inlen != 4)
+	return_false;
+
+  if (in[3] != '=')
+	return_false;
+}
+  else
+{
+  if (!isbase64 (in[2]))
+	return_false;
+
+  if (*outleft)
+	{
+	  *out++ = (((b64[to_uchar (in[1])] << 4) & 0xf0)
+		| (b64[to_uchar (in[2])] >> 2));
+	  --*outleft;
+	}
+
+  if (inlen == 3)
+	return_false;
+
+  if (in[3] == '=')
+	{
+	  if (inlen != 4)
+	return_false;
+	}
+  else
+	{
+	  if (!isbase64 (in[3]))
+	return_false;
+
+	  if (*outleft)
+	{
+	  *out++ = (((b64[to_uchar (in[2])] << 6) & 0xc0)
+			| b64[to_uchar (in[3])]);
+	  --*outleft;
+	}
+	}
+}
+
+  *outp = out;
+  return true;
+}
+
 /* Decode base64 encoded input array IN of length INLEN to output
array OUT that can hold

[PATCH] base64: remove some unused/redundant getopt code

2008-05-05 Thread Bo Borgerson
Hi,

I noticed these when I was poking around in base64 recently.  Looks like
they're vestigial.

Bo
From 29288df82cd764b384bcd6535925c82ffca8ffc6 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 5 May 2008 21:58:28 -0400
Subject: [PATCH] base64: remove some unused/redundant getopt code

* src/base64.c (struct option long_option): Remove redundant help/version
option items.
(main): Remove unused 'q' from short options.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 src/base64.c |4 +---
 1 files changed, 1 insertions(+), 3 deletions(-)

diff --git a/src/base64.c b/src/base64.c
index 983b8cb..4a7e51f 100644
--- a/src/base64.c
+++ b/src/base64.c
@@ -44,8 +44,6 @@ static const struct option long_options[] = {
   {"decode", no_argument, 0, 'd'},
   {"wrap", required_argument, 0, 'w'},
   {"ignore-garbage", no_argument, 0, 'i'},
-  {"help", no_argument, 0, GETOPT_HELP_CHAR},
-  {"version", no_argument, 0, GETOPT_VERSION_CHAR},
 
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -257,7 +255,7 @@ main (int argc, char **argv)
 
   atexit (close_stdout);
 
-  while ((opt = getopt_long (argc, argv, "dqiw:", long_options, NULL)) != -1)
+  while ((opt = getopt_long (argc, argv, "diw:", long_options, NULL)) != -1)
 switch (opt)
   {
   case 'd':
-- 
1.5.4.3



Re: [PATCH] Add new program: psub

2008-05-03 Thread Bo Borgerson
Bo Borgerson wrote:
 Hi,
 
 This program uses the temporary fifo management system that I built
 for zargs to provide generic process substitution for arguments to a
 sub-command.
 
 This program has some advantages over the process substitution built
 into some shells (bash, zsh, ksh, ???):
 
 1. It doesn't rely on having a shell that supports built-in process
 substitution.
 2. By using descriptively named temporary fifos it allows programs
 that include filenames in output or diagnostic messages to provide
 more useful information than with '/dev/fd/*' inputs.
 3. It supports `--files0-from=F' style argument passing, as well.
 
 Also available for fetch at:
 
 $ git fetch git://repo.or.cz/coreutils/bo.git psub:psub
 

Hi,

I'd like to share another use for this tool.

As discussed previously, there is a performance penalty when using `sort
-m' with more than a certain number of inputs (NMERGE).  Beyond that
limit, temporary files are used, which increases both I/O and CPU cost.

One way to avoid this extra cost is to increase NMERGE.  Another would
be to use tributary processes that each merge a subset of inputs and
feed into the main merge.  This has a potential added advantage on
multi-processor machines of spreading the workload among processors.

In the following example I have 32 inputs (named `0'..`31') each with
1048576 records.  Each record is a single character and there are
obviously large contiguous blocks of identical records.  NMERGE is 16
(the default).


$ time sort -m *

real    0m9.107s
user    0m6.380s
sys     0m0.300s

$ time for i in 012 3456 789; do echo $i | sed 's/.*/sort -mu
*\[&\]/'; done | xargs psub sort -m

real    0m3.792s
user    0m3.744s
sys     0m0.052s


And just to give a sense of how that breaks down:


$ for i in 012 3456 789; do echo $i | sed 's/.*/ls *\[&\]/'; done |
xargs psub wc -l
 11 /tmp/psubsUegiv/ls *[012]
 12 /tmp/psubsUegiv/ls *[3456]
  9 /tmp/psubsUegiv/ls *[789]
 32 total


With longer records and no identical records in a given input the
benefit of spreading the work across processors becomes more apparent.
The following is with 64 files with 262144 records each.  Each record is
4 characters long.  I have a Core 2 Duo.


$ time sort -m *

real    0m13.183s
user    0m12.793s
sys     0m0.376s

$ time for i in 01 23 45 67 89; do echo $i | sed 's/.*/sort -mu
*\[&\]/'; done | xargs psub sort -m

real    0m6.660s
user    0m12.401s
sys     0m0.168s

$ for i in 01 23 45 67 89; do echo $i | sed 's/.*/ls *\[&\]/'; done |
xargs psub wc -l
 14 /tmp/psubG0UkXb/ls *[01]
 14 /tmp/psubG0UkXb/ls *[23]
 12 /tmp/psubG0UkXb/ls *[45]
 12 /tmp/psubG0UkXb/ls *[67]
 12 /tmp/psubG0UkXb/ls *[89]
 64 total


The multi-process benefit is amplified on machines with more available
processors.  With the current trend of increasing numbers of on-die
processor cores I think this sort of easy technique for taking advantage
of concurrency is going to become more broadly beneficial.

Thanks,

Bo




Re: coreutils test coverage

2008-04-30 Thread Bo Borgerson
Daniel Dunbar wrote:
 Here is the process I use for generating those results. First, generate the 
 coverage information:


Thanks, that worked like a charm!

I've attached a patch that puts your instructions into the HACKING file.

I used a `.lcov' extension for the lcov output files instead of `.info',
since that extension is already used in the doc/ directory for a
different file format.

One nice further addition would be to have `make clean' also remove the
generated `.gcda' and `.gcno' files, but I'm going to show my
inexperience here and say I don't know how to do that safely. :)


 I also have an additional script which munges the lcov output to make the 
 tables
 sortable but this is pretty gross. When I get a chance I would prefer to push 
 this
 back to lcov as an extra option, although if you really want it I can pass it 
 on.


Yeah, that's probably better to send back upstream to lcov.  I'm sure
they'll appreciate that, and when it works its way down onto coreutils
hackers' boxes they'll appreciate it, too. :)

Thanks again!

Bo
From decc65cb8f2608743ae906cf4479dd084219ae5d Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 30 Apr 2008 08:49:59 -0400
Subject: [PATCH] Add Daniel Dunbar's lcov instructions to HACKING

* HACKING: New section `Finding things to do', points to TODO file and
gives instructions on generating an html coverage report as provided by
Daniel Dunbar.
* TODO: Add item for improving test coverage.  Point back to HACKING.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 HACKING |   27 +++
 TODO|3 +++
 2 files changed, 30 insertions(+), 0 deletions(-)

diff --git a/HACKING b/HACKING
index b40ff00..c33bdd3 100644
--- a/HACKING
+++ b/HACKING
@@ -317,3 +317,30 @@ Miscellaneous useful git commands
   * git rebase -i master: run this from on a branch, and it gives
   you an interface with which you can reorder and modify arbitrary
   change sets on that branch.
+
+---
+
+Finding things to do
+
+If you don't know where to start, check out the TODO file for projects
+that look like they're at your skill-/interest-level.  Another good
+option is always to improve tests.  You never know what you might
+uncover when you improve test coverage, and even if you don't find
+any bugs your contribution is sure to be appreciated.
+
+A good way to quickly assess current test coverage is to use lcov
+to generate HTML coverage reports.  Follow these steps:
+
+  # configure with coverage information
+  ./configure CFLAGS="-g -fprofile-arcs -ftest-coverage"
+  make
+  # run whatever tests you want, i.e.:
+  make check
+  # run lcov
+  lcov -t coreutils -q -d lib -b lib -o lib.lcov -c
+  lcov -t coreutils -q -d src -b src -o src.lcov -c
+  # generate HTML from the output
+  genhtml -p `pwd` -t coreutils -q --output-directory lcov-html *.lcov
+
+Then just open the index.html file (in the generated lcov-html directory)
+in your favorite web browser.
diff --git a/TODO b/TODO
index 86320b9..bda8de2 100644
--- a/TODO
+++ b/TODO
@@ -106,6 +106,9 @@ Remove suspicious uses of alloca (ones that may allocate more than
 Adapt these contribution guidelines for coreutils:
   http://sources.redhat.com/automake/contribute.html
 
+Improve test coverage.
+  See HACKING for instructions on generating an html test coverage report.
+  Find a program that has poor coverage and improve.
 
 Changes expected to go in, someday.
 ==
-- 
1.5.4.3



Re: Feature request - base64 Filename Safe Alphabet

2008-04-30 Thread Bo Borgerson
Jim Meyering wrote:
 Beware:
 there are two versions of base64.c.
 The one in gnulib and another in coreutils/gl/lib.
 
 Simon and I have been thinking about how to merge these
 two for some time, but I haven't found time since our last exchange.
 
 Volunteers welcome ;-)


Hi,

This is an attempt at making a base64.c that supports the context
structure for coreutils but still presents a four-argument decode
interface for gnulib.

It doesn't address the differences in newline handling, and it's
definitely less efficient for four-argument decode calls.  Is this the
direction you were thinking for a merge of the two?

Thanks,

Bo
From e63ed95710560a7da7f4fd681add4f0e8172bc7a Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 30 Apr 2008 17:40:38 -0400
Subject: [PATCH] A step toward an upstream compatible base64

* gl/lib/base64.c (base64_decode_ctx): If no context structure was passed in,
initialize a local one and use it.  Be sure to flush.  Formerly base64_decode.
(base64_decode_alloc_ctx): Formerly base64_decode_alloc.
* gl/lib/base64.h (base64_decode): Macro for four-argument calls.
(base64_decode_alloc): Likewise.
* src/base64.c (do_decode): Call base64_decode_ctx instead of base64_decode.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 gl/lib/base64.c |   22 +++---
 gl/lib/base64.h |   19 +--
 src/base64.c|2 +-
 3 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/gl/lib/base64.c b/gl/lib/base64.c
index 43f12c6..4a79eef 100644
--- a/gl/lib/base64.c
+++ b/gl/lib/base64.c
@@ -452,12 +452,20 @@ decode_4 (char const *restrict in, size_t inlen,
bytes spans two input buffers.  */
 
 bool
-base64_decode (struct base64_decode_context *ctx,
-	   const char *restrict in, size_t inlen,
-	   char *restrict out, size_t *outlen)
+base64_decode_ctx (struct base64_decode_context *ctx,
+		   const char *restrict in, size_t inlen,
+		   char *restrict out, size_t *outlen)
 {
   size_t outleft = *outlen;
   bool flush_ctx = inlen == 0;
+  struct base64_decode_context local_ctx;
+
+  if (ctx == NULL)
+{
+  ctx = &local_ctx;
+  base64_decode_ctx_init (ctx);
+  flush_ctx = true;
+}
 
   while (true)
 {
@@ -529,9 +537,9 @@ base64_decode (struct base64_decode_context *ctx,
input was invalid, in which case *OUT is NULL and *OUTLEN is
undefined. */
 bool
-base64_decode_alloc (struct base64_decode_context *ctx,
-		 const char *in, size_t inlen, char **out,
-		 size_t *outlen)
+base64_decode_alloc_ctx (struct base64_decode_context *ctx,
+			 const char *in, size_t inlen, char **out,
+			 size_t *outlen)
 {
   /* This may allocate a few bytes too many, depending on input,
  but it's not worth the extra CPU time to compute the exact size.
@@ -544,7 +552,7 @@ base64_decode_alloc (struct base64_decode_context *ctx,
   if (!*out)
 return true;
 
-  if (!base64_decode (ctx, in, inlen, *out, &needlen))
+  if (!base64_decode_ctx (ctx, in, inlen, *out, &needlen))
 {
   free (*out);
   *out = NULL;
diff --git a/gl/lib/base64.h b/gl/lib/base64.h
index ba436e0..fa242c8 100644
--- a/gl/lib/base64.h
+++ b/gl/lib/base64.h
@@ -42,12 +42,19 @@ extern void base64_encode (const char *restrict in, size_t inlen,
 extern size_t base64_encode_alloc (const char *in, size_t inlen, char **out);
 
 extern void base64_decode_ctx_init (struct base64_decode_context *ctx);
-extern bool base64_decode (struct base64_decode_context *ctx,
-			   const char *restrict in, size_t inlen,
-			   char *restrict out, size_t *outlen);
 
-extern bool base64_decode_alloc (struct base64_decode_context *ctx,
- const char *in, size_t inlen,
- char **out, size_t *outlen);
+extern bool base64_decode_ctx (struct base64_decode_context *ctx,
+			   const char *restrict in, size_t inlen,
+			   char *restrict out, size_t *outlen);
+
+extern bool base64_decode_alloc_ctx (struct base64_decode_context *ctx,
+ const char *in, size_t inlen,
+ char **out, size_t *outlen);
+
+#define base64_decode(in, inlen, out, outlen) \
+	base64_decode_ctx (NULL, in, inlen, out, outlen)
+
+#define base64_decode_alloc(in, inlen, out, outlen) \
+	base64_decode_alloc_ctx (NULL, in, inlen, out, outlen)
 
 #endif /* BASE64_H */
diff --git a/src/base64.c b/src/base64.c
index aa2fc8f..983b8cb 100644
--- a/src/base64.c
+++ b/src/base64.c
@@ -223,7 +223,7 @@ do_decode (FILE *in, FILE *out, bool ignore_garbage)
 	  if (k == 1 && ctx.i == 0)
 	break;
 	  n = BLOCKSIZE;
-	  ok = base64_decode (&ctx, inbuf, (k == 0 ? sum : 0), outbuf, &n);
+	  ok = base64_decode_ctx (&ctx, inbuf, (k == 0 ? sum : 0), outbuf, &n);
 
 	  if (fwrite (outbuf, 1, n, out) < n)
 	error (EXIT_FAILURE, errno, _("write error"));
-- 
1.5.4.3

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Feature request - base64 Filename Safe Alphabet

2008-04-30 Thread Bo Borgerson
Jim Meyering wrote:
   http://thread.gmane.org/gmane.comp.lib.gnulib.bugs/8670/focus=12523
 
 Sorry I didn't dig that up initially.
 Since his packages are the main consumer other than coreutils,
 if you make him happy, I'll probably be happy, too ;-)


Ah, thanks, that helps put things in context.

It shouldn't be too hard to make the four-argument decode calls choke on
newlines in a backwards compatible way.  I'll look into it and submit an
updated patch.

Thanks,

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Feature request - base64 Filename Safe Alphabet

2008-04-30 Thread Bo Borgerson
Jim Meyering wrote:
 if you make him happy, I'll probably be happy, too ;-)


Hi Simon,

This is an attempt to merge the coreutils and gnulib base64 libraries.

My goal is to preserve the gnulib interface and behavior while also
supporting the coreutils extensions.

This version of the patch should have good performance in both cases, as
well.

Please let me know if this meets your requirements.

Thanks,

Bo
From a302f7beca7d0e2bfcb7770ff31947e3d2965db2 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 30 Apr 2008 17:40:38 -0400
Subject: [PATCH] An upstream compatible base64

* gl/lib/base64.c (base64_decode_ctx): If no context structure was passed in,
treat newlines as garbage (this is the historical behavior).  Formerly
base64_decode.
(base64_decode_alloc_ctx): Formerly base64_decode_alloc.
* gl/lib/base64.h (base64_decode): Macro for four-argument calls.
(base64_decode_alloc): Likewise.
* src/base64.c (do_decode): Call base64_decode_ctx instead of base64_decode.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 gl/lib/base64.c |   45 +++--
 gl/lib/base64.h |   19 +--
 src/base64.c|2 +-
 3 files changed, 45 insertions(+), 21 deletions(-)

diff --git a/gl/lib/base64.c b/gl/lib/base64.c
index 43f12c6..bfe4ad2 100644
--- a/gl/lib/base64.c
+++ b/gl/lib/base64.c
@@ -449,20 +449,32 @@ decode_4 (char const *restrict in, size_t inlen,
Initially, CTX must have been initialized via base64_decode_ctx_init.
Subsequent calls to this function must reuse whatever state is recorded
in that buffer.  It is necessary for when a quadruple of base64 input
-   bytes spans two input buffers.  */
+   bytes spans two input buffers.
+
+   If CTX is NULL then newlines are treated as garbage and the input
+   buffer is processed as a unit.  */
 
 bool
-base64_decode (struct base64_decode_context *ctx,
-	   const char *restrict in, size_t inlen,
-	   char *restrict out, size_t *outlen)
+base64_decode_ctx (struct base64_decode_context *ctx,
+		   const char *restrict in, size_t inlen,
+		   char *restrict out, size_t *outlen)
 {
   size_t outleft = *outlen;
-  bool flush_ctx = inlen == 0;
+  bool strict_newlines = ctx == NULL;
+  bool flush_ctx = false;
+  unsigned int ctx_i = 0;
+
+  if (!strict_newlines)
+{
+  ctx_i = ctx->i;
+  flush_ctx = inlen == 0;
+}
+
 
   while (true)
 {
   size_t outleft_save = outleft;
-  if (ctx->i == 0 && !flush_ctx)
+  if (ctx_i == 0 && !flush_ctx)
 	{
 	  while (true)
 	{
@@ -482,7 +494,7 @@ base64_decode (struct base64_decode_context *ctx,
 
   /* Handle the common case of 72-byte wrapped lines.
 	 This also handles any other multiple-of-4-byte wrapping.  */
-  if (inlen && *in == '\n')
+  if (inlen && *in == '\n' && !strict_newlines)
 	{
 	  ++in;
 	  --inlen;
@@ -495,12 +507,17 @@ base64_decode (struct base64_decode_context *ctx,
 
   {
 	char const *in_end = in + inlen;
-	char const *non_nl = get_4 (ctx, &in, in_end, &inlen);
+	char const *non_nl;
+
+	if (strict_newlines)
+	  non_nl = in;  /* Might have nl in this case. */
+	else
+	  non_nl = get_4 (ctx, &in, in_end, &inlen);
 
 	/* If the input is empty or consists solely of newlines (0 non-newlines),
 	   then we're done.  Likewise if there are fewer than 4 bytes when not
-	   flushing context.  */
-	if (inlen == 0 || (inlen < 4 && !flush_ctx))
+	   flushing context and not treating newlines as garbage.  */
+	if (inlen == 0 || (inlen < 4 && !flush_ctx && !strict_newlines))
 	  {
 	inlen = 0;
 	break;
@@ -529,9 +546,9 @@ base64_decode (struct base64_decode_context *ctx,
input was invalid, in which case *OUT is NULL and *OUTLEN is
undefined. */
 bool
-base64_decode_alloc (struct base64_decode_context *ctx,
-		 const char *in, size_t inlen, char **out,
-		 size_t *outlen)
+base64_decode_alloc_ctx (struct base64_decode_context *ctx,
+			 const char *in, size_t inlen, char **out,
+			 size_t *outlen)
 {
   /* This may allocate a few bytes too many, depending on input,
  but it's not worth the extra CPU time to compute the exact size.
@@ -544,7 +561,7 @@ base64_decode_alloc (struct base64_decode_context *ctx,
   if (!*out)
 return true;
 
-  if (!base64_decode (ctx, in, inlen, *out, &needlen))
+  if (!base64_decode_ctx (ctx, in, inlen, *out, &needlen))
 {
   free (*out);
   *out = NULL;
diff --git a/gl/lib/base64.h b/gl/lib/base64.h
index ba436e0..fa242c8 100644
--- a/gl/lib/base64.h
+++ b/gl/lib/base64.h
@@ -42,12 +42,19 @@ extern void base64_encode (const char *restrict in, size_t inlen,
 extern size_t base64_encode_alloc (const char *in, size_t inlen, char **out);
 
 extern void base64_decode_ctx_init (struct base64_decode_context *ctx);
-extern bool base64_decode (struct base64_decode_context *ctx,
-			   const char *restrict in, size_t inlen,
-			   char *restrict out, size_t *outlen);
 
-extern bool base64_decode_alloc (struct base64_decode_context *ctx,
- const

Re: Feature request - base64 Filename Safe Alphabet

2008-04-29 Thread Bo Borgerson
Christopher Kerr wrote:
 After being burned by using `head -c6 /dev/urandom | base64` as part of a 
 directory name, I realised that it would be useful if base64 had an option to 
 generate URL and Filename safe encodings, as specified in RFC 3548 section 4.
 
 This would make
 cat FILE | base64 --filename-safe
 equivalent to
 cat FILE | base64 | tr '+/' '-_'
 using the current coreutils tools.

Hi,

lib/base64.c looks fairly easy to pull apart so that current functions
base64_encode and base64_decode become wrappers around internal
functions that take an additional argument describing the alphabet.

New functions base64_encode_filesafe and base64_decode_filesafe could
then be added without breaking the pre-existing interface or duplicating
a lot of code.
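
Roughly the shape I have in mind, as a sketch only -- the simplified
signatures and the *_sketch names below are mine for illustration, not the
real lib/base64.c interface:

/* Sketch: one worker parameterized by alphabet, plus thin wrappers.
   OUT must have room for 4 * ((INLEN + 2) / 3) + 1 bytes.  */
#include <stddef.h>

static const char b64_std[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
static const char b64_filesafe[] =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_";

static void
encode_with_alphabet (const unsigned char *in, size_t inlen,
                      char *out, const char *alphabet)
{
  while (inlen >= 3)
    {
      *out++ = alphabet[in[0] >> 2];
      *out++ = alphabet[((in[0] & 0x03) << 4) | (in[1] >> 4)];
      *out++ = alphabet[((in[1] & 0x0f) << 2) | (in[2] >> 6)];
      *out++ = alphabet[in[2] & 0x3f];
      in += 3;
      inlen -= 3;
    }
  if (inlen)                    /* 1 or 2 trailing bytes need '=' padding */
    {
      *out++ = alphabet[in[0] >> 2];
      if (inlen == 1)
        {
          *out++ = alphabet[(in[0] & 0x03) << 4];
          *out++ = '=';
        }
      else
        {
          *out++ = alphabet[((in[0] & 0x03) << 4) | (in[1] >> 4)];
          *out++ = alphabet[(in[1] & 0x0f) << 2];
        }
      *out++ = '=';
    }
  *out = '\0';
}

/* The existing entry point keeps its behavior...  */
void
base64_encode_sketch (const unsigned char *in, size_t inlen, char *out)
{
  encode_with_alphabet (in, inlen, out, b64_std);
}

/* ...and the new one differs only in the table it passes.  */
void
base64_encode_filesafe_sketch (const unsigned char *in, size_t inlen, char *out)
{
  encode_with_alphabet (in, inlen, out, b64_filesafe);
}

The decode side would get the same treatment, with the wrapper choosing
between two reverse-lookup tables.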

B


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Feature request - base64 Filename Safe Alphabet

2008-04-29 Thread Bo Borgerson
Pádraig Brady wrote:
 Perhaps `tr '+/' '._'` would be better so that
 you don't need to worry about - at the start of a filename?


I think `.' at the beginning of a filename also has the potential to
give users unexpected behavior.

Bo



___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Feature request - base64 Filename Safe Alphabet

2008-04-29 Thread Bo Borgerson
Pádraig Brady wrote:
 tr '+/' '._' = hidden files
 tr '+/' '-_' = awkward option clashes
 tr '/' '_' = not POSIX portable
 
 ho hum, the awkward option clashes is probably best.

Yeah, there's no really ideal option, is there...

It almost might be nice to have a totally user-configurable alphabet.

Something like:

$ base64 --62=- --63=_


Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Feature request - base64 Filename Safe Alphabet

2008-04-29 Thread Bo Borgerson
Gabriel Barazer wrote:
 AFAIK, POSIX filenames allow any character except the slash character
 and the null byte.

 Especially when this is the RFC recommended translation. This would
 avoid confusing people with multiple translation sets and stick to the
 RFC (considered by many as the authoritative translation)
 
 it is very easy to escape a dash character, either manually (the tab key
 makes it very easy with some shells), or in scripts (all languages have
 a shell escape function).
 
 IMHO this is a bad idea because this would confuse even more people
 trying to use it. We could end up with dozens of incompatible, non-portable
 shell scripts, with none using the same translation set.
 
 A totally user-configurable alphabet is always possible with base64 |
 tr which is designed to do that.


Yes, you're absolutely right.  All very good points.

I do still think the original poster's suggestion of a `--filename-safe'
option is worth considering.  As you mentioned the inclusion of such a
base64 alphabet in the RFC means it's likely to be a widely accepted
alternative.
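
In the meantime, anyone who wants the RFC 3548 filename-safe alphabet today
can get it with tr (a sketch -- FILE is a placeholder, and --filename-safe
itself is still only a proposal):

$ base64 FILE | tr '+/' '-_' > FILE.b64          # encode, filename-safe alphabet
$ tr '-_' '+/' < FILE.b64 | base64 --decode      # decode via the standard alphabet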


Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: coreutils test coverage

2008-04-29 Thread Bo Borgerson
Jim Meyering wrote:
 If you're reading this list, you probably noticed that some kind
 souls at Stanford uncovered a surprising number of bugs in coreutils
 recently.  Part of their analysis was coverage-related, and they
 produced these coverage reports:
 
 http://keeda.stanford.edu/~cristic/coreutils-dev-tests/src/
 
 In case anyone is interested in improving test coverage,
 that gives some obvious starting points.


How cool!

That's a really useful tool.  I wonder if it might be possible to
include some instructions for producing a coverage report like that in
the project somewhere... maybe in the HACKING file?

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: FYI, 11 mostly-test-related patches

2008-04-28 Thread Bo Borgerson
\n,  1\n,  0],
  -['a9', '-l',  x\ny\n, 2\n,  0],
  -['b0', '',,   0 0 0\n,  0],
  -['b1', '',a b\nc\n,   2 3 6\n,  0],
  -['c0', '-L',  1\n12\n,2\n,  0],
  -['c1', '-L',  1\n123\n1\n,3\n,  0],
  -['c2', '-L',  \n123456,   6\n,  0],
  -);
  -
  -sub test_vector
  -{
  -  my $t;
  -  foreach $t (@tv)
  -{
  -  my ($test_name, $flags, $in, $exp, $ret) = @$t;
  -  # By default, test both stdin-redirection and input from a pipe.
  -  $Test::input_via{$test_name} = {REDIR = 0, PIPE = 0};
  -
  -  # But if test name ends with `-file', test only with file arg(s).
  -  # FIXME: unfortunately, invoking wc like `wc FILE' makes it put
  -  # FILE in the ouput -- and FILE is different depending on $srcdir.
  -  $Test::input_via{$test_name} = {FILE = 0}
  -if $test_name =~ /-file$/;
  -
  -  # Now that `wc FILE' (note, with no options) produces results
  -  # different from `cat FILE|wc', disable those two `PIPE' tests.
  -  $flags eq ''
  -   and delete $Test::input_via{$test_name}-{PIPE};
  -}
  -
  -  return @tv;
  -}
  -
  -1;
  --
  1.5.5.1.68.gbdcd8


  ___
  Bug-coreutils mailing list
  Bug-coreutils@gnu.org
  http://lists.gnu.org/mailman/listinfo/bug-coreutils

From dd8e78633f60a4a266b870326ac87d9844dab02b Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 28 Apr 2008 10:30:22 -0400
Subject: [PATCH] Only cleanup test dirs from the process that created them.

* tests/CuTmpdir.pm (import): Use closure around current PID to avoid cleanup races.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 tests/CuTmpdir.pm |   27 ++-
 1 files changed, 14 insertions(+), 13 deletions(-)

diff --git a/tests/CuTmpdir.pm b/tests/CuTmpdir.pm
index f9d2c00..84312a0 100644
--- a/tests/CuTmpdir.pm
+++ b/tests/CuTmpdir.pm
@@ -52,18 +52,6 @@ sub chmod_tree
   find ($options, '.');
 }
 
-sub on_sig_remove_tmpdir
-{
-  my ($sig) = @_;
-  if (defined $dir)
-{
-  chmod_tree;
-  File::Temp::cleanup;
-}
-  $SIG{$sig} = 'DEFAULT';
-  kill $sig, $$;
-}
-
 sub import {
   my $prefix = $_[1];
 
@@ -82,9 +70,22 @@ sub import {
 or skip_test $prefix;
   $prefix = $1;
 
+  my $original_pid = $$;
+
+  my $on_sig_remove_tmpdir = sub {
+my ($sig) = @_;
+if ($$ == $original_pid and defined $dir)
+  {
+	chmod_tree;
+	File::Temp::cleanup;
+  }
+$SIG{$sig} = 'DEFAULT';
+kill $sig, $$;
+  };
+
   foreach my $sig (qw (INT TERM HUP))
 {
-  $SIG{$sig} = \&on_sig_remove_tmpdir;
+  $SIG{$sig} = $on_sig_remove_tmpdir;
 }
 
   $dir = File::Temp::tempdir($prefix.tmp-, CLEANUP = 1 );
-- 
1.5.4.3

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: FYI, 11 mostly-test-related patches

2008-04-28 Thread Bo Borgerson
Jim Meyering wrote:
 Bo Borgerson [EMAIL PROTECTED] wrote:
 I think File::Temp does this internally as well, but it looks like
 chmod_tree will just warn about the failed chdir and procede to
 recursively chmod whatever directory it was in at the time if $dir is
 yanked out from under it.
 
 Yes.  Good catch.  It should obviously skip the find in that case.
 Want to write the patch?


Sure, I think this should do it.

Thanks,

Bo
From 769e662c3643c0b3dc21e96dec2e2f1cd481fb70 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 28 Apr 2008 13:11:26 -0400
Subject: [PATCH] tests: don't chmod after a failed chdir in cleanup

* tests/CuTmpdir.pm (chmod_tree): Don't chmod if chdir failed.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 tests/CuTmpdir.pm |   15 ++-
 1 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/tests/CuTmpdir.pm b/tests/CuTmpdir.pm
index 84312a0..e21306a 100644
--- a/tests/CuTmpdir.pm
+++ b/tests/CuTmpdir.pm
@@ -45,11 +45,16 @@ sub chmod_1
 
 sub chmod_tree
 {
-  chdir $dir
-or warn "$ME: failed to chdir to $dir: $!\n";
-  # Perform the equivalent of find . -type d -print0|xargs -0 chmod -R 700.
-  my $options = {untaint => 1, wanted => \&chmod_1};
-  find ($options, '.');
+  if (chdir $dir)
+{
+  # Perform the equivalent of find . -type d -print0|xargs -0 chmod -R 700.
+  my $options = {untaint => 1, wanted => \&chmod_1};
+  find ($options, '.');
+}
+  else
+{
+  warn "$ME: failed to chdir to $dir: $!\n";
+}
 }
 
 sub import {
-- 
1.5.4.3

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: FYI, 11 mostly-test-related patches

2008-04-28 Thread Bo Borgerson
Eric Blake wrote:
 I'm now seeing this when running ./bootstrap:
 
 ./bootstrap: m4/xsize.m4 overrides ._bootmp2/m4/xsize.m4
 Undefined subroutine Test::test_vector called at tests/mk-script line 44.
 ./bootstrap: aclocal --force -I m4 ...
 
 and am not yet sure if it is from one of these patches or from an external 
 change of upgrading from perl 5.8.8 to 5.10.0.
 

Hmm... I'm not seeing that (still on 5.8.8 here).

But I _am_ seeing some warnings about trying to mess around in
`tests/wc', which doesn't seem to exist anymore.


./bootstrap: cp -f ._bootmp2/po/remove-potcdate.sin po/remove-potcdate.sin
./bootstrap: 537: cannot create tests/wc/Makefile.amt: Directory nonexistent
./bootstrap: 537: cannot create tests/wc/Makefile.amt: Directory nonexistent
./bootstrap: 537: cannot create tests/wc/Makefile.amt: Directory nonexistent
./bootstrap: 537: cannot create tests/wc/Makefile.amt: Directory nonexistent
chmod: cannot access `tests/wc/Makefile.amt': No such file or directory
mv: cannot stat `tests/wc/Makefile.amt': No such file or directory
./bootstrap: aclocal --force -I m4 ...




I've attached a patch that addresses this.

Thanks,

Bo
From 07cf6a226f75aa9e91060d657fdfebff6e74a618 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 28 Apr 2008 14:58:51 -0400
Subject: [PATCH] Remove references to tests/wc from bootstrap

* bootstrap: Don't try to initialize anything in tests/wc.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 bootstrap |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/bootstrap b/bootstrap
index 1274af2..c4b9b62 100755
--- a/bootstrap
+++ b/bootstrap
@@ -535,7 +535,7 @@ fi
 mam_template=tests/Makefile.am.in
 if test -f $mam_template; then
   PERL=perl
-  for tool in cut head join pr sort tac tail test tr uniq wc; do
+  for tool in cut head join pr sort tac tail test tr uniq; do
 m=tests/$tool/Makefile.am
 t=${m}t
 rm -f $m $t
-- 
1.5.4.3

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Add new program: renice

2008-04-27 Thread Bo Borgerson
Hi,

This is a basic implementation of the `renice' utility:

http://www.opengroup.org/onlinepubs/95399/utilities/renice.html

It supports both the syntax described above for relative niceness
adjustment and also the syntax for absolute niceness modification used
by the util-linux-ng `renice'.

Unlike both of the above this version also provides long forms for each option.
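
For example, both of these forms are accepted (the PID and user name are
just placeholders):

$ renice -n 5 -p 1234      # POSIX-style relative adjustment: add 5 to PID 1234's niceness
$ renice 10 -u daemon      # util-linux-style absolute form: set daemon's processes to 10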

Please see attached or fetch at:

$ git fetch git://repo.or.cz/coreutils/bo.git renice:renice

Thanks,

Bo
From c4717c8b0add43a67a8ff052b458306230259e11 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Fri, 25 Apr 2008 18:58:08 -0400
Subject: [PATCH] Add new program: renice

* doc/coreutils.texi: Explain new program.
* man/renice.x: Manpage template for new program.
* po/POTFILES.in: List new program.
* src/Makefile.am: List new program.
* src/renice.c: New program.
* tests/Makefile.am: List test for new program.
* tests/misc/renice: Tests for new program.
* AUTHORS: Register as author.
* README: List new program.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 AUTHORS|1 +
 README |2 +-
 doc/coreutils.texi |   72 +-
 man/renice.x   |6 +
 po/POTFILES.in |1 +
 src/Makefile.am|2 +-
 src/renice.c   |  412 
 tests/Makefile.am  |1 +
 tests/misc/renice  |   93 
 9 files changed, 587 insertions(+), 3 deletions(-)
 create mode 100644 man/renice.x
 create mode 100644 src/renice.c
 create mode 100755 tests/misc/renice

diff --git a/AUTHORS b/AUTHORS
index 807857f..1840f0c 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -61,6 +61,7 @@ printf: David MacKenzie
 ptx: F. Pinard
 pwd: Jim Meyering
 readlink: Dmitry V. Levin
+renice: Bo Borgerson
 rm: Paul Rubin, David MacKenzie, Richard Stallman, Jim Meyering
 rmdir: David MacKenzie
 runcon: Russell Coker
diff --git a/README b/README
index 7a608f4..3bfd4e9 100644
--- a/README
+++ b/README
@@ -11,7 +11,7 @@ The programs that can be built with this package are:
   csplit cut date dd df dir dircolors dirname du echo env expand expr
   factor false fmt fold groups head hostid hostname id install join kill
   link ln logname ls md5sum mkdir mkfifo mknod mktemp mv nice nl nohup
-  od paste pathchk pinky pr printenv printf ptx pwd readlink rm rmdir
+  od paste pathchk pinky pr printenv printf ptx pwd readlink renice rm rmdir
   runcon seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf
   sleep sort split stat stty su sum sync tac tail tee test touch tr true
   tsort tty uname unexpand uniq unlink uptime users vdir wc who whoami yes
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f42e736..17f993f 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -92,6 +92,7 @@
 * ptx: (coreutils)ptx invocation.   Produce permuted indexes.
 * pwd: (coreutils)pwd invocation.   Print working directory.
 * readlink: (coreutils)readlink invocation. Print referent of a symlink.
+* renice: (coreutils)renice invocation. Modify niceness of processes.
 * rm: (coreutils)rm invocation. Remove files.
 * rmdir: (coreutils)rmdir invocation.   Remove empty directories.
 * seq: (coreutils)seq invocation.   Print numeric sequences
@@ -191,7 +192,7 @@ Free Documentation License''.
 * User information::   id logname whoami groups users who
 * System context:: date uname hostname hostid
 * Modified command invocation::chroot env nice nohup su
-* Process control::kill
+* Process control::kill renice
 * Delaying::   sleep
 * Numeric operations:: factor seq
 * File permissions::   Access modes.
@@ -425,6 +426,7 @@ Modified command invocation
 Process control
 
 * kill invocation::  Sending a signal to processes.
+* renice invocation::Modify the niceness of processes.
 
 Delaying
 
@@ -13872,6 +13874,7 @@ might find this idea strange at first.
 
 @menu
 * kill invocation:: Sending a signal to processes.
+* renice invocation::   Modify the niceness of processes.
 @end menu
 
 
@@ -14024,6 +14027,72 @@ File size limit exceeded.
 also support at least eight real-time signals called @samp{RTMIN},
 @samp{RTMIN+1}, @dots{}, @samp{RTMAX-1}, @samp{RTMAX}.
 
[EMAIL PROTECTED] renice invocation
[EMAIL PROTECTED] @command{renice}: Modify the niceness of processes
+
[EMAIL PROTECTED] renice
[EMAIL PROTECTED] modify the niceness of processes
+
+The @command{renice} command adjusts the niceness of running processes,
+affecting process scheduling.  Nicenesses range from -20 (most favorable
+scheduling) to 19 (least favorable).
+
[EMAIL PROTECTED]
+renice -n @var{adjustment} [-p | -g | -u] @[EMAIL PROTECTED]
+renice @var{priority} [-p | -g | -u ] @[EMAIL PROTECTED]
[EMAIL PROTECTED] example
+
+The first form of the @command{renice

Re: [PATCH] Add new program: renice

2008-04-27 Thread Bo Borgerson
On Sun, Apr 27, 2008 at 5:16 PM, Jim Meyering [EMAIL PROTECTED] wrote:
   Thanks, but Bob Proulx has written one already,
   as hinted at in TODO.  I think it's nearly ready...
 

 Ah, so I see.  Oh well, it was a fun weekend project. :)

 Thanks,

 Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Add new program: psub

2008-04-27 Thread Bo Borgerson
Hi,

This program uses the temporary fifo management system that I built
for zargs to provide generic process substitution for arguments to a
sub-command.

This program has some advantages over the process substitution built
into some shells (bash, zsh, ksh, ???):

1. It doesn't rely on having a shell that supports built-in process
substitution.
2. By using descriptively named temporary fifos it allows programs
that include filenames in output or diagnostic messages to provide
more useful information than with '/dev/fd/*' inputs.
3. It supports `--files0-from=F' style argument passing, as well.

Examples:

Where in bash you might do:

$ uniq bigfile | tee >(wc -l > bigfile-uniq.count) | gzip -c > bigfile-uniq.gz

With psub you would do:

$ uniq bigfile | psub tee "wc -l > bigfile-uniq.count" | gzip -c > bigfile-uniq.gz


And where in bash you might do:

$ wc -l <(ls a) <(ls b)
  2 /dev/fd/63
  1 /dev/fd/62
  3 total

With psub you could do:

$ psub wc -l "ls a" "ls b"
  2 /tmp/psub25j3Al/ls a
  1 /tmp/psub25j3Al/ls b
  3 total

Or even:

$ find * -maxdepth 0 -type d | sed "s/^/ls /" | tr '\n' '\0' | psub wc -l --files0-from=-
  2 /tmp/psubSZT8xC/ls a
  5 /tmp/psubSZT8xC/ls autom4te.cache
  1 /tmp/psubSZT8xC/ls b
 20 /tmp/psubSZT8xC/ls build-aux
 13 /tmp/psubSZT8xC/ls doc
  3 /tmp/psubSZT8xC/ls gl
 19 /tmp/psubSZT8xC/ls gnulib
269 /tmp/psubSZT8xC/ls gnulib-tests
648 /tmp/psubSZT8xC/ls lib
299 /tmp/psubSZT8xC/ls m4
203 /tmp/psubSZT8xC/ls man
  3 /tmp/psubSZT8xC/ls old
121 /tmp/psubSZT8xC/ls po
345 /tmp/psubSZT8xC/ls src
 49 /tmp/psubSZT8xC/ls tests
   2000 total

The attached version doesn't include any `doc/' or `tests/' yet, but
should be functional enough to give a sense of usage.  If this looks
like something that might be a useful addition to coreutils I'll be
happy to write some documentation and tests and resubmit.

Also available for fetch at:

$ git fetch git://repo.or.cz/coreutils/bo.git psub:psub

Thanks,

Bo
From d61aff113e88ed72efc6d566b45fc487ab91481c Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 27 Apr 2008 20:11:39 -0400
Subject: [PATCH] Add new program: psub

* src/psub.c: New program to manage temporary fifos for pipelines.
* man/psub.x: Manfile template for new program.
* src/Makefile.am: List new program.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 man/psub.x  |4 +
 src/Makefile.am |2 +-
 src/psub.c  |  864 +++
 3 files changed, 869 insertions(+), 1 deletions(-)
 create mode 100644 man/psub.x
 create mode 100644 src/psub.c

diff --git a/man/psub.x b/man/psub.x
new file mode 100644
index 000..aaa7313
--- /dev/null
+++ b/man/psub.x
@@ -0,0 +1,4 @@
+[NAME]
+psub \- manage pipelines with temporary fifos
+[DESCRIPTION]
+.\" Add any additional description here
diff --git a/src/Makefile.am b/src/Makefile.am
index 668e178..225194b 100644
--- a/src/Makefile.am
+++ b/src/Makefile.am
@@ -37,7 +37,7 @@ EXTRA_PROGRAMS = \
   nl od paste pr ptx sha1sum sha224sum sha256sum sha384sum sha512sum \
   shuf sort split sum tac tail tr tsort unexpand uniq wc \
   basename date dirname echo env expr factor false \
-  id kill logname pathchk printenv printf pwd \
+  id kill logname pathchk printenv printf psub pwd \
   runcon seq sleep tee \
   test true tty whoami yes \
   base64
diff --git a/src/psub.c b/src/psub.c
new file mode 100644
index 000..1834d0c
--- /dev/null
+++ b/src/psub.c
@@ -0,0 +1,864 @@
+/* psub -- manage pipelines with temporary fifos
+   Copyright (C) 2008 Free Software Foundation, Inc.
+
+   This program is free software: you can redistribute it and/or modify
+   it under the terms of the GNU General Public License as published by
+   the Free Software Foundation, either version 3 of the License, or
+   (at your option) any later version.
+
+   This program is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+   GNU General Public License for more details.
+
+   You should have received a copy of the GNU General Public License
+   along with this program.  If not, see <http://www.gnu.org/licenses/>.  */
+
+
+/* psub - manage pipelines with temporary fifos
+
+   Written by Bo Borgerson.  */
+
+#include <config.h>
+#include <getopt.h>
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <signal.h>
+#include "system.h"
+#include "error.h"
+#include "long-options.h"
+#include "readtokens0.h"
+#include "quote.h"
+#include "xstrtol.h"
+
+#define PROGRAM_NAME "psub"
+
+#define AUTHORS "Bo Borgerson"
+
+#ifndef DEFAULT_TMPDIR
+# define DEFAULT_TMPDIR "/tmp"
+#endif
+
+/* Use SA_NOCLDSTOP as a proxy for whether the sigaction machinery is
+   present.  */
+#ifndef SA_NOCLDSTOP
+# define SA_NOCLDSTOP 0
+/* No sigprocmask.  Always 'return' zero. */
+# define sigprocmask(How, Set, Oset) (0)
+# define

Re: [PATCH] Use a hash rather than a linked-list for cycle check in cp

2008-04-22 Thread Bo Borgerson
On Tue, Apr 22, 2008 at 3:03 PM, Jim Meyering [EMAIL PROTECTED] wrote:

  Hi Bo,

  Thanks for that patch.

  However, let's see if Cai Xianchao and Li Zefan
  are still working on rewriting cp to use openat-style functions.

   http://thread.gmane.org/gmane.comp.gnu.coreutils.bugs/12041

  Because once cp is rewritten the way I outlined later in
  that thread, there will be no need for your patch.


Hi,

I may be missing something, but from a scan of that thread it seems to
me that the two patches are actually mutually beneficial.

My patch makes the check for cycles in the directory graph cheaper, but
the limitation of PATH_MAX meant that the maximum penalty for this
check wasn't ever that significant.

If the cap on recursion imposed by PATH_MAX were to be lifted then the
performance benefit realized from a cheaper cycle check would be more
significant.

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Improve memory management in join

2008-04-22 Thread Bo Borgerson
Hi,

This improves the performance of `join' by reducing memory management
overhead and eliminating unnecessary copies for order checking:

$ valgrind src/join.master ja jb
==23744== malloc/free: 4,571,152 allocs, 4,571,152 frees, 255,971,774 bytes allocated.

$ valgrind src/join ja jb
==23738== malloc/free: 1,405 allocs, 1,405 frees, 65,858 bytes allocated.

$ time src/join.master ja jb
user	0m27.126s

$ time src/join ja jb
user	0m17.297s
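
(ja and jb aren't attached; something like the following generates
comparably large, pre-sorted stand-ins if anyone wants to reproduce the
comparison:)

$ seq -w 1 2000000 | sed 's/$/ left/'  > ja
$ seq -w 1 2000000 | sed 's/$/ right/' > jb
$ time src/join ja jb > /dev/null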

Thanks,

Bo
From 623a2f43593093b3fb8cde9472bf5ecec652b6d3 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Tue, 22 Apr 2008 16:19:58 -0400
Subject: [PATCH] Improve memory management in join

* src/join.c (struct seq): Use a (struct line **) for `lines' rather than
one long (struct line *).  This allows individual lines to be swapped out
if necessary.
(reset_line): Get a line ready for new input.
(init_linep): Create a new line and assign it to the the pointer passed in.
(spareline[2]): Hold a spare line for each input file.
(free_spareline): Clean up.
(get_line): Take a (struct line **) instead of a (struct line *).  If the
line to be overwritten is the previous line for the current file then swap
it out for the spare.
(join): Accomodate new structure of SEQs and new parameters to get_line;
Don't free stale lines until the end -- they're re-usable now.
(dup_line): Removed.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 src/join.c |  169 +++
 1 files changed, 89 insertions(+), 80 deletions(-)

diff --git a/src/join.c b/src/join.c
index b8a0011..56eabc5 100644
--- a/src/join.c
+++ b/src/join.c
@@ -40,6 +40,12 @@
 
 #define join system_join
 
+#define SWAPLINES(a, b) do { \
+  struct line *tmp = a; \
+  a = b; \
+  b = tmp; \
+} while (0);
+
 /* An element of the list identifying which fields to print for each
output line.  */
 struct outlist
@@ -76,14 +82,19 @@ struct seq
   {
 size_t count;			/* Elements used in `lines'.  */
 size_t alloc;			/* Elements allocated in `lines'.  */
-struct line *lines;
+struct line **lines;
   };
 
 /* The name this program was run with.  */
 char *program_name;
 
 /* The previous line read from each file. */
-static struct line *prevline[2];
+static struct line *prevline[2] = {NULL, NULL};
+
+/* This provides an extra line buffer for each file.  We need these if we
+   try to read two consecutive lines into the same buffer, since we don't
+   want to overwrite the previous buffer before we check order. */
+static struct line *spareline[2] = {NULL, NULL};
 
 /* True if the LC_COLLATE locale is hard.  */
 static bool hard_LC_COLLATE;
@@ -260,33 +271,6 @@ xfields (struct line *line)
   extract_field (line, ptr, lim - ptr);
 }
 
-static struct line *
-dup_line (const struct line *old)
-{
-  struct line *newline = xmalloc (sizeof *newline);
-  size_t i;
-
-  /* Duplicate the buffer. */
-  initbuffer (&newline->buf);
-  newline->buf.buffer = xmalloc (old->buf.size);
-  newline->buf.size = old->buf.size;
-  memcpy (newline->buf.buffer, old->buf.buffer, old->buf.length);
-  newline->buf.length = old->buf.length;
-
-  /* Duplicate the field positions. */
-  newline->fields = xnmalloc (old->nfields_allocated, sizeof *newline->fields);
-  newline->nfields = old->nfields;
-  newline->nfields_allocated = old->nfields_allocated;
-
-  for (i = 0; i < old->nfields; i++)
-{
-  newline->fields[i].len = old->fields[i].len;
-  newline->fields[i].beg = newline->buf.buffer + (old->fields[i].beg
-		  - old->buf.buffer);
-}
-  return newline;
-}
-
 static void
 freeline (struct line *line)
 {
@@ -393,49 +377,69 @@ check_order (const struct line *prev,
 }
 }
 
+static inline void
+reset_line (struct line *line)
+{
+  line->nfields = 0;
+}
+
+static struct line *
+init_linep (struct line **linep)
+{
+  struct line *line = xmalloc (sizeof *line);
+  memset (line, '\0', sizeof *line);
+  *linep = line;
+  return line;
+}
+
 /* Read a line from FP into LINE and split it into fields.
Return true if successful.  */
 
 static bool
-get_line (FILE *fp, struct line *line, int which)
+get_line (FILE *fp, struct line **linep, int which)
 {
-  initbuffer (&line->buf);
+  struct line *line = *linep;
+
+  if (line == prevline[which - 1])
+{
+  SWAPLINES (line, spareline[which - 1]);
+  *linep = line;
+}
+
+  if (line)
+reset_line (line);
+  else
+line = init_linep (linep);
 
   if (! readlinebuffer (&line->buf, fp))
 {
   if (ferror (fp))
 	error (EXIT_FAILURE, errno, _(read error));
-  free (line->buf.buffer);
-  line->buf.buffer = NULL;
+  freeline (line);
   return false;
 }
 
-  line->nfields_allocated = 0;
-  line->nfields = 0;
-  line->fields = NULL;
   xfields (line);
 
   if (prevline[which - 1])
-{
-  check_order (prevline[which - 1], line, which);
-  freeline (prevline[which - 1]);
-  free (prevline[which - 1]);
-}
-  prevline[which - 1] = dup_line (line);
+check_order (prevline[which - 1

Re: [PATCH] Make comm check order of input files

2008-04-21 Thread Bo Borgerson
Hi,

The previous version did not warn if the final record in a file was
out of order and `--check-order' was not in effect.

Thanks,

Bo
From dc34eed9e6ee34f473a8d74b98bccaf082fe79c2 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 20 Apr 2008 21:24:16 -0400
Subject: [PATCH] Make comm check order of input files

* NEWS: List new behavior.
* doc/coreutils.texi (checkOrderOption) New macro for
describing `--check-order' and `--nocheck-order', used in
both join and comm.
* src/comm.c (main): Initialize new options.
(usage): Describe new options.
(compare_files): Keep an extra pair of buffers for the previous
line from each file to check the internal order.
(check_order): If an order-check is required, compare and handle
the result appropriately.
(copylinebuffer): Copy a linebuffer; used for copy before read.
* tests/misc/Makefile.am: List new test.
* tests/misc/comm: Tests for the comm program, including the
new order-checking functionality and attendant command-line options.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |8 ++
 doc/coreutils.texi |   39 +++---
 src/comm.c |  178 +++
 tests/misc/Makefile.am |1 +
 tests/misc/comm|  131 +++
 5 files changed, 329 insertions(+), 28 deletions(-)
 create mode 100755 tests/misc/comm

diff --git a/NEWS b/NEWS
index 04893c6..4038da2 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,13 @@
 GNU coreutils NEWS-*- outline -*-
 
+* Noteworthy changes in release ??
+
+** New features
+
+  comm now verifies that the inputs are in sorted order.  This check can
+  be turned off with the --nocheck-order option.
+
+
 * Noteworthy changes in release 6.11 (2008-04-19) [stable]
 
 ** Bug fixes
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f42e736..5ed7f43 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4342,6 +4342,32 @@ status that does not depend on the result of the comparison.
 Upon normal completion @command{comm} produces an exit code of zero.
 If there is an error it exits with nonzero status.
 
[EMAIL PROTECTED] checkOrderOption{cmd}
+If the @option{--check-order} option is given, unsorted inputs will
+cause a fatal error message.  If the option @option{--nocheck-order}
+is given, unsorted inputs will never cause an error message.  If
+neither of these options is given, wrongly sorted inputs are diagnosed
+only if an input file is found to contain unpairable lines.  If an
+input file is diagnosed as being unsorted, the @command{\cmd\} command
+will exit with a nonzero status (and the output should not be used).
+
+Forcing @command{\cmd\} to process wrongly sorted input files
+containing unpairable lines by specifying @option{--nocheck-order} is
+not guaranteed to produce any particular output.  The output will
+probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED] macro
[EMAIL PROTECTED]
+
[EMAIL PROTECTED] @samp
+
[EMAIL PROTECTED] --check-order
+Fail with an error message if either input file is wrongly ordered.
+
[EMAIL PROTECTED] --nocheck-order
+Do not check that both input files are in sorted order.
+
[EMAIL PROTECTED] table
+
 
 @node tsort invocation
 @section @command{tsort}: Topological sort
@@ -5183,18 +5209,7 @@ c c1 c2
 b b1 b2
 @end example
 
-If the @option{--check-order} option is given, unsorted inputs will
-cause a fatal error message.  If the option @option{--nocheck-order}
-is given, unsorted inputs will never cause an error message.  If
-neither of these options is given, wrongly sorted inputs are diagnosed
-only if an input file is found to contain unpairable lines.  If an
-input file is diagnosed as being unsorted, the @command{join} command
-will exit with a nonzero status (and the output should not be used).
-
-Forcing @command{join} to process wrongly sorted input files
-containing unpairable lines by specifying @option{--nocheck-order} is
-not guaranteed to produce any particular output.  The output will
-probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED]
 
 The defaults are:
 @itemize
diff --git a/src/comm.c b/src/comm.c
index cbda362..0a9e8b9 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -52,8 +52,31 @@ static bool only_file_2;
 /* If true, print lines that are found in both files. */
 static bool both;
 
+/* If nonzero, we have seen at least one unpairable line. */
+static bool seen_unpairable;
+
+/* If nonzero, we have warned about disorder in that file. */
+static bool issued_disorder_warning[2];
+
+/* If nonzero, check that the input is correctly ordered. */
+static enum
+  {
+CHECK_ORDER_DEFAULT,
+CHECK_ORDER_ENABLED,
+CHECK_ORDER_DISABLED
+  } check_input_order;
+
+enum
+{
+  CHECK_ORDER_OPTION = CHAR_MAX + 1,
+  NOCHECK_ORDER_OPTION
+};
+
+
 static struct option const long_options[] =
 {
+  {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
+  {nocheck-order

Re: [PATCH] Make comm check order of input files

2008-04-21 Thread Bo Borgerson
Hi,

Pádraig pointed out that there's no reason to copy data around here.

This version avoids the copies.

Thanks Pádraig,

Bo
From 49ec3883efc8a89e8a4260f25bb50178aced1be4 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 20 Apr 2008 21:24:16 -0400
Subject: [PATCH] Make comm check order of input files

* NEWS: List new behavior.
* doc/coreutils.texi (checkOrderOption) New macro for
describing `--check-order' and `--nocheck-order', used in
both join and comm.
* src/comm.c (main): Initialize new options.
(usage): Describe new options.
(compare_files): Keep an extra pair of buffers for the previous
line from each file to check the internal order.
(check_order): If an order-check is required, compare and handle
the result appropriately.
(copylinebuffer): Copy a linebuffer; used for copy before read.
* tests/misc/Makefile.am: List new test.
* tests/misc/comm: Tests for the comm program, including the
new order-checking functionality and attendant command-line options.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |8 ++
 doc/coreutils.texi |   39 
 src/comm.c |  166 ++--
 tests/misc/Makefile.am |1 +
 tests/misc/comm|  131 ++
 5 files changed, 313 insertions(+), 32 deletions(-)
 create mode 100755 tests/misc/comm

diff --git a/NEWS b/NEWS
index 04893c6..4038da2 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,13 @@
 GNU coreutils NEWS-*- outline -*-
 
+* Noteworthy changes in release ??
+
+** New features
+
+  comm now verifies that the inputs are in sorted order.  This check can
+  be turned off with the --nocheck-order option.
+
+
 * Noteworthy changes in release 6.11 (2008-04-19) [stable]
 
 ** Bug fixes
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f42e736..5ed7f43 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4342,6 +4342,32 @@ status that does not depend on the result of the comparison.
 Upon normal completion @command{comm} produces an exit code of zero.
 If there is an error it exits with nonzero status.
 
[EMAIL PROTECTED] checkOrderOption{cmd}
+If the @option{--check-order} option is given, unsorted inputs will
+cause a fatal error message.  If the option @option{--nocheck-order}
+is given, unsorted inputs will never cause an error message.  If
+neither of these options is given, wrongly sorted inputs are diagnosed
+only if an input file is found to contain unpairable lines.  If an
+input file is diagnosed as being unsorted, the @command{\cmd\} command
+will exit with a nonzero status (and the output should not be used).
+
+Forcing @command{\cmd\} to process wrongly sorted input files
+containing unpairable lines by specifying @option{--nocheck-order} is
+not guaranteed to produce any particular output.  The output will
+probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED] macro
[EMAIL PROTECTED]
+
[EMAIL PROTECTED] @samp
+
[EMAIL PROTECTED] --check-order
+Fail with an error message if either input file is wrongly ordered.
+
[EMAIL PROTECTED] --nocheck-order
+Do not check that both input files are in sorted order.
+
[EMAIL PROTECTED] table
+
 
 @node tsort invocation
 @section @command{tsort}: Topological sort
@@ -5183,18 +5209,7 @@ c c1 c2
 b b1 b2
 @end example
 
-If the @option{--check-order} option is given, unsorted inputs will
-cause a fatal error message.  If the option @option{--nocheck-order}
-is given, unsorted inputs will never cause an error message.  If
-neither of these options is given, wrongly sorted inputs are diagnosed
-only if an input file is found to contain unpairable lines.  If an
-input file is diagnosed as being unsorted, the @command{join} command
-will exit with a nonzero status (and the output should not be used).
-
-Forcing @command{join} to process wrongly sorted input files
-containing unpairable lines by specifying @option{--nocheck-order} is
-not guaranteed to produce any particular output.  The output will
-probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED]
 
 The defaults are:
 @itemize
diff --git a/src/comm.c b/src/comm.c
index cbda362..b2b2bba 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -52,8 +52,31 @@ static bool only_file_2;
 /* If true, print lines that are found in both files. */
 static bool both;
 
+/* If nonzero, we have seen at least one unpairable line. */
+static bool seen_unpairable;
+
+/* If nonzero, we have warned about disorder in that file. */
+static bool issued_disorder_warning[2];
+
+/* If nonzero, check that the input is correctly ordered. */
+static enum
+  {
+CHECK_ORDER_DEFAULT,
+CHECK_ORDER_ENABLED,
+CHECK_ORDER_DISABLED
+  } check_input_order;
+
+enum
+{
+  CHECK_ORDER_OPTION = CHAR_MAX + 1,
+  NOCHECK_ORDER_OPTION
+};
+
+
 static struct option const long_options[] =
 {
+  {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
+  {nocheck-order

Re: [Coreutils-announce] coreutils-6.11 released

2008-04-20 Thread Bo Borgerson
On Sun, Apr 20, 2008 at 5:24 PM, Jim Meyering [EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] (Karl Berry) wrote:
 join now verifies that the inputs are in sorted order.  This check can
  
   How about doing the same for comm?

  Makes sense.  Did you just volunteer? ;-)


If not, I'll be happy to do it.

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Make comm check order of input files

2008-04-20 Thread Bo Borgerson
On Sun, Apr 20, 2008 at 8:35 PM, Karl Berry [EMAIL PROTECTED] wrote:
 If not, I'll be happy to do it.

  Please!

Here's a patch.

Bo
From 1a651ab6aedea0d0cc383f2e60c82fe7f0d395f0 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 20 Apr 2008 21:24:16 -0400
Subject: [PATCH] Make comm check order of input files

* NEWS: List new behavior.
* doc/coreutils.texi (checkOrderOption) New macro for
describing `--check-order' and `--nocheck-order', used in
both join and comm.
* src/comm.c (main): Initialize new options.
(usage): Describe new options.
(compare_files): Keep an extra pair of buffers for the previous
line from each file to check the internal order.
(check_order): If an order-check is required, compare and handle
the result appropriately.
(copylinebuffer): Copy a linebuffer; used for copy before read.
* tests/misc/Makefile.am: List new test.
* tests/misc/comm: Tests for the comm program, including the
new order-checking functionality and attendant command-line options.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |8 +++
 doc/coreutils.texi |   39 
 src/comm.c |  158 +++-
 tests/misc/Makefile.am |1 +
 tests/misc/comm|  121 
 5 files changed, 300 insertions(+), 27 deletions(-)
 create mode 100755 tests/misc/comm

diff --git a/NEWS b/NEWS
index 04893c6..4038da2 100644
--- a/NEWS
+++ b/NEWS
@@ -1,5 +1,13 @@
 GNU coreutils NEWS-*- outline -*-
 
+* Noteworthy changes in release ??
+
+** New features
+
+  comm now verifies that the inputs are in sorted order.  This check can
+  be turned off with the --nocheck-order option.
+
+
 * Noteworthy changes in release 6.11 (2008-04-19) [stable]
 
 ** Bug fixes
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f42e736..5ed7f43 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4342,6 +4342,32 @@ status that does not depend on the result of the comparison.
 Upon normal completion @command{comm} produces an exit code of zero.
 If there is an error it exits with nonzero status.
 
[EMAIL PROTECTED] checkOrderOption{cmd}
+If the @option{--check-order} option is given, unsorted inputs will
+cause a fatal error message.  If the option @option{--nocheck-order}
+is given, unsorted inputs will never cause an error message.  If
+neither of these options is given, wrongly sorted inputs are diagnosed
+only if an input file is found to contain unpairable lines.  If an
+input file is diagnosed as being unsorted, the @command{\cmd\} command
+will exit with a nonzero status (and the output should not be used).
+
+Forcing @command{\cmd\} to process wrongly sorted input files
+containing unpairable lines by specifying @option{--nocheck-order} is
+not guaranteed to produce any particular output.  The output will
+probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED] macro
[EMAIL PROTECTED]
+
[EMAIL PROTECTED] @samp
+
[EMAIL PROTECTED] --check-order
+Fail with an error message if either input file is wrongly ordered.
+
[EMAIL PROTECTED] --nocheck-order
+Do not check that both input files are in sorted order.
+
[EMAIL PROTECTED] table
+
 
 @node tsort invocation
 @section @command{tsort}: Topological sort
@@ -5183,18 +5209,7 @@ c c1 c2
 b b1 b2
 @end example
 
-If the @option{--check-order} option is given, unsorted inputs will
-cause a fatal error message.  If the option @option{--nocheck-order}
-is given, unsorted inputs will never cause an error message.  If
-neither of these options is given, wrongly sorted inputs are diagnosed
-only if an input file is found to contain unpairable lines.  If an
-input file is diagnosed as being unsorted, the @command{join} command
-will exit with a nonzero status (and the output should not be used).
-
-Forcing @command{join} to process wrongly sorted input files
-containing unpairable lines by specifying @option{--nocheck-order} is
-not guaranteed to produce any particular output.  The output will
-probably not correspond with whatever you hoped it would be.
[EMAIL PROTECTED]
 
 The defaults are:
 @itemize
diff --git a/src/comm.c b/src/comm.c
index cbda362..5b1e5a2 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -52,8 +52,31 @@ static bool only_file_2;
 /* If true, print lines that are found in both files. */
 static bool both;
 
+/* If nonzero, we have seen at least one unpairable line. */
+static bool seen_unpairable;
+
+/* If nonzero, we have warned about disorder in that file. */
+static bool issued_disorder_warning[2];
+
+/* If nonzero, check that the input is correctly ordered. */
+static enum
+  {
+CHECK_ORDER_DEFAULT,
+CHECK_ORDER_ENABLED,
+CHECK_ORDER_DISABLED
+  } check_input_order;
+
+enum
+{
+  CHECK_ORDER_OPTION = CHAR_MAX + 1,
+  NOCHECK_ORDER_OPTION
+};
+
+
 static struct option const long_options[] =
 {
+  {"check-order", no_argument, NULL, CHECK_ORDER_OPTION},
+  {nocheck

[PATCH] Use a hash rather than a linked-list for cycle check in cp

2008-04-16 Thread Bo Borgerson
This addresses a FIXME in src/copy.c:


-/* FIXME: rewrite this to use a hash table so we avoid the quadratic
-   performance hit that's probably noticeable only on trees deeper
-   than a few hundred levels.  See use of active_dir_map in remove.c  */


The performance benefit is there, but on my machine with a PATH_MAX of
4096 it's hard to see, because the userland work `cp' does is dwarfed
by the work the kernel does on its behalf:


$ time src/cp.master -r a b

real	0m54.032s
user	0m3.236s    <-- coreutils HEAD
sys	0m47.335s

$ time src/cp -r a c

real	0m53.475s
user	0m0.624s    <-- with patch
sys	0m48.639s


Thanks,

Bo
From 224328e4bda44aa25cd5c98b1c13751ecea865c7 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Wed, 16 Apr 2008 13:44:36 -0400
Subject: [PATCH] Use a hash rather than a linked-list for cycle check in cp

* NEWS: List the change of cycle check behavior.
* src/copy.c (struct dir_info): These go in the hash.
(static Hash_table *ancestry_hash): This is the hash.
(static size_t dir_info_hasher): Simple hasher based on INO.
(static bool dir_info_comparator): Checks equivalence of INO and DEV.
(static bool ancestry_insert): Insert a dir_info entry for the current dir.
(static bool ancestry_delete): Delete the hash entry for the current dir.
(copy_internal): Now calls ancestry_insert and ancestry_delete instead of
managing the linked-list inline and calling is_ancestor for the cycle check.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |2 +
 src/copy.c |  106 +++
 2 files changed, 72 insertions(+), 36 deletions(-)

diff --git a/NEWS b/NEWS
index 3a584e9..111e0c1 100644
--- a/NEWS
+++ b/NEWS
@@ -69,6 +69,8 @@ GNU coreutils NEWS-*- outline -*-
 
   seq gives better diagnostics for invalid formats.
 
+  cp now uses a hash for cycle checks rather than a linked-list of ancestry.
+
 ** Portability
 
   rm now works properly even on systems like BeOS and Haiku,
diff --git a/src/copy.c b/src/copy.c
index c2f21a3..8e673f0 100644
--- a/src/copy.c
+++ b/src/copy.c
@@ -83,19 +83,74 @@ rpl_mkfifo (char const *file, mode_t mode)
 #define SAME_GROUP(A, B) ((A).st_gid == (B).st_gid)
 #define SAME_OWNER_AND_GROUP(A, B) (SAME_OWNER (A, B) && SAME_GROUP (A, B))
 
-struct dir_list
+
+/* This is for looking up directories by inode/device to ensure we don't
+   have any cycles. */
+static Hash_table *ancestors_hash;
+
+enum { INIT_ANCESTRY_SIZE = 47 };
+
+struct dir_info
 {
-  struct dir_list *parent;
   ino_t ino;
   dev_t dev;
 };
 
+static size_t
+dir_info_hasher (const void *entry, size_t tabsize)
+{
+  const struct dir_info *node = entry;
+  return node->ino % tabsize;
+}
+
+static bool
+dir_info_comparator (const void *e1, const void *e2)
+{
+  const struct dir_info *n1 = e1, *n2 = e2;
+  return n1->ino == n2->ino && n1->dev == n2->dev;
+}
+
+/* First check whether this directory has been seen.  If it has then
+   return false because we've encountered a cycle in the graph. If it
+   hasn't, then insert it into the ancestry hash and return true. */
+
+static bool
+ancestry_insert (struct dir_info *dir, struct stat *src_sb)
+{
+  dir->ino = src_sb->st_ino;
+  dir->dev = src_sb->st_dev;
+
+  if (! ancestors_hash)
+{
+  ancestors_hash = hash_initialize (INIT_ANCESTRY_SIZE, NULL,
+	dir_info_hasher,
+	dir_info_comparator, NULL);
+  if (! ancestors_hash)
+	xalloc_die ();
+}
+
+  if (hash_lookup (ancestors_hash, dir))
+return false;
+
+  hash_insert (ancestors_hash, dir);
+
+  return true;
+}
+
+/* This is called when recursion through DIR is complete.  Note that
+   `*dir' is actually a pointer into the stack of the caller, so there's
+   no free here. */
+static inline void
+ancestry_delete (const struct dir_info *dir)
+{
+  hash_delete (ancestors_hash, dir);
+}
+
 /* Initial size of the cp.dest_info hash table.  */
 #define DEST_INFO_INITIAL_CAPACITY 61
 
 static bool copy_internal (char const *src_name, char const *dst_name,
 			   bool new_dst, dev_t device,
-			   struct dir_list *ancestors,
 			   const struct cp_options *x,
 			   bool command_line_arg,
 			   bool *copy_into_self,
@@ -110,23 +165,6 @@ static char const *top_level_dst_name;
 /* The invocation name of this program.  */
 extern char *program_name;
 
-/* FIXME: describe */
-/* FIXME: rewrite this to use a hash table so we avoid the quadratic
-   performance hit that's probably noticeable only on trees deeper
-   than a few hundred levels.  See use of active_dir_map in remove.c  */
-
-static bool
-is_ancestor (const struct stat *sb, const struct dir_list *ancestors)
-{
-  while (ancestors != 0)
-{
-  if (ancestors->ino == sb->st_ino && ancestors->dev == sb->st_dev)
-	return true;
-  ancestors = ancestors->parent;
-}
-  return false;
-}
-
 /* Read the contents of the directory SRC_NAME_IN, and recursively
copy the contents to DST_NAME_IN.  NEW_DST is true

Re: [PATCH] Add new program: zargs

2008-04-15 Thread Bo Borgerson
Hi,

I noticed that the name of the --files0-from option description macro
in doc/coreutils.texi changed, so I updated the call in this patch.

I also replaced the polling reaper with a SIGCHLD sigaction handler.

Thanks,

Bo

On Fri, Apr 11, 2008 at 11:24 AM, Bo Borgerson [EMAIL PROTECTED] wrote:
 Hi,

  This is a much trimmed-back version of the program I previously
  submitted as `magic'.

  Thanks to Bob Proulx for the feedback.

  The zargs program automatically decompresses inputs to a program.

  $ zargs wc input/*
 177  976  6908 /tmp/zargsE0oy5L/TODO.bz2
 177  976  6908 /tmp/zargsE0oy5L/TODO.gz
 177  976  6908 input/TODO.txt
 531 2928 20724 total

  Thanks,

  Bo

From edfec1729e6e042ff0af8ed09070a74540eeccb7 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 6 Apr 2008 17:54:08 -0400
Subject: [PATCH] Add new program: zargs

* AUTHORS: Register as the author.
* NEWS: Advertise new program.
* README: List new program.
* doc/coreutils.texi: Describe new program.
* man/Makefile.am: Add new program.
* man/zargs.x: Add new man page template.
* po/POTFILES.in: Add new program.
* src/Makefile.am: Add new program.
* src/zargs.c: Add new program.
* tests/misc/Makefile.am: Add new test.
* tests/misc/help-version: Accommodate new program.
* tests/misc/zargs: Test new program.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 AUTHORS |1 +
 NEWS|3 +
 README  |1 +
 doc/coreutils.texi  |   74 -
 man/Makefile.am |1 +
 man/zargs.x |4 +
 po/POTFILES.in  |1 +
 src/Makefile.am |2 +-
 src/zargs.c |  901 +++
 tests/misc/Makefile.am  |3 +-
 tests/misc/help-version |1 +
 tests/misc/zargs|   82 +
 12 files changed, 1070 insertions(+), 4 deletions(-)
 create mode 100644 man/zargs.x
 create mode 100644 src/zargs.c
 create mode 100755 tests/misc/zargs

diff --git a/AUTHORS b/AUTHORS
index 807857f..5c1c6d1 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -100,3 +100,4 @@ wc: Paul Rubin, David MacKenzie
 who: Joseph Arceneaux, David MacKenzie, Michael Stone
 whoami: Richard Mlynarik
 yes: David MacKenzie
+zargs: Bo Borgerson
diff --git a/NEWS b/NEWS
index 3a584e9..5ade4d2 100644
--- a/NEWS
+++ b/NEWS
@@ -87,6 +87,9 @@ GNU coreutils NEWS-*- outline -*-
   Fix a non-portable use of sed in configure.ac.
   [bug introduced in coreutils-6.9.92]
 
+** New programs
+
+zargs: run a program with automatically decompressed inputs
 
 * Noteworthy changes in release 6.9.92 (2008-01-12) [beta]
 
diff --git a/README b/README
index 7a608f4..4092411 100644
--- a/README
+++ b/README
@@ -15,6 +15,7 @@ The programs that can be built with this package are:
   runcon seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf
   sleep sort split stat stty su sum sync tac tail tee test touch tr true
   tsort tty uname unexpand uniq unlink uptime users vdir wc who whoami yes
+  zargs
 
 See the file NEWS for a list of major changes in the current release.
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index f42e736..df0875b 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -126,6 +126,7 @@
 * who: (coreutils)who invocation.   Print who is logged in.
 * whoami: (coreutils)whoami invocation. Print effective user ID.
 * yes: (coreutils)yes invocation.   Print a string indefinitely.
+* zargs: (coreutils)zargs invocation.   Run with decompressed inputs.
 @end direntry
 
 @copying
@@ -190,7 +191,7 @@ Free Documentation License''.
 * Working context::pwd stty printenv tty
 * User information::   id logname whoami groups users who
 * System context:: date uname hostname hostid
-* Modified command invocation::chroot env nice nohup su
+* Modified command invocation::chroot env nice nohup su zargs
 * Process control::kill
 * Delaying::   sleep
 * Numeric operations:: factor seq
@@ -421,6 +422,7 @@ Modified command invocation
 * nice invocation::  Run a command with modified niceness
 * nohup invocation:: Run a command immune to hangups
 * su invocation::Run a command with substitute user and group ID
+* zargs invocation:: Run a command with decompressed inputs
 
 Process control
 
@@ -680,7 +682,8 @@ meanings with the values @samp{0} and @samp{1}.
 Here are some of the exceptions:
 @command{chroot}, @command{env}, @command{expr},
 @command{nice}, @command{nohup}, @command{printenv},
[EMAIL PROTECTED], @command{su}, @command{test}, @command{tty}.
[EMAIL PROTECTED], @command{su}, @command{test}, @command{tty},
[EMAIL PROTECTED]
 
 
 @node Backup options
@@ -13361,6 +13364,7 @@ user, etc.
 * nice invocation:: Modify niceness

Re: [PATCH] Add new program: zargs

2008-04-15 Thread Bo Borgerson
On Tue, Apr 15, 2008 at 9:01 AM, Pádraig Brady [EMAIL PROTECTED] wrote:
  How does this compare with `zrun` from moreutils?
  Would it be more appropriate to merge zargs into moreutils?

Hi Pádraig,

Yes, it looks like zrun attempts to perform a similar task.

From a quick peek I notice a few things:

1. It checks file extensions, rather than `magic' bytes.  I think in
Perl there is a File::Type module that might help here.
2. It opens temporary files instead of FIFOs.  This is a potential
storage issue.  It also means that zrun waits for all decompression to
complete before invoking its COMMAND.
3. It opens files before forking, so is potentially rlimited beyond
what the child could handle.
4. It doesn't appear to clean up temporary files if killed with a signal.
5. It doesn't support false-positives (eg: `file.gz' that's not
actually compressed).
6. It doesn't support the --files0-from=F calling convention.

The major reason I'd prefer to submit this tool here is that I hope to
make it as robust as possible.  The coreutils package is attended to
by world-class developers.  Any tool that has undergone the scrutiny
of the subscribers to this list is likely to come out better as a
result, even if it's ultimately rejected -- though I hope it's not ;).

Thanks,

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Add new program: zargs

2008-04-11 Thread Bo Borgerson
Hi,

This is a much trimmed-back version of the program I previously
submitted as `magic'.

Thanks to Bob Proulx for the feedback.

The zargs program automatically decompresses inputs to a program.

$ zargs wc input/*
177  976  6908 /tmp/zargsE0oy5L/TODO.bz2
177  976  6908 /tmp/zargsE0oy5L/TODO.gz
177  976  6908 input/TODO.txt
531 2928 20724 total

Thanks,

Bo
From 42e176d01982f038f12859880c530d3a49a5f4ac Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 6 Apr 2008 17:54:08 -0400
Subject: [PATCH] Add new program: zargs

* AUTHORS: Register as the author.
* NEWS: Advertise new program.
* README: List new program.
* doc/coreutils.texi: Describe new program.
* man/Makefile.am: Add new program.
* man/zargs.x: Add new man page template.
* po/POTFILES.in: Add new program.
* src/Makefile.am: Add new program.
* src/zargs.c: Add new program.
* tests/misc/Makefile.am: Add new test.
* tests/misc/help-version: Accommodate new program.
* tests/misc/zargs: Test new program.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 AUTHORS |1 +
 NEWS|3 +
 README  |1 +
 doc/coreutils.texi  |   74 -
 man/Makefile.am |1 +
 man/zargs.x |4 +
 po/POTFILES.in  |1 +
 src/Makefile.am |2 +-
 src/zargs.c |  910 +++
 tests/misc/Makefile.am  |3 +-
 tests/misc/help-version |1 +
 tests/misc/zargs|   82 +
 12 files changed, 1079 insertions(+), 4 deletions(-)
 create mode 100644 man/zargs.x
 create mode 100644 src/zargs.c
 create mode 100755 tests/misc/zargs

diff --git a/AUTHORS b/AUTHORS
index 807857f..5c1c6d1 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -100,3 +100,4 @@ wc: Paul Rubin, David MacKenzie
 who: Joseph Arceneaux, David MacKenzie, Michael Stone
 whoami: Richard Mlynarik
 yes: David MacKenzie
+zargs: Bo Borgerson
diff --git a/NEWS b/NEWS
index e208b30..69f548f 100644
--- a/NEWS
+++ b/NEWS
@@ -82,6 +82,9 @@ GNU coreutils NEWS-*- outline -*-
   Fix a non-portable use of sed in configure.ac.
   [bug introduced in coreutils-6.9.92]
 
+** New programs
+
+zargs: run a program with automatically decompressed inputs
 
 * Noteworthy changes in release 6.9.92 (2008-01-12) [beta]
 
diff --git a/README b/README
index 7a608f4..4092411 100644
--- a/README
+++ b/README
@@ -15,6 +15,7 @@ The programs that can be built with this package are:
   runcon seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf
   sleep sort split stat stty su sum sync tac tail tee test touch tr true
   tsort tty uname unexpand uniq unlink uptime users vdir wc who whoami yes
+  zargs
 
 See the file NEWS for a list of major changes in the current release.
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 5a6f2c3..348ec30 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -126,6 +126,7 @@
 * who: (coreutils)who invocation.   Print who is logged in.
 * whoami: (coreutils)whoami invocation. Print effective user ID.
 * yes: (coreutils)yes invocation.   Print a string indefinitely.
+* zargs: (coreutils)zargs invocation.   Run with decompressed inputs.
 @end direntry
 
 @copying
@@ -190,7 +191,7 @@ Free Documentation License''.
 * Working context::pwd stty printenv tty
 * User information::   id logname whoami groups users who
 * System context:: date uname hostname hostid
-* Modified command invocation::chroot env nice nohup su
+* Modified command invocation::chroot env nice nohup su zargs
 * Process control::kill
 * Delaying::   sleep
 * Numeric operations:: factor seq
@@ -421,6 +422,7 @@ Modified command invocation
 * nice invocation::  Run a command with modified niceness
 * nohup invocation:: Run a command immune to hangups
 * su invocation::Run a command with substitute user and group ID
+* zargs invocation:: Run a command with decompressed inputs
 
 Process control
 
@@ -680,7 +682,8 @@ meanings with the values @samp{0} and @samp{1}.
 Here are some of the exceptions:
 @command{chroot}, @command{env}, @command{expr},
 @command{nice}, @command{nohup}, @command{printenv},
[EMAIL PROTECTED], @command{su}, @command{test}, @command{tty}.
[EMAIL PROTECTED], @command{su}, @command{test}, @command{tty},
[EMAIL PROTECTED]
 
 
 @node Backup options
@@ -13358,6 +13361,7 @@ user, etc.
 * nice invocation:: Modify niceness.
 * nohup invocation::Immunize to hangups.
 * su invocation::   Modify user and group ID.
+* zargs invocation::Pipe compressed inputs through decompressors.
 @end menu
 
 
@@ -13509,6 +13513,72 @@ Exit status:
 the exit status of @var{command} otherwise
 @end display
 
[EMAIL PROTECTED

Re: [PATCH] Add new program: magic

2008-04-10 Thread Bo Borgerson
On Wed, Apr 9, 2008 at 9:48 PM, Bob Proulx [EMAIL PROTECTED] wrote:
  I like this sort of general purpose utility that can work with a broad
  set of things much better than hacks to every utility.  It is
  definitely a better direction.

  But I don't like the name.  The name is too generic and doesn't give a
  clue as to what it actually does.  This is probably better to name
  something like daylight-commander or some such (with apologies to
  Nortan and midnight-commander).

Point taken :)

I knew I would need to change the name.  I should have done so before
submitting the draft.  Do you have any suggestions for a more
descriptive name?  How about 'fargs', as a contraction of fifo-args?
I'll use that at least for the duration of this message.

  Additionally I have to note that bash (and probably other shells)
  already supply this capability in a generic way.

   sort <(zcat a) <(zcat b) c

Yeah, bash is great!

There are some differences, though:

$ fargs wc input/*
 314 1895 12183 /tmp/fargsBsoaWi/HACKING.gz
 560 1796 31786 /tmp/fargsBsoaWi/THANKS.bz2
 177  976  6908 input/TODO
1051 4667 50877 total

$ wc <(zcat input/HACKING.gz) <(bzcat input/THANKS.bz2) input/TODO
 314 1895 12183 /dev/fd/63
 560 1796 31786 /dev/fd/62
 177  976  6908 input/TODO
1051 4667 50877 total

First, with the bash syntax you need to enumerate the set of commands
you want to run.  Second, for programs that use filenames in output or
diagnostic messages, the fifos produced by fargs may be somewhat more
legible.

Also, so far as I know there isn't any way currently to use
--files0-from=F style argument passing, as with `wc' (and pending in
`sort').  There may be a trick here that I just haven't learned yet,
but I'd like to be able to do the following (more so with `sort', but
I'll use `wc' for illustration).

$ find input/ -type f -print0 | fargs wc --files0-from=-
 177  976  6908 input/TODO
 314 1895 12183 /tmp/fargsLywSy2/HACKING.gz
 560 1796 31786 /tmp/fargsLywSy2/THANKS.bz2
1051 4667 50877 total

  This is getting to be too heuristic driven (too error prone) for my
  tastes.

The precedence is `directive>option>file>command'.  There are some
cases where the interpretation may be surprising.

If you have a file in your current working directory named 'ls' and
you intend to use 'ls' as a sub-command, then:

$ fargs wc ls

Won't give you what you want.  For this reason I included the 'exec:' directive:

$ fargs wc exec:ls

If you manage to produce a compressed file named '-l' and you run:

$ fargs wc -l

You could be waiting for a long time.  For this reason I included the
'file:' directive:

$ fargs wc file:-l

Finally, because `fargs' doesn't know which options of the invoked
command take arguments, if you run:

$ fargs sort -o output input/*

and `output' already exists, you may be in for a surprise.  For this
reason I included the 'skip:' directive:

$ fargs sort -o skip:output input/*

Of course this last case could be avoided by using the long-option form:

$ fargs sort --output=output input/*

Note that regardless of whether arbitrary sub-commands are allowed,
this last case is an issue.
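
The directive handling itself can be a simple prefix check that runs before
the option/file/command heuristics.  A sketch of the idea (the enum and
helper names here are hypothetical, not taken from the attached source):

  #include <string.h>

  enum arg_kind { ARG_EXEC, ARG_FILE, ARG_SKIP, ARG_HEURISTIC };

  /* Classify one argument.  *REST is set to the text following the
     directive prefix, or to ARG itself when no directive is present.  */
  static enum arg_kind
  classify_arg (char const *arg, char const **rest)
  {
    static struct { char const *prefix; enum arg_kind kind; } const tab[] =
      {
        { "exec:", ARG_EXEC },    /* always treat as a sub-command */
        { "file:", ARG_FILE },    /* always treat as a literal file name */
        { "skip:", ARG_SKIP },    /* pass through to the command untouched */
      };
    size_t i;

    for (i = 0; i < sizeof tab / sizeof *tab; i++)
      {
        size_t len = strlen (tab[i].prefix);
        if (strncmp (arg, tab[i].prefix, len) == 0)
          {
            *rest = arg + len;
            return tab[i].kind;
          }
      }

    *rest = arg;
    return ARG_HEURISTIC;         /* fall back to option/file/command checks */
  }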

   diff <(ssh -n host1 cat /etc/passwd) <(ssh -n host2 cat /etc/passwd)

That one really looked like a nail when I'd just finished building my
shiny new hammer. ;)

Thanks,

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] Add new program: magic

2008-04-09 Thread Bo Borgerson
Hi,

As I mentioned last week, I've patched my local `sort' to allow
automatic decompression of input files if an option, --magic-open, is
passed on the command line.

As I thought more about this functionality I realized that it may be
more broadly useful.  Any utility that can operate on multiple input
files could benefit.  I wondered if it would be possible in a
non-invasive way to provide this service to other tools.  This is what
I came up with.

Instead of:
$ sort --magic-open a.gz b.bz2 c.txt

I run:
$ magic sort a.gz b.bz2 c.txt

This creates a temporary fifo and opens a decompression program for
each compressed input.  The corresponding files in the argument list
are replaced by these temporary fifos.  When the command completes,
the fifos are removed.  If a signal is received and execution is
terminated prematurely, the fifos are removed.
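
Mechanically, each compressed input becomes an mkfifo/fork pair, roughly as
sketched below.  Error handling is omitted and the decompressor is hard-coded
to gzip for brevity; in the real program it is chosen from the magic bytes,
and the helper name is illustrative.

  #include <sys/stat.h>
  #include <sys/types.h>
  #include <fcntl.h>
  #include <unistd.h>

  /* Start a child that decompresses COMPRESSED into FIFO_PATH; the caller
     hands FIFO_PATH to the target command in place of the original file
     name, and later waits for the child and unlinks the FIFO.  */
  static pid_t
  spawn_decompressor (char const *compressed, char const *fifo_path)
  {
    pid_t pid;

    mkfifo (fifo_path, S_IRUSR | S_IWUSR);

    pid = fork ();
    if (pid == 0)
      {
        /* Child: the open blocks until the command opens the FIFO for
           reading, then the decompressor's stdout feeds it.  */
        int fd = open (fifo_path, O_WRONLY);
        dup2 (fd, STDOUT_FILENO);
        execlp ("gzip", "gzip", "-d", "-c", "-f", compressed, (char *) NULL);
        _exit (127);
      }

    return pid;
  }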

Then I realized that this automatic fifo management might be more
useful still.  In addition to checking the `magic' bytes at the
beginning of regular files for known decompression programs, I thought
it might be useful to allow an arbitrary sub-command to be used as an
input.

For example, to compare the output of two versions of a program:
$ magic diff "ls -l" "src/ls -l"

Or to compare files on two remote machines:
$ magic diff "ssh host1 cat /etc/passwd" "ssh host2 cat /etc/passwd"

I've attached a basic implementation of this tool.  It supports both
magic-number-based auto-decompression of known formats and sub-command
inputs through auto-maintained fifos.  It supports both command-line
argument interpretation and --files0-from=F interpretation.  It
supports explicit directives to override interpretation precedence.

Is this something that might be worth including in coreutils?

Thanks,

Bo
From 0a16d0698590f137e04c8351f8f14383147e827f Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 6 Apr 2008 17:54:08 -0400
Subject: [PATCH] Add new program: magic

* AUTHORS: Register as the author.
* NEWS: Advertise new program.
* README: List new program.
* doc/coreutils.texi: Describe new program.
* man/Makefile.am: Add new program.
* man/magic.x: Add new man page template.
* po/POTFILES.in: Add new program.
* src/Makefile.am: Add new program.
* src/magic.c: Add new program.
* tests/misc/Makefile.am: Add new test.
* tests/misc/help-version: Accommodate new program.
* tests/misc/magic: Test new program.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 AUTHORS |1 +
 NEWS|3 +
 README  |2 +-
 doc/coreutils.texi  |   73 -
 man/Makefile.am |1 +
 man/magic.x |4 +
 po/POTFILES.in  |1 +
 src/Makefile.am |2 +-
 src/magic.c |  978 +++
 tests/misc/Makefile.am  |1 +
 tests/misc/help-version |1 +
 tests/misc/magic|   89 +
 12 files changed, 1152 insertions(+), 4 deletions(-)
 create mode 100644 man/magic.x
 create mode 100644 src/magic.c
 create mode 100755 tests/misc/magic

diff --git a/AUTHORS b/AUTHORS
index 807857f..a79bec3 100644
--- a/AUTHORS
+++ b/AUTHORS
@@ -42,6 +42,7 @@ link: Michael Stone
 ln: Mike Parker, David MacKenzie
 logname: FIXME: unknown
 ls: Richard Stallman, David MacKenzie
+magic: Bo Borgerson
 md5sum: Ulrich Drepper, Scott Miller, David Madore
 mkdir: David MacKenzie
 mkfifo: David MacKenzie
diff --git a/NEWS b/NEWS
index e208b30..48affd5 100644
--- a/NEWS
+++ b/NEWS
@@ -82,6 +82,9 @@ GNU coreutils NEWS-*- outline -*-
   Fix a non-portable use of sed in configure.ac.
   [bug introduced in coreutils-6.9.92]
 
+** New programs
+
+magic: run a program with multiple piped inputs
 
 * Noteworthy changes in release 6.9.92 (2008-01-12) [beta]
 
diff --git a/README b/README
index 7a608f4..548b832 100644
--- a/README
+++ b/README
@@ -10,7 +10,7 @@ The programs that can be built with this package are:
   [ arch base64 basename cat chcon chgrp chmod chown chroot cksum comm cp
   csplit cut date dd df dir dircolors dirname du echo env expand expr
   factor false fmt fold groups head hostid hostname id install join kill
-  link ln logname ls md5sum mkdir mkfifo mknod mktemp mv nice nl nohup
+  link ln logname ls magic md5sum mkdir mkfifo mknod mktemp mv nice nl nohup
   od paste pathchk pinky pr printenv printf ptx pwd readlink rm rmdir
   runcon seq sha1sum sha224sum sha256sum sha384sum sha512sum shred shuf
   sleep sort split stat stty su sum sync tac tail tee test touch tr true
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 5a6f2c3..274606b 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -75,6 +75,7 @@
 * ln: (coreutils)ln invocation. Make links between files.
 * logname: (coreutils)logname invocation.   Print current login name.
 * ls: (coreutils)ls invocation. List directory contents.
+* magic: (coreutils)magic invocation.   Run with piped inputs

[PATCH] Add support for --output-delimiter=STR to comm

2008-04-07 Thread Bo Borgerson
Hi,

I submitted a version of this on Friday.

This updated version includes the following changes:

* Separate fputs in usage to stay away from ~500 char limit.
* More appropriate commit message.
* No --null-output-delimiter option.  Just the --output-delimiter=STR,
as specified in the TODO item.

Thanks,

Bo
From 4edd3361a500d31b5bc2b645e93c4aef02c00cab Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Fri, 4 Apr 2008 20:40:58 -0400
Subject: [PATCH] Add support for --output-delimiter=STR to comm

* src/comm.c: (static char *delimiter) Points to the delimiter string.
* tests/misc/comm: Add new test file for comm.
* tests/misc/Makefile.am: Run comm tests.
* doc/coreutils.texi: Document new option.
* NEWS: Advertise new option.
* TODO: Remove associated item.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |3 +
 TODO   |5 --
 doc/coreutils.texi |   12 +
 src/comm.c |   45 ++---
 tests/misc/Makefile.am |1 +
 tests/misc/comm|  106 
 6 files changed, 161 insertions(+), 11 deletions(-)
 create mode 100755 tests/misc/comm

diff --git a/NEWS b/NEWS
index e208b30..48b65f4 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,9 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  comm accepts new option, --output-delimiter=STR, that allows specification
+  of an output delimiter other than the default single TAB.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/TODO b/TODO
index 86320b9..ffbdccf 100644
--- a/TODO
+++ b/TODO
@@ -14,11 +14,6 @@ document the following in coreutils.texi:
   uptime
 Also document the SELinux changes.
 
-comm: add an option, --output-delimiter=STR
-  Files to change: src/comm.c, ChangeLog, NEWS, doc/coreutils.texi,
-  Add a new file, tests/misc/comm (use another file in that directory as
-  a template), to exercise the new option.  Suggestion from Dan Jacobson.
-
 printf:
   Now that gnulib supports *printf(%a), import one of the
   *printf-posix modules so that printf(1) will support %a even on
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index ee7dbb2..93dd521 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4332,6 +4332,18 @@ Columns are separated by a single TAB character.
 The options @option{-1}, @option{-2}, and @option{-3} suppress printing of
 the corresponding columns.  Also see @ref{Common options}.
 
+Other options are:
+
[EMAIL PROTECTED] @samp
+
[EMAIL PROTECTED] [EMAIL PROTECTED]
+Columns will be delimited by @var{str} in output, rather than the default
+single TAB character.
+
+The delimiter @var{str} may not be empty.
+
[EMAIL PROTECTED] table
+
 Unlike some other comparison utilities, @command{comm} has an exit
 status that does not depend on the result of the comparison.
 Upon normal completion @command{comm} produces an exit code of zero.
diff --git a/src/comm.c b/src/comm.c
index cbda362..0a526cf 100644
--- a/src/comm.c
+++ b/src/comm.c
@@ -52,8 +52,22 @@ static bool only_file_2;
 /* If true, print lines that are found in both files. */
 static bool both;
 
+/* Output columns will be delimited with this string, which may be set
+   on the command-line with --output-delimiter=STR.  The default is a
+   single TAB character. */
+static char *delimiter;
+
+/* For long options that have no equivalent short option, use a
+   non-character as a pseudo short option, starting with CHAR_MAX + 1.  */
+enum
+{
+  OUTPUT_DELIMITER_OPTION = CHAR_MAX + 1
+};
+
+
 static struct option const long_options[] =
 {
+  {output-delimiter, required_argument, NULL, OUTPUT_DELIMITER_OPTION},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
   {NULL, 0, NULL, 0}
@@ -88,6 +102,10 @@ and column three contains lines common to both files.\n\
   -2  suppress lines unique to FILE2\n\
   -3  suppress lines that appear in both files\n\
 ), stdout);
+  fputs(_(\
+\n\
+  --output-delimiter=STR  separate columns with STR\n\
+), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
   fputs (VERSION_OPTION_DESCRIPTION, stdout);
   emit_bug_reporting_address ();
@@ -113,20 +131,20 @@ writeline (const struct linebuffer *line, FILE *stream, int class)
 case 2:
   if (!only_file_2)
 	return;
-  /* Print a TAB if we are printing lines from file 1.  */
+  /* Print a delimiter if we are printing lines from file 1.  */
   if (only_file_1)
-	putc ('\t', stream);
+	fputs (delimiter, stream);
   break;
 
 case 3:
   if (!both)
 	return;
-  /* Print a TAB if we are printing lines from file 1.  */
+  /* Print a delimiter if we are printing lines from file 1.  */
   if (only_file_1)
-	putc ('\t', stream);
-  /* Print a TAB if we are printing lines from file 2

Re: Bug#474436: coreutils: ls --time-style=locale no longer works

2008-04-06 Thread Bo Borgerson
Hi,

This can affect invocations of `ls' that don't include an explicit
`--time-style=locale', as well, since that is now the default (in
absence of a TIME_STYLE environment variable).

The only case I can see in the head revision of `ls.c' where the
default English time-style is used is when there is no `hard locale'
in effect for time (LC_TIME=C or POSIX).

In case it might be desirable to have the time-style default to N_("%b
%e  %Y") and  N_("%b %e %H:%M") (the `defaults') rather than
'posix-long-iso' ("%Y-%m-%d %H:%M" in all cases) for locales that
begin with 'en_', I've included a small patch that provides this
behavior.


$ LC_ALL=zh_CN.UTF-8 src/ls -l TODO
-rw-r--r-- 1 bo bo 6908 2008-03-31 21:09 TODO

$ LC_ALL=en_US.UTF-8 src/ls -l TODO
-rw-r--r-- 1 bo bo 6908 Mar 31 21:09 TODO

$ LC_ALL=C src/ls -l TODO
-rw-r--r-- 1 bo bo 6908 Mar 31 21:09 TODO


Thanks,

Bo
From d239d2fffcee499da5f3c5b4ddacc9caa258f3b0 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Sun, 6 Apr 2008 11:47:28 -0400
Subject: [PATCH] Use default English time-formats for `en_*' locales

src/ls.c: (decode_switches) only goto case_long_iso_time_style
from `locale' time-format setting when the untranslated locale
is not an `en_*' locale.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 src/ls.c |   20 +---
 1 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/src/ls.c b/src/ls.c
index e029fe0..e5fbd0d 100644
--- a/src/ls.c
+++ b/src/ls.c
@@ -1924,16 +1924,30 @@ decode_switches (int argc, char **argv)
 	if (hard_locale (LC_TIME))
 	  {
 		/* Ensure that the locale has translations for both
-		   formats.  If not, fall back on long-iso format.  */
+		   formats.  If not, fall back on either the default
+		   format for en_* locales or on long-iso format for
+		   non-en_* locales.  */
 		int i;
+		char *full_cutover[2] = {NULL, NULL};
+		char const *lc_time = setlocale (LC_TIME, NULL);
+		char const *lc_en_prefix = "en_";
+
+		for (i = 0; i < 2; i++)
 		  {
 		char const *locale_format =
 		  dcgettext (NULL, long_time_format[i], LC_TIME);
 		if (locale_format == long_time_format[i])
-		  goto case_long_iso_time_style;
-		long_time_format[i] = locale_format;
+		  break;
+		full_cutover[i] = (char *) locale_format;
+		  }
+		if (full_cutover[0] && full_cutover[1])
+		  {
+		long_time_format[0] = full_cutover[0];
+		long_time_format[1] = full_cutover[1];
 		  }
+		else if (strncmp (lc_time, lc_en_prefix,
+		strlen (lc_en_prefix)))
+		  goto case_long_iso_time_style;
 	  }
 	  }
 }
-- 
1.5.2.5

___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Bug#474436: coreutils: ls --time-style=locale no longer works

2008-04-06 Thread Bo Borgerson
On Sun, Apr 6, 2008 at 12:35 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  Thanks.
  That feels pretty kludgy.  I hope we end up with something cleaner.

Yeah, I suppose so.  Short of including `translations' for English,
though, what's a better option?

  BTW, such a patch would almost certainly require doc and test changes
  (a test addition, at least), as well as a NEWS entry.  If no one comes
  up with something better in the next few weeks, we can revisit this.

Okay, thanks.

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: Bug#474436: coreutils: ls --time-style=locale no longer works

2008-04-06 Thread Bo Borgerson
On Sun, Apr 6, 2008 at 1:25 PM, Michael Stone [EMAIL PROTECTED] wrote:
  Yeah, I suppose so.  Short of including `translations' for English,
  though, what's a better option?
 

  What's the downside to that?
  Mike Stone


Good question.  My thought was because there aren't any now, but I
guess that's not necessarily an adequate reason.  :)

I actually don't know what the options are here, really.  Is it
possible to maintain an English translation with only LC_TIME info?

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: [PATCH] Standardize some error messages.

2008-04-06 Thread Bo Borgerson
On Sun, Apr 6, 2008 at 1:10 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  Please use VC_LIST_EXCEPT, so that the checks look only
  at version-controlled files.  That also provides a method
  for exceptions: see the existing .x-sc_* files.

Ah, that's much nicer.  Thanks.

  s/Likwise/Likewise/ ;-)

Yikes. :)

Bo
From fda400023db314046b6792b1b51c242b2bb62996 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Fri, 4 Apr 2008 11:13:13 -0400
Subject: [PATCH] Standardize some error messages.

* maint.mk: (sc_error_message_warn_fatal, sc_error_message_uppercase):
(sc_error_message_period): Add automatic checks for non-standard error messages.
* .x-sc_error_message_uppercase: explicit exclusion for this check
* src/cp.c: Standardize some error messages.
* src/date.c: Likewise.
* src/dircolors.c: Likewise.
* src/du.c: Likewise.
* src/expr.c: Likewise.
* src/install.c: Likewise.
* src/join.c: Likewise.
* src/ln.c: Likewise.
* src/mv.c: Likewise.
* src/od.c: Likewise.
* src/pr.c: Likewise.
* src/split.c: Likewise.
* src/wc.c: Likewise.
* tests/du/files0-from: Expect new error message.
* tests/misc/wc-files0-from: Likewise.
* tests/misc/xstrtol: Likewise.
* lib/xmemxfrm.c: Likewise.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 .x-sc_error_message_uppercase |1 +
 lib/xmemxfrm.c|4 ++--
 maint.mk  |   23 +++
 src/cp.c  |2 +-
 src/date.c|4 ++--
 src/dircolors.c   |4 ++--
 src/du.c  |2 +-
 src/expr.c|4 ++--
 src/install.c |   10 +-
 src/join.c|2 +-
 src/ln.c  |2 +-
 src/mv.c  |2 +-
 src/od.c  |2 +-
 src/pr.c  |8 
 src/split.c   |2 +-
 src/wc.c  |2 +-
 tests/du/files0-from  |2 +-
 tests/misc/wc-files0-from |2 +-
 tests/misc/xstrtol|3 +--
 19 files changed, 52 insertions(+), 29 deletions(-)
 create mode 100644 .x-sc_error_message_uppercase

diff --git a/.x-sc_error_message_uppercase b/.x-sc_error_message_uppercase
new file mode 100644
index 000..2452230
--- /dev/null
+++ b/.x-sc_error_message_uppercase
@@ -0,0 +1 @@
+build-aux/cvsu
diff --git a/lib/xmemxfrm.c b/lib/xmemxfrm.c
index 039f978..84f5158 100644
--- a/lib/xmemxfrm.c
+++ b/lib/xmemxfrm.c
@@ -52,9 +52,9 @@ xmemxfrm (char *restrict dest, size_t destsize,
   if (errno)
 {
   error (0, errno, _(string transformation failed));
-  error (0, 0, _(Set LC_ALL='C' to work around the problem.));
+  error (0, 0, _(set LC_ALL='C' to work around the problem));
   error (exit_failure, 0,
-	 _(The untransformed string was %s.),
+	 _(the untransformed string was %s),
 	 quotearg_n_style_mem (0, locale_quoting_style, src, srcsize));
 }
 
diff --git a/maint.mk b/maint.mk
index 6933a3c..c8fe53e 100644
--- a/maint.mk
+++ b/maint.mk
@@ -143,6 +143,29 @@ sc_error_exit_success:
 	  { echo '$(ME): found error (EXIT_SUCCESS' 1>&2;		\
 	exit 1; } || :
 
+# `FATAL:' should be fully upper-cased in error messages
+# `WARNING:' should be fully upper-cased, or fully lower-cased
+sc_error_message_warn_fatal:
+	@grep -nEA2 '[^rp]error \(' $$($(VC_LIST_EXCEPT))		\
+	| grep -E 'Warning|Fatal|fatal' &&			\
+	  { echo '$(ME): use FATAL, WARNING or warning'	1>&2;		\
+	exit 1; } || :
+
+# Error messages should not start with a capital letter
+sc_error_message_uppercase:
+	@grep -nEA2 '[^rp]error \(' $$($(VC_LIST_EXCEPT))		\
+	| grep -E '[A-Z]'		\
+	| grep -vE 'FATAL|WARNING|Java|C#|PRIuMAX' &&		\
+	  { echo '$(ME): found capitalized error message' 1>&2;		\
+	exit 1; } || :
+
+# Error messages should not end with a period
+sc_error_message_period:
+	@grep -nEA2 '[^rp]error \(' $$($(VC_LIST_EXCEPT))		\
+	| grep -E '[^.]\.' &&	\
+	  { echo '$(ME): found error message ending in period' 1>&2;	\
+	exit 1; } || :
+
 sc_file_system:
 	@grep -ni 'file''system' $$($(VC_LIST_EXCEPT)) &&		\
 	  { echo '$(ME): found use of file''system;'			\
diff --git a/src/cp.c b/src/cp.c
index 3f95871..6dd2e7e 100644
--- a/src/cp.c
+++ b/src/cp.c
@@ -592,7 +592,7 @@ do_copy (int n_files, char **file, const char *target_directory,
 {
   if (target_directory)
 	error (EXIT_FAILURE, 0,
-	   _(Cannot combine --target-directory (-t) 
+	   _(cannot combine --target-directory (-t) 
 		 and --no-target-directory (-T)));
   if (2  n_files)
 	{
diff --git a/src/date.c b/src/date.c
index ba88eb8..f9973f9 100644
--- a/src/date.c
+++ b/src/date.c
@@ -444,8 +444,8 @@ main (int argc, char **argv)
 	{
 	  error (0, 0,
 		 _(the argument %s lacks a leading `+';\n
-		   When using an option to specify date(s), any non-option\n
-		   argument must be a format string beginning with `+'.),
+		   when using an option to specify date(s), any non

Re: [PATCH] add new sort option --xargs (-x)

2008-04-06 Thread Bo Borgerson
I had a capitalized error message in this patch.
I also didn't use a correct commit message format.

Thanks

Bo
From 9a37b547bcc892d1d5e2542c43d77b13497318db Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Thu, 3 Apr 2008 18:42:57 -0400
Subject: [PATCH] Add new sort option --files0-from=F

* src/sort.c: support new option
* tests/misc/sort-files0-from: test new option
* tests/misc/Makefile.am: indicate new test
* docs/coreutils.texti: explain new option
* NEWS: advertise new option

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS|5 ++
 doc/coreutils.texi  |   16 +++
 src/sort.c  |   57 ++-
 tests/misc/Makefile.am  |1 +
 tests/misc/sort-files0-from |  105 +++
 5 files changed, 182 insertions(+), 2 deletions(-)
 create mode 100755 tests/misc/sort-files0-from

diff --git a/NEWS b/NEWS
index e208b30..492c4e9 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,11 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  sort accepts a new option, --files0-from=F, that specifies a file
+  containing a null-separated list of files to sort.  This list is used
+  instead of filenames passed on the command-line to avoid problems with
+  maximum command-line (argv) length.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index ee7dbb2..5415394 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3667,6 +3667,22 @@ Terminate with an error if @var{prog} exits with nonzero status.
 Whitespace and the backslash character should not appear in
 @var{prog}; they are reserved for future use.
 
[EMAIL PROTECTED] [EMAIL PROTECTED]
[EMAIL PROTECTED] [EMAIL PROTECTED]
[EMAIL PROTECTED] including files from @command{du}
+Rather than processing files named on the command line, process those
+named in file @var{FILE}; each name is terminated by a null byte.
+This is useful when the list of file names is so long that it may exceed
+a command line length limitation.
+In such cases, running @command{sort} via @command{xargs} is undesirable
+because it splits the list into pieces and gives each piece to a different
+instance of @command{sort}, with the resulting output being multiple sets
+of sorted data concatenated together.
+One way to produce a list of null-byte-terminated file names is with @sc{gnu}
[EMAIL PROTECTED], using its @option{-print0} predicate.
+
+Do not specify any @var{FILE} on the command line when using this option.
+
 @item -k @var{pos1}[,@var{pos2}]
 @itemx [EMAIL PROTECTED],@var{pos2}]
 @opindex -k
diff --git a/src/sort.c b/src/sort.c
index 8b2eec5..c14a8d3 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -37,6 +37,7 @@
 #include posixver.h
 #include quote.h
 #include randread.h
+#include readtokens0.h
 #include stdio--.h
 #include stdlib--.h
 #include strnumcmp.h
@@ -304,8 +305,9 @@ usage (int status)
 {
   printf (_(\
 Usage: %s [OPTION]... [FILE]...\n\
+  or:  %s [OPTION]... --files0-from=F\n\
 ),
-	  program_name);
+	  program_name, program_name);
   fputs (_(\
 Write sorted concatenation of all FILE(s) to standard output.\n\
 \n\
@@ -342,6 +344,8 @@ Other options:\n\
   -C, --check=quiet, --check=silent  like -c, but do not report first bad line\n\
   --compress-program=PROG  compress temporaries with PROG;\n\
   decompress them with PROG -d\n\
+  --files0-from=F   read input from the files specified by\n\
+NUL-terminated names in file F\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
 ), stdout);
@@ -395,7 +399,8 @@ enum
   CHECK_OPTION = CHAR_MAX + 1,
   COMPRESS_PROGRAM_OPTION,
   RANDOM_SOURCE_OPTION,
-  SORT_OPTION
+  SORT_OPTION,
+  FILES0_FROM_OPTION
 };
 
 static char const short_options[] = -bcCdfgik:mMno:rRsS:t:T:uy:z;
@@ -407,6 +412,7 @@ static struct option const long_options[] =
   {compress-program, required_argument, NULL, COMPRESS_PROGRAM_OPTION},
   {dictionary-order, no_argument, NULL, 'd'},
   {ignore-case, no_argument, NULL, 'f'},
+  {files0-from, required_argument, NULL, FILES0_FROM_OPTION},
   {general-numeric-sort, no_argument, NULL, 'g'},
   {ignore-nonprinting, no_argument, NULL, 'i'},
   {key, required_argument, NULL, 'k'},
@@ -2752,6 +2758,8 @@ main (int argc, char **argv)
   bool posixly_correct = (getenv (POSIXLY_CORRECT) != NULL);
   bool obsolete_usage = (posix2_version ()  200112);
   char **files;
+  char *files_from = NULL;
+  struct Tokens tok;
   char const *outfile = NULL;
 
   initialize_main (argc, argv);
@@ -2955,6 +2963,10 @@ main (int argc, char **argv)
 	  compress_program = optarg;
 	  break;
 
+	case

Re: [PATCH] add new sort option --xargs (-x)

2008-04-06 Thread Bo Borgerson
On Sun, Apr 6, 2008 at 4:30 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  s/texti/texi/
  Please use capitals and periods in ChangeLogs. ;-)
  s/null/NUL/
  Split the string.  Otherwise, your addition pushes its length beyond
  a portability limit whose exact number I forget but it's around 500.
  No big deal, but it's good practice to alphabetize.
  This should have only 1 year number: 2008.

Thanks.  I'll try to catch this sort of thing myself in the future.


  Now, your doc change will be to add this line:

  @files0fromOption{sort,}

I added an argument to the macro that specifies output for sub-lists,
since it's 'a total' for wc and du, but 'sorted output' for sort.


  If the file is based on some other, please indicate that.
  That will help me as reviewer, and future maintainers.

  E.g., I put this comment in the wc test of --files0-from:

   # This file bears a striking resemblance to tests/du/files0-from.

Unfortunately I didn't just copy one of the relevant test files.  If
it would be easier for maintenance to have a more direct copy I can
redo it.  I added a line at the top indicating that this test script
covers a lot of the same ground as the wc-files0-from tests.

Thanks,

Bo
From 404e23daf6874e4d36e2048de569bcac057b7400 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Thu, 3 Apr 2008 18:42:57 -0400
Subject: [PATCH] Add new sort option --files0-from=F

* src/sort.c: Support new option.
* tests/misc/sort-files0-from: Test new option.
* tests/misc/Makefile.am: Indicate new test.
* docs/coreutils.texi: Explain new option.
* NEWS: Advertise new option.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS|5 ++
 doc/coreutils.texi  |   12 +++--
 src/sort.c  |   65 ---
 tests/misc/Makefile.am  |1 +
 tests/misc/sort-files0-from |  106 +++
 5 files changed, 178 insertions(+), 11 deletions(-)
 create mode 100755 tests/misc/sort-files0-from

diff --git a/NEWS b/NEWS
index e208b30..492c4e9 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,11 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  sort accepts a new option, --files0-from=F, that specifies a file
+  containing a null-separated list of files to sort.  This list is used
+  instead of filenames passed on the command-line to avoid problems with
+  maximum command-line (argv) length.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 5a6f2c3..9ac7bbf 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3074,7 +3074,7 @@ Print only the newline counts.
 @opindex --max-line-length
 Print only the maximum line lengths.
 
[EMAIL PROTECTED] files0fromOption{cmd,withTotalOption}
[EMAIL PROTECTED] files0fromOption{cmd,withTotalOption,subListOutput}
 @itemx [EMAIL PROTECTED]
 @opindex [EMAIL PROTECTED]
 @cindex including files from @command{\cmd\}
@@ -3084,13 +3084,13 @@ This is useful \withTotalOption\
 when the list of file names is so long that it may exceed a command line
 length limitation.
 In such cases, running @command{\cmd\} via @command{xargs} is undesirable
-because it splits the list into pieces and makes @command{\cmd\} print a
-total for each sublist rather than for the entire list.
+because it splits the list into pieces and makes @command{\cmd\} print
+\subListOutput\ for each sublist rather than for the entire list.
 One way to produce a list of null-byte-terminated file names is with @sc{gnu}
 @command{find}, using its @option{-print0} predicate.
 Do not specify any @var{FILE} on the command line when using this option.
 @end macro
[EMAIL PROTECTED],}
[EMAIL PROTECTED],,a total}
 
 For example, to find the length of the longest line in any @file{.c} or
 @file{.h} file in the current hierarchy, do this:
@@ -3670,6 +3670,8 @@ Terminate with an error if @var{prog} exits with nonzero status.
 Whitespace and the backslash character should not appear in
 @var{prog}; they are reserved for future use.
 
[EMAIL PROTECTED],,sorted output}
+
 @item -k @var{pos1}[,@var{pos2}]
 @itemx [EMAIL PROTECTED],@var{pos2}]
 @opindex -k
@@ -9757,7 +9759,7 @@ Does not affect other symbolic links.  This is helpful for finding
 out the disk usage of directories, such as @file{/usr/tmp}, which
 are often symbolic links.
 
[EMAIL PROTECTED], with the @option{--total} (@option{-c}) option}
[EMAIL PROTECTED], with the @option{--total} (@option{-c}) option,a total}
 
 @optHumanReadable
 
diff --git a/src/sort.c b/src/sort.c
index 8b2eec5..e67ce80 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -37,6 +37,7 @@
 #include posixver.h
 #include quote.h
 #include randread.h
+#include readtokens0.h
 #include stdio--.h
 #include stdlib--.h
 #include strnumcmp.h
@@ -304,8 +305,9 @@ usage (int status

Re: [PATCH] Standardize some error messages.

2008-04-04 Thread Bo Borgerson
On Fri, Apr 4, 2008 at 12:38 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  Have you received/sent the paper yet?

Yep, I sent it out this morning.

  So how about adding an sc_*** rule for this in maint.mk?

I added three so the failure message would reflect which of the
conventions appeared to have been violated.

I extended each check by two lines to catch things like:

error (EXIT_FAILURE, 0,
_(Some very long
   error message.));

I had to add explicit exclusions for some false positives.  There will
likely be some case in the future where the correct adjustment is to
the sc_*** rule rather than to the code it complained about.

Bo
From 41f5f14c10a47335fd1b19130201a3a57202e8c9 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Fri, 4 Apr 2008 11:13:13 -0400
Subject: [PATCH] Standardize some error messages.

maint.mk: Add automatic checks for non-standard error messages.
src/cp.c: Standardize some error messages.
src/date.c: Likewise.
src/dircolors.c: Likewise.
src/du.c: Likewise.
src/expr.c: Likewise.
src/install.c: Likewise.
src/join.c: Likewise.
src/ln.c: Likewise.
src/mv.c: Likewise.
src/od.c: Likewise.
src/pr.c: Likewise.
src/split.c: Likewise.
src/wc.c: Likewise.
tests/du/files0-from: Expect new error message.
tests/misc/wc-files0-from: Likwise.
tests/misc/xstrtol: Likwise.
lib/xmemxfrm.c: Likwise.

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 lib/xmemxfrm.c|4 ++--
 maint.mk  |   24 
 src/cp.c  |2 +-
 src/date.c|4 ++--
 src/dircolors.c   |4 ++--
 src/du.c  |2 +-
 src/expr.c|4 ++--
 src/install.c |   10 +-
 src/join.c|2 +-
 src/ln.c  |2 +-
 src/mv.c  |2 +-
 src/od.c  |2 +-
 src/pr.c  |8 
 src/split.c   |2 +-
 src/wc.c  |2 +-
 tests/du/files0-from  |2 +-
 tests/misc/wc-files0-from |2 +-
 tests/misc/xstrtol|3 +--
 18 files changed, 52 insertions(+), 29 deletions(-)

diff --git a/lib/xmemxfrm.c b/lib/xmemxfrm.c
index 039f978..84f5158 100644
--- a/lib/xmemxfrm.c
+++ b/lib/xmemxfrm.c
@@ -52,9 +52,9 @@ xmemxfrm (char *restrict dest, size_t destsize,
   if (errno)
 {
   error (0, errno, _(string transformation failed));
-  error (0, 0, _(Set LC_ALL='C' to work around the problem.));
+  error (0, 0, _(set LC_ALL='C' to work around the problem));
   error (exit_failure, 0,
-	 _(The untransformed string was %s.),
+	 _(the untransformed string was %s),
 	 quotearg_n_style_mem (0, locale_quoting_style, src, srcsize));
 }
 
diff --git a/maint.mk b/maint.mk
index 6933a3c..1921fd7 100644
--- a/maint.mk
+++ b/maint.mk
@@ -143,6 +143,30 @@ sc_error_exit_success:
 	  { echo '$(ME): found error (EXIT_SUCCESS' 1>&2;		\
 	exit 1; } || :
 
+# `FATAL:' should be fully upper-cased in error messages
+# `WARNING:' should be fully upper-cased, or fully lower-cased
+sc_error_message_warn_fatal:
+	@grep -nEA2 '[^rp]error \(' $$(find -type f -name '*.[chly]')	\
+	| grep -E 'Warning|Fatal|fatal' &&			\
+	  { echo '$(ME): use FATAL, WARNING or warning'	1>&2;		\
+	exit 1; } || :
+
+# Error messages should not start with a capital letter
+sc_error_message_uppercase:
+	@grep -nEA2 '[^rp]error \(' $$(find -type f -name '*.[chly]')	\
+	| grep -E '[A-Z]'		\
+	| grep -vE 'FATAL|WARNING'\
+	| grep -vE 'Java|C#|strerror.c|w32spawn.h|PRIuMAX' &&	\
+	  { echo '$(ME): found capitalized error message' 1>&2;		\
+	exit 1; } || :
+
+# Error messages should not end with a period
+sc_error_message_period:
+	@grep -nEA2 '[^rp]error \(' $$(find -type f -name '*.[chly]')	\
+	| grep -E '[^.]\.'	| grep -vE '\.' &&			\
+	  { echo '$(ME): found error message ending in period' 1>&2;	\
+	exit 1; } || :
+
 sc_file_system:
 	@grep -ni 'file''system' $$($(VC_LIST_EXCEPT)) &&		\
 	  { echo '$(ME): found use of file''system;'			\
diff --git a/src/cp.c b/src/cp.c
index 3f95871..6dd2e7e 100644
--- a/src/cp.c
+++ b/src/cp.c
@@ -592,7 +592,7 @@ do_copy (int n_files, char **file, const char *target_directory,
 {
   if (target_directory)
 	error (EXIT_FAILURE, 0,
-	   _(Cannot combine --target-directory (-t) 
+	   _(cannot combine --target-directory (-t) 
 		 and --no-target-directory (-T)));
   if (2  n_files)
 	{
diff --git a/src/date.c b/src/date.c
index ba88eb8..f9973f9 100644
--- a/src/date.c
+++ b/src/date.c
@@ -444,8 +444,8 @@ main (int argc, char **argv)
 	{
 	  error (0, 0,
 		 _(the argument %s lacks a leading `+';\n
-		   When using an option to specify date(s), any non-option\n
-		   argument must be a format string beginning with `+'.),
+		   when using an option to specify date(s), any non-option\n
+		   argument must be a format string beginning with `+'),
 		 quote (argv[optind]));
 	  usage (EXIT_FAILURE

Re: [PATCH] Standardize some error messages.

2008-04-04 Thread Bo Borgerson
On Fri, Apr 4, 2008 at 12:38 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  Please remove the #FIXME comment, now that you've fixed it ;-)

Ha, tunnel vision :)

Thanks,

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


[PATCH] add new sort option --xargs (-x)

2008-04-03 Thread Bo Borgerson
Hi,

The number of inputs that can be handled by the sort utility is
currently limited by what may be passed in argv.

Due to the nature of sort, this limit can't be stepped around with
`xargs' as it could be with some other utilities.

My solution to this locally has been to add an option to the sort
utility, --xargs, which causes sort to treat STDIN as a source of
newline-separated arguments that supplement those on the command-line
(please see attached patch).

Consider the following example with an input directory containing
16384 input files each consisting of a single line with a single
character, one of 'a', 'b' or 'c':

$ src/sort -mu input/*
bash: src/sort: Argument list too long

$ find input/ -type f | xargs src/sort -mu
a
b
c
a
b
c

$ find input/ -type f | src/sort -mu --xargs
a
b
c

Is this an option that might be worth including in a future release?

Thanks,

Bo
From 8568528acd4b5eea20d06136aaaf7b18a36f03c0 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Thu, 3 Apr 2008 12:05:55 -0400
Subject: [PATCH] add new sort option --xargs (-x)

* src/sort.c: if --xargs option, add input to FILES
* tests/misc/sort-xargs: test new option
* tests/misc/Makefile.am: add new test file
* doc/coreutils.texi: describe new option
* NEWS: advertise new option

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS   |4 +++
 doc/coreutils.texi |9 
 src/sort.c |   43 ++
 tests/misc/Makefile.am |1 +
 tests/misc/sort-xargs  |   49 
 5 files changed, 106 insertions(+), 0 deletions(-)
 create mode 100755 tests/misc/sort-xargs

diff --git a/NEWS b/NEWS
index e208b30..36d67f6 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,10 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  sort accepts a new option, --xargs (-x), that causes input to be treated
+  as a newline-separated list of files to supplement those passed on the
+  command-line.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index ee7dbb2..eb3d41e 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3803,6 +3803,15 @@ For example, @code{sort -n -u} inspects only the value of the initial
 numeric string when checking for uniqueness, whereas @code{sort -n |
 uniq} inspects the entire line.  @xref{uniq invocation}.
 
[EMAIL PROTECTED] -x
[EMAIL PROTECTED] --xargs
[EMAIL PROTECTED] -x
[EMAIL PROTECTED] --xargs
[EMAIL PROTECTED] xargs standard input arguments
+Treat the input as a set of newline-separated arguments to supplement
+those on command-line. Useful if the list of input files to sort exceeds
+the command-line argument list size limit.
+
 @item -z
 @itemx --zero-terminated
 @opindex -z
diff --git a/src/sort.c b/src/sort.c
index 8b2eec5..183f56c 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -121,6 +121,9 @@ static bool hard_LC_COLLATE;
 static bool hard_LC_TIME;
 #endif
 
+/* If true, treat STDIN as a source of files */
+static bool xargs = false;
+
 #define NONZERO(x) ((x) != 0)
 
 /* The kind of blanks for '-b' to skip in various options. */
@@ -222,6 +225,10 @@ static struct month monthtab[] =
   {SEP, 9}
 };
 
+/* The maximum number of input files allowed for in an invocation
+   FIXME: This should be set more intelligently */
+#define NFILES_MAX 1048576
+
 /* During the merge phase, the number of files to merge at once. */
 #define NMERGE 16
 
@@ -358,6 +365,9 @@ Other options:\n\
   without -c, output only the first of an equal run\n\
 ), DEFAULT_TMPDIR);
   fputs (_(\
+  -x, --xargs   treat STDIN as a source of newline-separated\n\
+arguments to supplement arguments on the\n\
+command-line\n\
   -z, --zero-terminated end lines with 0 byte, not newline\n\
 ), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
@@ -423,6 +433,7 @@ static struct option const long_options[] =
   {field-separator, required_argument, NULL, 't'},
   {temporary-directory, required_argument, NULL, 'T'},
   {unique, no_argument, NULL, 'u'},
+  {xargs, no_argument, NULL, 'x'},
   {zero-terminated, no_argument, NULL, 'z'},
   {GETOPT_HELP_OPTION_DECL},
   {GETOPT_VERSION_OPTION_DECL},
@@ -3086,6 +3097,10 @@ main (int argc, char **argv)
 	}
 	  break;
 
+	case 'x':
+	  xargs = true;
+	  break;
+
 	case 'z':
 	  eolchar = 0;
 	  break;
@@ -3099,6 +3114,34 @@ main (int argc, char **argv)
 	}
 }
 
+  if (xargs)
+{
+  size_t xargc = argc;
+  char input_line[LINE_MAX];
+  int i, length;
+
+  while (fgets (input_line, LINE_MAX, stdin))
+	{
+
+	  if (nfiles >= NFILES_MAX)
+	    error (SORT_FAILURE, 0, _("Too many input files"));
+
+	  if (nfiles >= xargc)
+	files

Re: [PATCH] add new sort option --xargs (-x)

2008-04-03 Thread Bo Borgerson
On Thu, Apr 3, 2008 at 12:18 PM, Jim Meyering [EMAIL PROTECTED] wrote:
  I suppose you have a real application where this is useful?
  If so, please describe it -- motivation/justification helps ;-)


Just a merge with a lot of source files.  It's the same motivation as
the nmerge patch.  I've actually got another patch as well that I'll
clean up and offer soon that allows a merge of > nmerge files to be
divided among sub-processes whose output is then merged by the parent,
which provides a performance benefit (if you've got the resources for
it).  I've got yet another patch that adds an option to open
compressed files through a decompression program, so I don't have to
set up fifos for a merge of gzipped files.

I've been maintaining these patches against sort for a while,
re-patching whenever a new release is published.  I figured it would
be worth a shot seeing if I could get any them incorporated into the
package upstream. :)

I'm also just generally interested in helping out with maintenance.  I
think I'm probably less experienced than most of the regular
contributors but I can help with simple stuff and when it comes to
coding I think doing is the best way of learning.

  I think so.  du and wc each have the --files0-from=F option, added for
  the same reason.  Any such option in sort should have the same name and
  be implemented in the same way.

That seems reasonable enough.  Looks like readtokens0 does most of the
work for me. :)
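
For reference, the wc/du pattern being reused is small; from memory of the
gnulib readtokens0 interface it looks roughly like this (a sketch, not the
actual sort.c hunk):

  #include <stdio.h>
  #include <stdlib.h>
  #include "readtokens0.h"

  /* Collect the NUL-terminated names in STREAM into *FILES/*NFILES, the
     way du and wc handle --files0-from.  Real code would diagnose a read
     failure and reject extra command-line operands.  */
  static void
  read_files0 (FILE *stream, char ***files, size_t *nfiles)
  {
    static struct Tokens tok;     /* token storage must outlive this call */

    readtokens0_init (&tok);
    if (! readtokens0 (stream, &tok))
      abort ();                   /* placeholder for a proper diagnostic */
    *files = tok.tok;
    *nfiles = tok.n_tok;
  }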

How would you feel about also including a --filesn-from=F option to
support pipelines like the one in my example where the input is
newline separated?

  [haven't forgotten about --nmerge.  will get to it eventually ]

Thanks. :)

Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: [PATCH] add new sort option --xargs (-x)

2008-04-03 Thread Bo Borgerson
On Thu, Apr 3, 2008 at 2:05 PM, Jim Meyering [EMAIL PROTECTED] wrote:
   Sounds interesting.
   I suppose it can work with an arbitrary decompressor?
 
   Note this relatively new option:
 
--compress-program=PROG  compress temporaries with PROG;
decompress them with PROG -d
 

 Yep.

 My current convention is:

  --magic-open=PROG[,PROG]...

 So if you want to merge a gzip'd file with a bzip2'd file you can use
 --magic-open=gzip,bzip2 (or just --magic-open, which enables all).

 For each regular file it checks magic and if it looks like a type that
 can be handled by one of PROG it opens a PROG -d -c -f (the -f is just
 in case the magic was a false-positive).

 Of course this re-introduces findprog into sort (for find_in_path),
 which may not be desirable.  A convention more similar to that used
 for --compress-program would eliminate this (and the magic-checking),
 but limit a given merge to files compressed with a single program.
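
 For the record, the magic checks themselves are tiny: gzip streams begin
 with the bytes 0x1f 0x8b and bzip2 streams with "BZh".  A detection step
 along those lines, sketched here rather than taken from any of the patches:

   #include <stdio.h>
   #include <string.h>

   /* Return the decompressor to pipe FILE through, or NULL when its first
      bytes do not match a known compressed format.  */
   static char const *
   decompressor_for (char const *file)
   {
     unsigned char buf[3];
     size_t n;
     FILE *fp = fopen (file, "rb");

     if (!fp)
       return NULL;
     n = fread (buf, 1, sizeof buf, fp);
     fclose (fp);

     if (n >= 2 && buf[0] == 0x1f && buf[1] == 0x8b)
       return "gzip";                      /* gzip magic bytes */
     if (n >= 3 && memcmp (buf, "BZh", 3) == 0)
       return "bzip2";                     /* bzip2 magic bytes */
     return NULL;
   }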

 Bo


___
Bug-coreutils mailing list
Bug-coreutils@gnu.org
http://lists.gnu.org/mailman/listinfo/bug-coreutils


Re: [PATCH] add new sort option --xargs (-x)

2008-04-03 Thread Bo Borgerson
Okay, here's a version that supports argument input in --files0-from=F style.

Bo
From 3108b79cbbb5d6c2fe3c2f8d5037f166cb0f1ca6 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Thu, 3 Apr 2008 18:42:57 -0400
Subject: [PATCH] Add new sort option --files0-from=F

src/sort.c: support new option
tests/misc/sort-files0-from: test new option
tests/misc/Makefile.am: indicate new test
docs/coreutils.texti: explain new option
NEWS: advertise new option

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS|5 ++
 doc/coreutils.texi  |   16 +++
 src/sort.c  |   58 +++-
 tests/misc/Makefile.am  |1 +
 tests/misc/sort-files0-from |  105 +++
 5 files changed, 183 insertions(+), 2 deletions(-)
 create mode 100755 tests/misc/sort-files0-from

diff --git a/NEWS b/NEWS
index e208b30..492c4e9 100644
--- a/NEWS
+++ b/NEWS
@@ -55,6 +55,11 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  sort accepts a new option, --files0-from=F, that specifies a file
+  containing a null-separated list of files to sort.  This list is used
+  instead of filenames passed on the command-line to avoid problems with
+  maximum command-line (argv) length.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index ee7dbb2..5415394 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3667,6 +3667,22 @@ Terminate with an error if @var{prog} exits with nonzero status.
 Whitespace and the backslash character should not appear in
 @var{prog}; they are reserved for future use.
 
[EMAIL PROTECTED] [EMAIL PROTECTED]
[EMAIL PROTECTED] [EMAIL PROTECTED]
[EMAIL PROTECTED] including files from @command{du}
+Rather than processing files named on the command line, process those
+named in file @var{FILE}; each name is terminated by a null byte.
+This is useful when the list of file names is so long that it may exceed
+a command line length limitation.
+In such cases, running @command{sort} via @command{xargs} is undesirable
+because it splits the list into pieces and gives each piece to a different
+instance of @command{sort}, with the resulting output being multiple sets
+of sorted data concatenated together.
+One way to produce a list of null-byte-terminated file names is with @sc{gnu}
[EMAIL PROTECTED], using its @option{-print0} predicate.
+
+Do not specify any @var{FILE} on the command line when using this option.
+
 @item -k @var{pos1}[,@var{pos2}]
 @itemx --key=@var{pos1}[,@var{pos2}]
 @opindex -k
diff --git a/src/sort.c b/src/sort.c
index 8b2eec5..8342399 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -37,6 +37,7 @@
 #include "posixver.h"
 #include "quote.h"
 #include "randread.h"
+#include "readtokens0.h"
 #include "stdio--.h"
 #include "stdlib--.h"
 #include "strnumcmp.h"
@@ -304,8 +305,9 @@ usage (int status)
 {
   printf (_("\
 Usage: %s [OPTION]... [FILE]...\n\
+  or:  %s [OPTION]... --files0-from=F\n\
 "),
-	  program_name);
+	  program_name, program_name);
   fputs (_("\
 Write sorted concatenation of all FILE(s) to standard output.\n\
 \n\
@@ -342,6 +344,9 @@ Other options:\n\
   -C, --check=quiet, --check=silent  like -c, but do not report first bad line\n\
   --compress-program=PROG  compress temporaries with PROG;\n\
   decompress them with PROG -d\n\
+  --files0-from=F       read input from the files specified by\n\
+                          NUL-terminated names in file F\n\
+  -L, --max-line-length  print the length of the longest line\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
 "), stdout);
@@ -395,7 +400,8 @@ enum
   CHECK_OPTION = CHAR_MAX + 1,
   COMPRESS_PROGRAM_OPTION,
   RANDOM_SOURCE_OPTION,
-  SORT_OPTION
+  SORT_OPTION,
+  FILES0_FROM_OPTION
 };
 
 static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uy:z";
@@ -407,6 +413,7 @@ static struct option const long_options[] =
   {"compress-program", required_argument, NULL, COMPRESS_PROGRAM_OPTION},
   {"dictionary-order", no_argument, NULL, 'd'},
   {"ignore-case", no_argument, NULL, 'f'},
+  {"files0-from", required_argument, NULL, FILES0_FROM_OPTION},
   {"general-numeric-sort", no_argument, NULL, 'g'},
   {"ignore-nonprinting", no_argument, NULL, 'i'},
   {"key", required_argument, NULL, 'k'},
@@ -2752,6 +2759,8 @@ main (int argc, char **argv)
   bool posixly_correct = (getenv ("POSIXLY_CORRECT") != NULL);
   bool obsolete_usage = (posix2_version () < 200112);
   char **files;
+  char *files_from = NULL;
+  struct Tokens tok;
   char const *outfile = NULL;
 
   initialize_main (argc, argv);
@@ -2955,6 +2964,10 @@ main (int argc, char **argv)
 	  compress_program = optarg;
 	  break

Re: [PATCH] Add timeout utility

2008-04-02 Thread Bo Borgerson
Pádraig Brady [EMAIL PROTECTED] wrote:
 Subject: [PATCH] Add new program: timeout

Great idea for a tool!

Have you considered an alternate run mode where it could operate as a
filter and time out on 'inactivity' of the pipeline?

If, for instance, I have a pipeline that processes a lot of data and
could legitimately take anywhere from a minute to an hour, it's
difficult to set an absolute timeout that doesn't risk chopping off
the end of the stream.  Then, with such a large timeout, my pipeline
could stall in the first ten seconds and I wouldn't know for a long
time.

If, on the other hand, I could say that there shouldn't ever be thirty
seconds without a buffer's worth of data coming through, then I could
set the timeout very low and know soon after a blockage formed.

For example:

$ sort -m inputs/* | timeout --inactivity 1m program_prone_to_stalling

Here timeout would open a pipe and dup2 the read end to the child's
STDIN_FILENO before exec'ing.
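
A bare-bones sketch of that plumbing, purely for illustration (error
handling and the SIGALRM/alarm side are left out; only the pipe, dup2
and relay loop are shown):

-- C sketch --

#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int
main (int argc, char **argv)
{
  int pipefds[2];
  pid_t pid;

  if (argc < 2 || pipe (pipefds) != 0)
    return EXIT_FAILURE;

  pid = fork ();
  if (pid < 0)
    return EXIT_FAILURE;

  if (pid == 0)
    {
      /* Child: make the read end of the pipe our stdin, then exec.  */
      close (pipefds[1]);
      dup2 (pipefds[0], STDIN_FILENO);
      close (pipefds[0]);
      execvp (argv[1], argv + 1);
      _exit (127);
    }

  /* Parent: relay our stdin into the pipe.  In the real tool each
     successful read would also reset alarm (timeout).  */
  close (pipefds[0]);
  {
    char buf[BUFSIZ];
    ssize_t n;
    while ((n = read (STDIN_FILENO, buf, sizeof buf)) > 0)
      if (write (pipefds[1], buf, n) != n)
        break;
  }
  close (pipefds[1]);
  wait (NULL);
  return EXIT_SUCCESS;
}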

If this sounds like a worthwhile extension I'd be happy to get to work
on it and submit a patch once your initial version has settled in.

Thanks,

Bo




Re: [PATCH] Add timeout utility

2008-04-02 Thread Bo Borgerson
On Wed, Apr 2, 2008 at 10:20 AM, Pádraig Brady [EMAIL PROTECTED] wrote:
  It will always go through though as the kernel will buffer it.

Yes, that introduces some fuzz, but I think the principle remains
viable -- the kernel will only buffer so much.

Consider the following using a timeout.c modified with the attached
patch, and a small Perl program (below) that hangs after 10 seconds:

$ time yes | src/timeout -i 2s ./write_then_hang 10 > /dev/null

real0m11.777s
user0m0.656s
sys 0m0.068s

Bo

-- Perl --

#!/usr/bin/perl -w

my $n = shift @ARGV || 10;

my $s = time;

print scalar(<STDIN>) while (time - $s < $n);

while(1){ }
diff --git a/src/timeout.c b/src/timeout.c
index 7c15f1d..1f44c27 100644
--- a/src/timeout.c
+++ b/src/timeout.c
@@ -75,6 +75,9 @@
 # define WTERMSIG(s) ((s) & 0x7F)
 #endif
 
+/* Size of atomic reads. */
+#define BUFFER_SIZE (16 * 1024)
+
 static int timed_out;
 static int term_signal = SIGTERM;  /* same default as kill command.  */
 static int monitored_pid;
@@ -83,6 +86,7 @@ static char *program_name;
 
 static struct option const long_options[] = {
   {signal, required_argument, NULL, 's'},
+  {inactivity, no_argument, NULL, 'i'},
   {NULL, 0, NULL, 0}
 };
 
@@ -144,8 +148,10 @@ Mandatory arguments to long options are mandatory for short options too.\n\
   -s, --signal=SIGNAL\n\
specify the signal to be sent on timeout.\n\
SIGNAL may be a name like `HUP' or a number.\n\
-   See `kill -l` for a list of signals\n"), stdout);
-
+   See `kill -l` for a list of signals\n\
+  -i, --inactivity\n\
+   act as a filter and only timeout if NUMBER\n\
+   seconds have passed without data flowing.\n"), stdout);
   fputs (HELP_OPTION_DESCRIPTION, stdout);
   fputs (VERSION_OPTION_DESCRIPTION, stdout);
   fputs (_("\n\
@@ -250,6 +256,8 @@ main (int argc, char **argv)
 {
   unsigned long timeout;
   char signame[SIG2STR_MAX];
+  bool inactivity = false;
+  int pipefds[2];
   int c;
   char *ep;
 
@@ -265,7 +273,7 @@ main (int argc, char **argv)
   parse_long_options (argc, argv, PROGRAM_NAME, PACKAGE_NAME, VERSION,
   usage, AUTHORS, (char const *) NULL);
 
-  while ((c = getopt_long (argc, argv, "+s:", long_options, NULL)) != -1)
+  while ((c = getopt_long (argc, argv, "+s:i", long_options, NULL)) != -1)
 {
   switch (c)
 {
@@ -274,6 +282,9 @@ main (int argc, char **argv)
   if (term_signal == -1)
 usage (ECANCELED);
   break;
+case 'i':
+  inactivity = true;
+  break;
 default:
   usage (ECANCELED);
   break;
@@ -315,6 +326,12 @@ main (int argc, char **argv)
   signal (SIGTTIN, SIG_IGN);/* don't sTop if background child needs tty.  */
   signal (SIGTTOU, SIG_IGN);/* don't sTop if background child needs tty.  */
 
+  if (inactivity)
+{
+  if (pipe(pipefds) == -1)
+perror ("pipe");
+}
+
   monitored_pid = fork ();
   if (monitored_pid == -1)
 {
@@ -325,6 +342,13 @@ main (int argc, char **argv)
 {   /* child */
   int exit_status;
 
+  if (inactivity)
+{
+  close (pipefds[1]);
+  close (STDIN_FILENO);
+  dup2 (pipefds[0], STDIN_FILENO);
+}
+
   /* exec doesn't reset SIG_IGN - SIG_DFL.  */
   signal (SIGTTIN, SIG_DFL);
   signal (SIGTTOU, SIG_DFL);
@@ -342,6 +366,21 @@ main (int argc, char **argv)
 
   alarm ((unsigned int) timeout);
 
+  if (inactivity)
+{
+  int bytes_read;
+  char buf[BUFFER_SIZE];
+  close (pipefds[0]);
+  close (STDOUT_FILENO);
+  dup2 (pipefds[1], STDOUT_FILENO);
+  while ((bytes_read = read(STDIN_FILENO, buf, BUFFER_SIZE)) > 0)
+{
+  if ((write(STDOUT_FILENO, buf, bytes_read)) == -1)
+perror ("write");
+  alarm ((unsigned int) timeout);
+}
+}
+
   /* We're just waiting for a single process here, so wait() suffices.
* Note the signal() calls above on linux and BSD at least, essentially
* call the lower level sigaction() with the SA_RESTART flag set, which


Re: Modifiable NMERGE in sort

2008-04-01 Thread Bo Borgerson
On Tue, Apr 1, 2008 at 4:11 AM, Jim Meyering [EMAIL PROTECTED] wrote:
  So please see if you can find an option name that does not start with
  --merge, and, ideally, that doesn't influence interpretation of any
  other option abbreviations.


How about --batch-size?  It doesn't have an initial substring that
coincides with more than a single character of an existing longopt, and
there aren't really any other 'batches' in sort aside from these merge
batches (with the possible exception of input buffers, which are covered
by the --buffer-size option).  In my personal version I've been using
--nmerge, but I wasn't sure if that was user-friendly enough for a more
general audience.
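
(To illustrate the abbreviation point: GNU getopt_long accepts any
unambiguous prefix of a long option, so with both --batch-size and
--buffer-size present only the bare '--b' is ambiguous.  Toy program
below, not sort itself:)

-- C sketch --

#include <getopt.h>
#include <stdio.h>

int
main (int argc, char **argv)
{
  static struct option const longopts[] =
  {
    {"batch-size", required_argument, NULL, 'B'},
    {"buffer-size", required_argument, NULL, 'S'},
    {NULL, 0, NULL, 0}
  };
  int c;

  /* ./demo --ba=32  -> matched --batch-size with argument 32
     ./demo --bu=1M  -> matched --buffer-size with argument 1M
     ./demo --b=32   -> rejected as ambiguous                  */
  while ((c = getopt_long (argc, argv, "", longopts, NULL)) != -1)
    if (c == 'B' || c == 'S')
      printf ("matched --%s with argument %s\n",
              c == 'B' ? "batch-size" : "buffer-size", optarg);
  return 0;
}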

I've made nmerge an unsigned int and added a check to ensure that it's
not allowed to get bigger than (SIZE_MAX / MIN_MERGE_BUFFER_SIZE).
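
In outline the overflow guard is just this (a sketch with a stand-in
constant, not the code from the patch):

-- C sketch --

#include <stdbool.h>
#include <stdint.h>

/* Stand-in for sort.c's (2 + sizeof (struct line)).  */
enum { MIN_MERGE_BUFFER_SIZE = 34 };

/* An nmerge value is usable iff it is at least 2 and small enough that
   nmerge * MIN_MERGE_BUFFER_SIZE (the minimum sort size) cannot
   overflow a size_t.  */
static bool
nmerge_in_range (uintmax_t n)
{
  return 2 <= n && n <= SIZE_MAX / MIN_MERGE_BUFFER_SIZE;
}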

I also realized that I had introduced tabs into tests/misc/sort-merge
where there previously hadn't been any, so I replaced those with
spaces.

Bo
From d1c257dc8c0bd6892ef252e153b72f273879c267 Mon Sep 17 00:00:00 2001
From: Bo Borgerson [EMAIL PROTECTED]
Date: Mon, 31 Mar 2008 16:58:21 -0400
Subject: [PATCH] sort: added --batch-size=NMERGE option.

* src/sort.c: Replace constant NMERGE with static unsigned int nmerge.
Validate and apply nmerge command-line settings. Replace some (now
variable-length) arrays with pointers to xnmalloc'd storage.
* tests/misc/sort-merge: Test new option
* doc/coreutils.texi: Describe new option
* NEWS: Advertise new option

Signed-off-by: Bo Borgerson [EMAIL PROTECTED]
---
 NEWS  |4 ++
 doc/coreutils.texi|   16 ++
 src/sort.c|   78 
 tests/misc/sort-merge |   31 +--
 4 files changed, 113 insertions(+), 16 deletions(-)

diff --git a/NEWS b/NEWS
index c05e0ad..c8b727c 100644
--- a/NEWS
+++ b/NEWS
@@ -50,6 +50,10 @@ GNU coreutils NEWS-*- outline -*-
   options --general-numeric-sort/-g, --month-sort/-M, --numeric-sort/-n
   and --random-sort/-R, resp.
 
+  sort accepts a new option --batch-size=NMERGE, where NMERGE
+  represents the maximum number of inputs that will be merged at once.
+  When more than NMERGE inputs are present temporary files are used.
+
 ** Improvements
 
   id and groups work around an AFS-related bug whereby those programs
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index ee7dbb2..eef8940 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -3690,6 +3690,22 @@ multiple fields.
 Example:  To sort on the second field, use @option{--key=2,2}
 (@option{-k 2,2}).  See below for more examples.
 
+@item --batch-size=@var{nmerge}
+@opindex --batch-size
+@cindex number of inputs to merge, nmerge
+Merge at most @var{nmerge} inputs at once.
+
+If more than @var{nmerge} inputs are to be merged, then temporary files
+will be used.
+
+A large value of @var{nmerge} may improve merge performance and decrease
+temporary storage utilization at the expense of increased memory usage
+and I/O.  Conversely a small value of @var{nmerge} may reduce memory
+requirements and I/O at the expense of temporary storage consumption and
+merge performance.
+
+The value of @var{nmerge} must be at least 2.
+
 @item -o @var{output-file}
 @itemx --output=@var{output-file}
 @opindex -o
diff --git a/src/sort.c b/src/sort.c
index 8b2eec5..4dc7d1b 100644
--- a/src/sort.c
+++ b/src/sort.c
@@ -223,13 +223,13 @@ static struct month monthtab[] =
 };
 
 /* During the merge phase, the number of files to merge at once. */
-#define NMERGE 16
+#define NMERGE_DEFAULT 16
 
 /* Minimum size for a merge or check buffer.  */
 #define MIN_MERGE_BUFFER_SIZE (2 + sizeof (struct line))
 
 /* Minimum sort size; the code might not work with smaller sizes.  */
-#define MIN_SORT_SIZE (NMERGE * MIN_MERGE_BUFFER_SIZE)
+#define MIN_SORT_SIZE (nmerge * MIN_MERGE_BUFFER_SIZE)
 
 /* The number of bytes needed for a merge or check buffer, which can
function relatively efficiently even if it holds only one line.  If
@@ -281,6 +281,10 @@ static struct keyfield *keylist;
 /* Program used to (de)compress temp files.  Must accept -d.  */
 static char const *compress_program;
 
+/* Maximum number of files to merge in one go.  If more than this
+   number are present, temp files will be used. */
+static unsigned int nmerge = NMERGE_DEFAULT;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -344,6 +348,8 @@ Other options:\n\
   decompress them with PROG -d\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
+  --batch-size=NMERGE   merge at most NMERGE inputs at once;\n\
+                        for more use temp files\n\
 "), stdout);
   fputs (_("\
   -o, --output=FILE write result to FILE instead of standard output\n\
@@ -395,7

Re: Modifiable NMERGE in sort

2008-03-31 Thread Bo Borgerson
On Mon, Mar 31, 2008 at 2:29 AM, Paul Eggert [EMAIL PROTECTED] wrote:
  Alas, that patch assumes C99, and we can't assume that quite yet.
   Also, it mishandles nmerge values that are too large (you'll get
   core dumps or worse on many hosts).  That being said, it might be
   worth adding an option like that (it's a bit specialized, but it's a
   big performance win in some cases).
 

 Ah, yes I see.  Thanks for the feedback.  Please allow me to take
 another stab at this.

 For the first issue: I've replaced the variable-length arrays in
 mergefps with pointers to xnmalloc'd storage.

 For the second: I've introduced a small dedicated function for
 validating and applying changes to nmerge.  In addition to checking
 bounds I also added a check for sort_size to ensure that it's still at
 least MIN_SORT_SIZE after an nmerge adjustment.

 Thanks again,

 Bo
--- coreutils-6.10/src/sort.c	2008-03-29 18:55:54.0 -0400
+++ coreutils-6.10-modified/src/sort.c	2008-03-31 09:23:36.0 -0400
@@ -223,13 +223,13 @@
 };
 
 /* During the merge phase, the number of files to merge at once. */
-#define NMERGE 16
+#define NMERGE_DEFAULT 16
 
 /* Minimum size for a merge or check buffer.  */
 #define MIN_MERGE_BUFFER_SIZE (2 + sizeof (struct line))
 
 /* Minimum sort size; the code might not work with smaller sizes.  */
-#define MIN_SORT_SIZE (NMERGE * MIN_MERGE_BUFFER_SIZE)
+#define MIN_SORT_SIZE (nmerge * MIN_MERGE_BUFFER_SIZE)
 
 /* The number of bytes needed for a merge or check buffer, which can
function relatively efficiently even if it holds only one line.  If
@@ -281,6 +281,10 @@
 /* Program used to (de)compress temp files.  Must accept -d.  */
 static char const *compress_program;
 
+/* Maximum number of files to merge in one go.  If more than this
+   number are present, temp files will be used. */
+static int nmerge = NMERGE_DEFAULT;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -341,6 +345,7 @@
   decompress them with PROG -d\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
+  --merge-batch-size=N  merge at most this many inputs at once\n\
 "), stdout);
   fputs (_("\
   -o, --output=FILE write result to FILE instead of standard output\n\
@@ -391,7 +396,8 @@
 {
   CHECK_OPTION = CHAR_MAX + 1,
   COMPRESS_PROGRAM_OPTION,
-  RANDOM_SOURCE_OPTION
+  RANDOM_SOURCE_OPTION,
+  NMERGE_OPTION
 };
 
 static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uy:z";
@@ -414,6 +420,7 @@
   {"output", required_argument, NULL, 'o'},
   {"reverse", no_argument, NULL, 'r'},
   {"stable", no_argument, NULL, 's'},
+  {"merge-batch-size", required_argument, NULL, NMERGE_OPTION},
   {"buffer-size", required_argument, NULL, 'S'},
   {"field-separator", required_argument, NULL, 't'},
   {"temporary-directory", required_argument, NULL, 'T'},
@@ -1030,6 +1037,34 @@
 #endif
 }
 
+static void
+specify_nmerge (int oi, char c, char const *s)
+{
+  uintmax_t n;
+  enum strtol_error e = xstrtoumax (s, NULL, 10, &n, NULL);
+
+  if (e == LONGINT_OK)
+{
+  nmerge = n;
+  if (nmerge == n)
+	{
+	  if (nmerge >= 2)
+	{
+	  /* Need to re-check that we meet the minimum
+		 requirement for memory usage with the new,
+		 potentially larger, nmerge */
+	  sort_size = MAX (sort_size, MIN_SORT_SIZE);
+	  return;
+	}
+	  e = LONGINT_INVALID;
+	}
+  else
+	e = LONGINT_OVERFLOW;
+}
+
+  xstrtol_fatal (e, oi, c, long_options, s);
+}
+
 /* Specify the amount of main memory to use when sorting.  */
 static void
 specify_sort_size (int oi, char c, char const *s)
@@ -2014,15 +2049,20 @@
 mergefps (struct sortfile *files, size_t ntemps, size_t nfiles,
 	  FILE *ofp, char const *output_file)
 {
-  FILE *fps[NMERGE];		/* Input streams for each file.  */
-  struct buffer buffer[NMERGE];	/* Input buffers for each file. */
+  FILE **fps = xnmalloc(nmerge, sizeof *fps);
+/* Input streams for each file.  */
+  struct buffer *buffer = xnmalloc(nmerge, sizeof *buffer); 
+/* Input buffers for each file. */
   struct line saved;		/* Saved line storage for unique check. */
   struct line const *savedline = NULL;
 /* saved if there is a saved line. */
   size_t savealloc = 0;		/* Size allocated for the saved line. */
-  struct line const *cur[NMERGE]; /* Current line in each line table. */
-  struct line const *base[NMERGE]; /* Base of each line table.  */
-  size_t ord[NMERGE];		/* Table representing a permutation of fps,
+  struct line const **cur = xnmalloc(nmerge, sizeof *cur); 
+/* Current line in each line table. */
+  struct line const **base = xnmalloc(nmerge, sizeof *base); 
+/* Base of each line table.  */
+  size_t *ord = xnmalloc(nmerge, sizeof ord);		
+/* Table representing a permutation of fps,
    such that cur[ord[0]] is the smallest line
    and will be next 

Re: Modifiable NMERGE in sort

2008-03-31 Thread Bo Borgerson
On Mon, Mar 31, 2008 at 11:05 AM, Pádraig Brady [EMAIL PROTECTED] wrote:
 Jim Meyering wrote:
  
   One more suggestion ;-)
  
   Add tests

  you beat me to it :)

  Also I would mention that to amend a patch do:

   edit your files
   git commit --amend -e -a
   git format-patch --stdout --signoff HEAD~1 > your-branch.diff

  Also you might add the following to useful git commands?
   git branch -D your-branch #nuke a branch

  Pádraig.


Wow, thanks for all the information guys!  I'll get right to work on
tests and documentation.

Bo




Modifiable NMERGE in sort

2008-03-29 Thread Bo Borgerson
Hi,

I've found at times that it's useful to merge more than NMERGE
(currently 16) inputs without using temp files.

I've modified my local version of sort to allow this number to be
overridden on the command-line (please see enclosed patch).

Is this an option that might be worth including in a future release?

Thanks,

Bo
--- coreutils-6.10/src/sort.c	2007-11-25 08:23:31.0 -0500
+++ coreutils-6.10-modified/src/sort.c	2008-03-29 16:06:45.0 -0400
@@ -223,13 +223,13 @@
 };
 
 /* During the merge phase, the number of files to merge at once. */
-#define NMERGE 16
+#define NMERGE_DEFAULT 16
 
 /* Minimum size for a merge or check buffer.  */
 #define MIN_MERGE_BUFFER_SIZE (2 + sizeof (struct line))
 
 /* Minimum sort size; the code might not work with smaller sizes.  */
-#define MIN_SORT_SIZE (NMERGE * MIN_MERGE_BUFFER_SIZE)
+#define MIN_SORT_SIZE (nmerge * MIN_MERGE_BUFFER_SIZE)
 
 /* The number of bytes needed for a merge or check buffer, which can
function relatively efficiently even if it holds only one line.  If
@@ -281,6 +281,10 @@
 /* Program used to (de)compress temp files.  Must accept -d.  */
 static char const *compress_program;
 
+/* Maximum number of files to merge in one go.  If more than this
+   number are present, temp files will be used. */
+static int nmerge = NMERGE_DEFAULT;
+
 static void sortlines_temp (struct line *, size_t, struct line *);
 
 /* Report MESSAGE for FILE, then clean up and exit.
@@ -341,6 +345,7 @@
   decompress them with PROG -d\n\
   -k, --key=POS1[,POS2] start a key at POS1, end it at POS2 (origin 1)\n\
   -m, --merge   merge already sorted files; do not sort\n\
+  --merge-batch-size=N  merge at most this many inputs at once\n\
 "), stdout);
   fputs (_("\
   -o, --output=FILE write result to FILE instead of standard output\n\
@@ -391,7 +396,8 @@
 {
   CHECK_OPTION = CHAR_MAX + 1,
   COMPRESS_PROGRAM_OPTION,
-  RANDOM_SOURCE_OPTION
+  RANDOM_SOURCE_OPTION,
+  NMERGE_OPTION
 };
 
 static char const short_options[] = "-bcCdfgik:mMno:rRsS:t:T:uy:z";
@@ -414,6 +420,7 @@
   {"output", required_argument, NULL, 'o'},
   {"reverse", no_argument, NULL, 'r'},
   {"stable", no_argument, NULL, 's'},
+  {"merge-batch-size", required_argument, NULL, NMERGE_OPTION},
   {"buffer-size", required_argument, NULL, 'S'},
   {"field-separator", required_argument, NULL, 't'},
   {"temporary-directory", required_argument, NULL, 'T'},
@@ -2014,15 +2021,15 @@
 mergefps (struct sortfile *files, size_t ntemps, size_t nfiles,
 	  FILE *ofp, char const *output_file)
 {
-  FILE *fps[NMERGE];		/* Input streams for each file.  */
-  struct buffer buffer[NMERGE];	/* Input buffers for each file. */
+  FILE *fps[nmerge];		/* Input streams for each file.  */
+  struct buffer buffer[nmerge];	/* Input buffers for each file. */
   struct line saved;		/* Saved line storage for unique check. */
   struct line const *savedline = NULL;
 /* saved if there is a saved line. */
   size_t savealloc = 0;		/* Size allocated for the saved line. */
-  struct line const *cur[NMERGE]; /* Current line in each line table. */
-  struct line const *base[NMERGE]; /* Base of each line table.  */
-  size_t ord[NMERGE];		/* Table representing a permutation of fps,
+  struct line const *cur[nmerge]; /* Current line in each line table. */
+  struct line const *base[nmerge]; /* Base of each line table.  */
+  size_t ord[nmerge];		/* Table representing a permutation of fps,
    such that cur[ord[0]] is the smallest line
    and will be next output. */
   size_t i;
@@ -2382,7 +2389,7 @@
 merge (struct sortfile *files, size_t ntemps, size_t nfiles,
char const *output_file)
 {
-  while (NMERGE < nfiles)
+  while (nmerge < nfiles)
 {
   /* Number of input files processed so far.  */
   size_t in;
@@ -2390,33 +2397,33 @@
   /* Number of output files generated so far.  */
   size_t out;
 
-  /* nfiles % NMERGE; this counts input files that are left over
+  /* nfiles % nmerge; this counts input files that are left over
 	 after all full-sized merges have been done.  */
   size_t remainder;
 
   /* Number of easily-available slots at the next loop iteration.  */
   size_t cheap_slots;
 
-  /* Do as many NMERGE-size merges as possible.  */
-  for (out = in = 0; out < nfiles / NMERGE; out++, in += NMERGE)
+  /* Do as many nmerge-size merges as possible.  */
+  for (out = in = 0; out < nfiles / nmerge; out++, in += nmerge)
 	{
 	  FILE *tfp;
 	  pid_t pid;
	  char *temp = create_temp (&tfp, &pid);
-	  size_t nt = MIN (ntemps, NMERGE);
+	  size_t nt = MIN (ntemps, nmerge);
 	  ntemps -= nt;
-	  mergefps (&files[in], nt, NMERGE, tfp, temp);
+	  mergefps (&files[in], nt, nmerge, tfp, temp);
 	  files[out].name = temp;
 	  files[out].pid = pid;
 	}
 
   remainder = nfiles - in;
-  cheap_slots = NMERGE - out % NMERGE;
+  cheap_slots = nmerge - out % nmerge;
 
   if (cheap_slots < remainder)
 	{
 	  /* So