Re: [coreutils] join feature: auto-format
On 07/10/10 01:03, Pádraig Brady wrote: On 06/10/10 21:41, Assaf Gordon wrote: Hello, I'd like to (re)suggest a feature for the join program - the ability to automatically build an output format line (similar but easier than using -o). I've previously mentioned it here (but got no favorable responses): http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html Several people have been using this option for a year now (on our local servers), so I thought I might try to suggest it again. The full patch is attached, and also available here: http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch Here's the common use case: Given two tabular files, with a common key at first column, and many numeric (or other) values on other columns, the user wants to join them together easily. One requirement is that empty/missing values should be populated with 00. File 1 == bar 10 13 15 16 11 32 foo 10 10 11 12 13 14 File 2 == bar 99 91 90 93 91 93 baz 90 91 99 96 97 95 Desired joined output == bar 10 13 15 16 11 32 99 91 90 93 91 93 baz 00 00 00 00 00 00 90 91 99 96 97 95 foo 10 10 11 12 13 14 00 00 00 00 00 00 There is no technical problem in achieving this, the parameters would be: -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7 But building the -o parameter is cumbersome, and error-prone (imaging files with dozens of columns, which is very common in my case). The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files. Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as: -o {all (default), padded, FORMAT} where padded is the functionality you're suggesting? Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration. cheers, Pádraig.
[coreutils] [PATCH] split: fix reporting of read errors
Ok to push this for the imminent release? cheers, Pádraig. From ac6837d1d76e01126b98fc97df6226fc26ea365f Mon Sep 17 00:00:00 2001 From: =?utf-8?q?P=C3=A1draig=20Brady?= p...@draigbrady.com Date: Thu, 7 Oct 2010 13:12:36 +0100 Subject: [PATCH] split: fix reporting of read errors The bug was introduced with commit 23f6d41f, 19-02-2003. * src/split.c (bytes_split, lines_split, line_bytes_split): Correctly check the return from full_read(). * NEWS: Mention the fix. --- NEWS|3 +++ src/split.c |6 +++--- 2 files changed, 6 insertions(+), 3 deletions(-) diff --git a/NEWS b/NEWS index 719ac9c..7136c2a 100644 --- a/NEWS +++ b/NEWS @@ -15,6 +15,9 @@ GNU coreutils NEWS-*- outline -*- found to be part of a directory cycle. Before, du would issue a NOTIFY YOUR SYSTEM MANAGER diagnostic and fail. + split now diagnoses read errors rather than silently exiting. + [bug introduced in coreutils-4.5.8] + tac would perform a double-free when given an input line longer than 16KiB. [bug introduced in coreutils-8.3] diff --git a/src/split.c b/src/split.c index 5be7207..90e2c76 100644 --- a/src/split.c +++ b/src/split.c @@ -229,7 +229,7 @@ bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize) do { n_read = full_read (STDIN_FILENO, buf, bufsize); - if (n_read == SAFE_READ_ERROR) + if (n_read bufsize errno) error (EXIT_FAILURE, errno, %s, infile); bp_out = buf; to_read = n_read; @@ -273,7 +273,7 @@ lines_split (uintmax_t n_lines, char *buf, size_t bufsize) do { n_read = full_read (STDIN_FILENO, buf, bufsize); - if (n_read == SAFE_READ_ERROR) + if (n_read bufsize errno) error (EXIT_FAILURE, errno, %s, infile); bp = bp_out = buf; eob = bp + n_read; @@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes) /* Fill up the full buffer size from the input file. */ n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - n_buffered); - if (n_read == SAFE_READ_ERROR) + if (n_read (n_bytes - n_buffered) errno) error (EXIT_FAILURE, errno, %s, infile); n_buffered += n_read; -- 1.6.2.5
Re: [coreutils] join feature: auto-format
Pádraig Brady wrote, On 10/07/2010 06:22 AM: On 07/10/10 01:03, Pádraig Brady wrote: On 06/10/10 21:41, Assaf Gordon wrote: The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files. Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as: -o {all (default), padded, FORMAT} where padded is the functionality you're suggesting? Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration. My concern is the principle of least surprise - if there are existing scripts/programs that specify -e without -o (doesn't make sense, but still possible) - this change will alter their behavior. Also, implying/forcing 'auto-format' when -e is used without -o might be a bit confusing. I prefer to have the user explicitly ask for auto-format - at least he/she will know how the output would look like. That being said, I can send a new patch with one of the new method (implicit autoformat or -o padded) - which one is preferred ? Thanks, -gordon
Re: [coreutils] join feature: auto-format
On 07/10/10 18:43, Assaf Gordon wrote: Pádraig Brady wrote, On 10/07/2010 06:22 AM: On 07/10/10 01:03, Pádraig Brady wrote: On 06/10/10 21:41, Assaf Gordon wrote: The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files. Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as: -o {all (default), padded, FORMAT} where padded is the functionality you're suggesting? Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration. My concern is the principle of least surprise - if there are existing scripts/programs that specify -e without -o (doesn't make sense, but still possible) - this change will alter their behavior. Also, implying/forcing 'auto-format' when -e is used without -o might be a bit confusing. Well seeing as -e without -o currently does nothing, I don't think we need to worry too much about changing that behavior. Also to me, specifying -e EMPTY implicitly means I want fields missing from one of the files replaced with EMPTY. Note POSIX is more explicit, and describes our current operation: -e EMPTY Replace empty output fields in the list selected by -o with EMPTY So changing that would be an extension to POSIX. But I still think it makes sense. I'll prepare a patch soon, to do as I describe above, unless there are objections. cheers, Pádraig.
Re: [coreutils] [PATCH] split: fix reporting of read errors
On 07/10/10 13:38, Jim Meyering wrote: Pádraig Brady wrote: Ok to push this for the imminent release? Subject: [PATCH] split: fix reporting of read errors The bug was introduced with commit 23f6d41f, 19-02-2003. * src/split.c (bytes_split, lines_split, line_bytes_split): Correctly check the return from full_read(). * NEWS: Mention the fix. ... @@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes) /* Fill up the full buffer size from the input file. */ n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - n_buffered); - if (n_read == SAFE_READ_ERROR) + if (n_read (n_bytes - n_buffered) errno) Yes, thanks! Good catch. How did you find it? I was testing various failure modes for the split --number patch and noticed split was silent when passed a directory. However, I'd prefer to avoid parentheses like those in the final chunk, as well as the duplicated expression -- there are too many n_-prefixed names in the vicinity. What do you think of this instead? size_t n = n_bytes - n_buffered; n_read = full_read (STDIN_FILENO, buf + n_buffered, n); if (n_read n errno) Yes I refactored a little, added a test and pushed. cheers, Pádraig.
Re: [coreutils] [PATCH] split: fix reporting of read errors
Pádraig Brady wrote: Ok to push this for the imminent release? Subject: [PATCH] split: fix reporting of read errors The bug was introduced with commit 23f6d41f, 19-02-2003. * src/split.c (bytes_split, lines_split, line_bytes_split): Correctly check the return from full_read(). * NEWS: Mention the fix. ... @@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes) /* Fill up the full buffer size from the input file. */ n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - n_buffered); - if (n_read == SAFE_READ_ERROR) + if (n_read (n_bytes - n_buffered) errno) Yes, thanks! Good catch. How did you find it? However, I'd prefer to avoid parentheses like those in the final chunk, as well as the duplicated expression -- there are too many n_-prefixed names in the vicinity. What do you think of this instead? size_t n = n_bytes - n_buffered; n_read = full_read (STDIN_FILENO, buf + n_buffered, n); if (n_read n errno)
[coreutils] snapshot deferred by another day: bison vs. openbsd
coreutils didn't compile on OpenBSD 4.7: http://thread.gmane.org/gmane.comp.parsers.bison.bugs/3418 I'll push the gnulib fix early tomorrow.