Re: [coreutils] join feature: auto-format

2010-10-07 Thread Pádraig Brady
On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:
 Hello,

 I'd like to (re)suggest a feature for the join program - the ability to 
 automatically build an output format line (similar but easier than using 
 -o).

 I've previously mentioned it here (but got no favorable responses):
 http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html

 Several people have been using this option for a year now (on our local 
 servers), so I thought I might try to suggest it again.

 The full patch is attached, and also available here:
 http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch

 Here's the common use case:

 Given two tabular files, with a common key at first column, and many numeric 
 (or other) values on other columns, the user wants to join them together 
 easily.
 One requirement is that empty/missing values should be populated with 00.

 File 1
 ==
 bar 10 13 15 16 11 32
 foo 10 10 11 12 13 14


 File 2
 ==
 bar 99 91 90 93 91 93
 baz 90 91 99 96 97 95


 Desired joined output
 ==
 bar 10 13 15 16 11 32 99 91 90 93 91 93
 baz 00 00 00 00 00 00 90 91 99 96 97 95
 foo 10 10 11 12 13 14 00 00 00 00 00 00

 There is no technical problem in achieving this, the parameters would be:
 -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7

 But building the -o parameter is cumbersome, and error-prone (imaging 
 files with dozens of columns, which is very common in my case).

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.
 
 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:
 
 -o {all (default), padded, FORMAT}
 
 where padded is the functionality you're suggesting?

Thinking more about it, we mightn't need any new options at all.
Currently -e is redundant if -o is not specified.
So how about changing that so that if -e is specified
we operate as above by auto inserting empty fields?
Also I wouldn't base on the number of fields in the first line,
instead auto padding to the biggest number of fields
on the current lines under consideration.

cheers,
Pádraig.



[coreutils] [PATCH] split: fix reporting of read errors

2010-10-07 Thread Pádraig Brady
Ok to push this for the imminent release?

cheers,
Pádraig.
From ac6837d1d76e01126b98fc97df6226fc26ea365f Mon Sep 17 00:00:00 2001
From: =?utf-8?q?P=C3=A1draig=20Brady?= p...@draigbrady.com
Date: Thu, 7 Oct 2010 13:12:36 +0100
Subject: [PATCH] split: fix reporting of read errors

The bug was introduced with commit 23f6d41f, 19-02-2003.

* src/split.c (bytes_split, lines_split, line_bytes_split):
Correctly check the return from full_read().
* NEWS: Mention the fix.
---
 NEWS|3 +++
 src/split.c |6 +++---
 2 files changed, 6 insertions(+), 3 deletions(-)

diff --git a/NEWS b/NEWS
index 719ac9c..7136c2a 100644
--- a/NEWS
+++ b/NEWS
@@ -15,6 +15,9 @@ GNU coreutils NEWS-*- outline -*-
   found to be part of a directory cycle.  Before, du would issue a
   NOTIFY YOUR SYSTEM MANAGER diagnostic and fail.
 
+  split now diagnoses read errors rather than silently exiting.
+  [bug introduced in coreutils-4.5.8]
+
   tac would perform a double-free when given an input line longer than 16KiB.
   [bug introduced in coreutils-8.3]
 
diff --git a/src/split.c b/src/split.c
index 5be7207..90e2c76 100644
--- a/src/split.c
+++ b/src/split.c
@@ -229,7 +229,7 @@ bytes_split (uintmax_t n_bytes, char *buf, size_t bufsize)
   do
 {
   n_read = full_read (STDIN_FILENO, buf, bufsize);
-  if (n_read == SAFE_READ_ERROR)
+  if (n_read  bufsize  errno)
 error (EXIT_FAILURE, errno, %s, infile);
   bp_out = buf;
   to_read = n_read;
@@ -273,7 +273,7 @@ lines_split (uintmax_t n_lines, char *buf, size_t bufsize)
   do
 {
   n_read = full_read (STDIN_FILENO, buf, bufsize);
-  if (n_read == SAFE_READ_ERROR)
+  if (n_read  bufsize  errno)
 error (EXIT_FAILURE, errno, %s, infile);
   bp = bp_out = buf;
   eob = bp + n_read;
@@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes)
   /* Fill up the full buffer size from the input file.  */
 
   n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - n_buffered);
-  if (n_read == SAFE_READ_ERROR)
+  if (n_read  (n_bytes - n_buffered)  errno)
 error (EXIT_FAILURE, errno, %s, infile);
 
   n_buffered += n_read;
-- 
1.6.2.5



Re: [coreutils] join feature: auto-format

2010-10-07 Thread Assaf Gordon
Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?
 
 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.

My concern is the principle of least surprise - if there are existing 
scripts/programs that specify -e without -o (doesn't make sense, but still 
possible) - this change will alter their behavior.

Also, implying/forcing 'auto-format' when -e is used without -o might be a 
bit confusing.
I prefer to have the user explicitly ask for auto-format - at least he/she will 
know how the output would look like.

That being said,
I can send a new patch with one of the new method (implicit autoformat or -o 
padded) - which one is preferred ?

Thanks,
 -gordon




Re: [coreutils] join feature: auto-format

2010-10-07 Thread Pádraig Brady
On 07/10/10 18:43, Assaf Gordon wrote:
 Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?

 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.
 
 My concern is the principle of least surprise - if there are existing 
 scripts/programs that specify -e without -o (doesn't make sense, but 
 still possible) - this change will alter their behavior.
 
 Also, implying/forcing 'auto-format' when -e is used without -o might be 
 a bit confusing.

Well seeing as -e without -o currently does nothing,
I don't think we need to worry too much about changing that behavior.
Also to me, specifying -e EMPTY implicitly means I want
fields missing from one of the files replaced with EMPTY.

Note POSIX is more explicit, and describes our current operation:

-e EMPTY
  Replace empty output fields in the list selected by -o with EMPTY

So changing that would be an extension to POSIX.
But I still think it makes sense.
I'll prepare a patch soon, to do as I describe above,
unless there are objections.

cheers,
Pádraig.



Re: [coreutils] [PATCH] split: fix reporting of read errors

2010-10-07 Thread Pádraig Brady
On 07/10/10 13:38, Jim Meyering wrote:
 Pádraig Brady wrote:
 Ok to push this for the imminent release?
 Subject: [PATCH] split: fix reporting of read errors

 The bug was introduced with commit 23f6d41f, 19-02-2003.

 * src/split.c (bytes_split, lines_split, line_bytes_split):
 Correctly check the return from full_read().
 * NEWS: Mention the fix.
 ...
 @@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes)
/* Fill up the full buffer size from the input file.  */

n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - 
 n_buffered);
 -  if (n_read == SAFE_READ_ERROR)
 +  if (n_read  (n_bytes - n_buffered)  errno)
 
 Yes, thanks!
 Good catch.  How did you find it?

I was testing various failure modes for the split --number patch
and noticed split was silent when passed a directory.

 However, I'd prefer to avoid parentheses like those in the final chunk,
 as well as the duplicated expression -- there are too many n_-prefixed
 names in the vicinity.  What do you think of this instead?
 
 size_t n =  n_bytes - n_buffered;
 n_read = full_read (STDIN_FILENO, buf + n_buffered, n);
 if (n_read  n  errno)

Yes I refactored a little, added a test and pushed.

cheers,
Pádraig.



Re: [coreutils] [PATCH] split: fix reporting of read errors

2010-10-07 Thread Jim Meyering
Pádraig Brady wrote:
 Ok to push this for the imminent release?
 Subject: [PATCH] split: fix reporting of read errors

 The bug was introduced with commit 23f6d41f, 19-02-2003.

 * src/split.c (bytes_split, lines_split, line_bytes_split):
 Correctly check the return from full_read().
 * NEWS: Mention the fix.
...
 @@ -325,7 +325,7 @@ line_bytes_split (size_t n_bytes)
/* Fill up the full buffer size from the input file.  */

n_read = full_read (STDIN_FILENO, buf + n_buffered, n_bytes - 
 n_buffered);
 -  if (n_read == SAFE_READ_ERROR)
 +  if (n_read  (n_bytes - n_buffered)  errno)

Yes, thanks!
Good catch.  How did you find it?

However, I'd prefer to avoid parentheses like those in the final chunk,
as well as the duplicated expression -- there are too many n_-prefixed
names in the vicinity.  What do you think of this instead?

size_t n =  n_bytes - n_buffered;
n_read = full_read (STDIN_FILENO, buf + n_buffered, n);
if (n_read  n  errno)



[coreutils] snapshot deferred by another day: bison vs. openbsd

2010-10-07 Thread Jim Meyering
coreutils didn't compile on OpenBSD 4.7:

http://thread.gmane.org/gmane.comp.parsers.bison.bugs/3418

I'll push the gnulib fix early tomorrow.