Re: [coreutils] join feature: auto-format

2011-01-12 Thread Assaf Gordon
Pádraig Brady wrote, On 01/11/2011 07:35 AM:
 
 Spending another few minutes on this, I realized
 that we should not be trying to homogenize the number
 of fields from each file, but rather the fields used
 for a particular file in each line. The only sensible
 basis for that is the first line as previously suggested.
 
 The interface would be a little different for that.
 I was thinking of:
 
   -o 'header'  Infer the format from the first line of each file
 
I second the idea of using the first line as the basis for the auto-formatting,
but have reservations about the wording: '-o header' somewhat implies that the
first line has to be an actual header line (with column names or similar), 
while it can just be the first line of actual data if the file doesn't have a 
header line.

Something like '-o auto' might be less confusing.

Just my 2 cents,
 -gordon




Re: [coreutils] join feature: auto-format

2011-01-12 Thread Jim Meyering
Pádraig Brady wrote:

 On 12/01/11 18:04, Assaf Gordon wrote:
 Pádraig Brady wrote, On 01/11/2011 07:35 AM:

 Spending another few minutes on this, I realized
 that we should not be trying to homogenize the number
 of fields from each file, but rather the fields used
 for a particular file in each line. The only sensible
 basis for that is the first line as previously suggested.

 The interface would be a little different for that.
 I was thinking of:

   -o 'header'  Infer the format from the first line of each file

 I second the idea of using the first line as the basis for the 
 auto-formatting,
 but have reservations about the wording: '-o header' somewhat implies
 that the first line has to be an actual header line (with column
 names or similar), while it can just be the first line of actual
 data if the file doesn't have a header line.

 Something like '-o auto' might be less confusing.

 But less descriptive.
 I'll go with auto as I can't think of anything better.

-o auto is indeed more apt.
Thanks for the suggestion.
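
For readers following along, here is a minimal sketch of how the agreed
spelling would be used. It assumes a GNU join that accepts '-o auto'; the
temporary directory and file names are only for illustration. The data is
the two-file example from the original proposal, and the first line of each
file is ordinary data rather than a header:

```shell
#!/bin/sh
# Sketch only: assumes a GNU join that supports '-o auto'.
export LC_ALL=C                 # keep the sort order join expects predictable
tmp=$(mktemp -d)                # illustrative scratch directory

printf '%s\n' 'bar 10 13 15 16 11 32' 'foo 10 10 11 12 13 14' > "$tmp/file1"
printf '%s\n' 'bar 99 91 90 93 91 93' 'baz 90 91 99 96 97 95' > "$tmp/file2"

# The field count of each file's first line decides how many fields are
# printed per line; -e supplies the filler for unpaired lines.
auto_join() {
  join -a1 -a2 -e 00 -o auto "$tmp/file1" "$tmp/file2"
}

auto_join
```

Under that assumption the output should match the "Desired joined output"
listing from the original proposal, with 00 filling the unpaired lines.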



Re: [coreutils] join feature: auto-format

2011-01-07 Thread Pádraig Brady
On 06/01/11 12:05, Pádraig Brady wrote:
 On 07/10/10 19:25, Pádraig Brady wrote:
 On 07/10/10 18:43, Assaf Gordon wrote:
 Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?

 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base it on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.

 My concern is the principle of least surprise - if there are existing 
 scripts/programs that specify -e without -o (doesn't make sense, but 
 still possible) - this change will alter their behavior.

 Also, implying/forcing 'auto-format' when -e is used without -o might 
 be a bit confusing.

 Well seeing as -e without -o currently does nothing,
 I don't think we need to worry too much about changing that behavior.
 Also to me, specifying -e EMPTY implicitly means I want
 fields missing from one of the files replaced with EMPTY.

 Note POSIX is more explicit, and describes our current operation:

 -e EMPTY
   Replace empty output fields in the list selected by -o with EMPTY

 So changing that would be an extension to POSIX.
 But I still think it makes sense.
 I'll prepare a patch soon, to do as I describe above,
 unless there are objections.
 
 The attached patch changes `join` (departing from what other platforms do) so that...
 
 `join -e` will automatically pad missing fields from one file
 so that the same number of fields are output from each file.
 Previously -e was only used for missing fields specified with -o or -j.
 
 With this change join now does:
 
 $ cat file1
 a 1 2
 b 1
 d 1 2
 
 $ cat file2
 a 3 4
 b 3 4
 c 3 4
 
 $ join -a1 -a2 -1 1 -2 1 -e. file1 file2
 a 1 2 3 4
 b 1 . 3 4
 c . . 3 4
 d 1 2 . .
 
 $ join -a1 -a2 -1 1 -2 4 -e. file1 file2
 . . . . a 3 4
 . . . . b 3 4
 . . . . c 3 4
 a 1 2 . .
 b 1 .
 d 1 2 . .
 
 $ join -a1 -a2 -1 4 -2 1 -e. file1 file2
 . a 1 2 . . .
 . b 1 . .
 . d 1 2 . . .
 a . . 3 4
 b . . 3 4
 c . . 3 4
 
 $ join -a1 -a2 -1 4 -2 4 -e. file1 file2
 . a 1 2 a 3 4
 . a 1 2 b 3 4
 . a 1 2 c 3 4
 . b 1 . a 3 4
 . b 1 . b 3 4
 . b 1 . c 3 4
 . d 1 2 a 3 4
 . d 1 2 b 3 4
 . d 1 2 c 3 4
 
 While -e without -o was previously a no-op, and so could safely be extended
 IMHO, this will also change the behavior when both -e and -j are specified.
 Previously, if -j 1 was specified and that field was missing,
 then -e would be used in its place, rather than the empty string.
 This still does that, but also does the padding.
 Without the -j issue I'd be 80:20 for just extending -e to auto pad,
 but given -j I'm 50:50.  The alternative is to select this with
 say '-o padded', but that's less discoverable, and complicates
 the interface somewhat.

Considering this more, I think it's safer to auto pad only
when '-o padded' is specified. I notice the plan9 `join` man page
has an example that uses -e '' to explicitly specify the empty string as filler,
which would have triggered our auto pad if we left it as above.

cheers,
Pádraig.
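
For comparison, the first padded listing above (join on field 1 of both
files) can also be produced with an explicit -o list and no change to -e.
This is a minimal sketch using only the long-standing -a, -e and -o options;
the scratch paths are illustrative:

```shell
#!/bin/sh
# Sketch only: reproduces the first padded example with an explicit -o list.
export LC_ALL=C
tmp=$(mktemp -d)                # illustrative scratch directory

printf '%s\n' 'a 1 2' 'b 1' 'd 1 2'   > "$tmp/file1"
printf '%s\n' 'a 3 4' 'b 3 4' 'c 3 4' > "$tmp/file2"

# 0 is the join field; 1.N and 2.N name field N of file 1 and file 2.
# Any listed field that is missing on a given line is replaced by '.'.
padded_join() {
  join -a1 -a2 -e. -o 0,1.2,1.3,2.2,2.3 "$tmp/file1" "$tmp/file2"
}

padded_join
```

The catch, and the motivation for the thread, is that the -o list has to be
written out by hand and kept in step with the number of columns.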



Re: [coreutils] join feature: auto-format

2011-01-07 Thread Jim Meyering
Pádraig Brady wrote:
...
 While -e without -o was previously a no-op, and so could safely be extended
 IMHO, this will also change the behavior when both -e and -j are specified.
 Previously, if -j 1 was specified and that field was missing,
 then -e would be used in its place, rather than the empty string.
 This still does that, but also does the padding.
 Without the -j issue I'd be 80:20 for just extending -e to auto pad,
 but given -j I'm 50:50.  The alternative is to select this with
 say '-o padded', but that's less discoverable, and complicates
 the interface somewhat.

 Considering this more, I think it's safer to auto pad only
 when '-o padded' is specified. I notice the plan9 `join` man page
 has an example that uses -e '' to explicitly specify the empty string as filler,
 which would have triggered our auto pad if we left it as above.

I'm glad you found that, confirming that
the conservative approach is better.



Re: [coreutils] join feature: auto-format

2010-10-07 Thread Pádraig Brady
On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:
 Hello,

 I'd like to (re)suggest a feature for the join program - the ability to
 automatically build an output format line (similar to, but easier than,
 using -o).

 I've previously mentioned it here (but got no favorable responses):
 http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html

 Several people have been using this option for a year now (on our local 
 servers), so I thought I might try to suggest it again.

 The full patch is attached, and also available here:
 http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch

 Here's the common use case:

 Given two tabular files, with a common key in the first column, and many
 numeric (or other) values in the other columns, the user wants to join them
 together easily.
 One requirement is that empty/missing values should be populated with 00.

 File 1
 ==
 bar 10 13 15 16 11 32
 foo 10 10 11 12 13 14


 File 2
 ==
 bar 99 91 90 93 91 93
 baz 90 91 99 96 97 95


 Desired joined output
 ==
 bar 10 13 15 16 11 32 99 91 90 93 91 93
 baz 00 00 00 00 00 00 90 91 99 96 97 95
 foo 10 10 11 12 13 14 00 00 00 00 00 00

 There is no technical problem in achieving this, the parameters would be:
 -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7

 But building the -o parameter is cumbersome, and error-prone (imagine
 files with dozens of columns, which is very common in my case).

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.
 
 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:
 
 -o {all (default), padded, FORMAT}
 
 where padded is the functionality you're suggesting?

Thinking more about it, we mightn't need any new options at all.
Currently -e is redundant if -o is not specified.
So how about changing that so that if -e is specified
we operate as above by auto inserting empty fields?
Also I wouldn't base it on the number of fields in the first line,
instead auto padding to the biggest number of fields
on the current lines under consideration.

cheers,
Pádraig.
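
The claim that -e has no effect without -o can be checked with a small
sketch. It assumes an unpatched join, default join fields, and
whitespace-separated input with no empty fields; the file contents and paths
are made up for illustration:

```shell
#!/bin/sh
# Sketch only: compare join output with and without -e when -o is absent.
export LC_ALL=C
tmp=$(mktemp -d)                # illustrative scratch directory

printf '%s\n' 'a 1 2' 'b 1' 'd 1 2'   > "$tmp/file1"
printf '%s\n' 'a 3 4' 'b 3 4' 'c 3 4' > "$tmp/file2"

without_e=$(join -a1 -a2 "$tmp/file1" "$tmp/file2")
with_e=$(join -a1 -a2 -e. "$tmp/file1" "$tmp/file2")

# Report whether the filler string changed anything.
if [ "$without_e" = "$with_e" ]; then
  echo "identical"
else
  echo "different"
fi

printf '%s\n' "$with_e"
```

With an unpatched GNU join this is expected to report "identical" and to
print ragged lines such as "b 1 3 4" and "c 3 4", which is the output the
proposal wants padded.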



Re: [coreutils] join feature: auto-format

2010-10-07 Thread Assaf Gordon
Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?
 
 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base it on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.

My concern is the principle of least surprise - if there are existing 
scripts/programs that specify -e without -o (doesn't make sense, but still 
possible) - this change will alter their behavior.

Also, implying/forcing 'auto-format' when -e is used without -o might be a 
bit confusing.
I prefer to have the user explicitly ask for auto-format - at least he/she will
know what the output will look like.

That being said,
I can send a new patch with either of the new methods (implicit auto-format
or -o padded) - which one is preferred?

Thanks,
 -gordon




Re: [coreutils] join feature: auto-format

2010-10-07 Thread Pádraig Brady
On 07/10/10 18:43, Assaf Gordon wrote:
 Pádraig Brady wrote, On 10/07/2010 06:22 AM:
 On 07/10/10 01:03, Pádraig Brady wrote:
 On 06/10/10 21:41, Assaf Gordon wrote:

 The --auto-format feature simply builds the -o format line 
 automatically, based on the number of columns from both input files.

 Thanks for persisting with this and presenting a concise example.
 I agree that this is useful and can't think of a simple workaround.
 Perhaps the interface would be better as:

 -o {all (default), padded, FORMAT}

 where padded is the functionality you're suggesting?

 Thinking more about it, we mightn't need any new options at all.
 Currently -e is redundant if -o is not specified.
 So how about changing that so that if -e is specified
 we operate as above by auto inserting empty fields?
 Also I wouldn't base it on the number of fields in the first line,
 instead auto padding to the biggest number of fields
 on the current lines under consideration.
 
 My concern is the principle of least surprise - if there are existing 
 scripts/programs that specify -e without -o (doesn't make sense, but 
 still possible) - this change will alter their behavior.
 
 Also, implying/forcing 'auto-format' when -e is used without -o might be 
 a bit confusing.

Well seeing as -e without -o currently does nothing,
I don't think we need to worry too much about changing that behavior.
Also to me, specifying -e EMPTY implicitly means I want
fields missing from one of the files replaced with EMPTY.

Note POSIX is more explicit, and describes our current operation:

-e EMPTY
  Replace empty output fields in the list selected by -o with EMPTY

So changing that would be an extension to POSIX.
But I still think it makes sense.
I'll prepare a patch soon, to do as I describe above,
unless there are objections.

cheers,
Pádraig.



Re: [coreutils] join feature: auto-format

2010-10-06 Thread Pádraig Brady
On 06/10/10 21:41, Assaf Gordon wrote:
 Hello,
 
 I'd like to (re)suggest a feature for the join program - the ability to
 automatically build an output format line (similar to, but easier than,
 using -o).
 
 I've previously mentioned it here (but got no favorable responses):
 http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html
 
 Several people have been using this option for a year now (on our local 
 servers), so I thought I might try to suggest it again.
 
 The full patch is attached, and also available here:
 http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch
 
 Here's the common use case:
 
 Given two tabular files, with a common key in the first column, and many
 numeric (or other) values in the other columns, the user wants to join them
 together easily.
 One requirement is that empty/missing values should be populated with 00.
 
 File 1
 ==
 bar 10 13 15 16 11 32
 foo 10 10 11 12 13 14
 
 
 File 2
 ==
 bar 99 91 90 93 91 93
 baz 90 91 99 96 97 95
 
 
 Desired joined output
 ==
 bar 10 13 15 16 11 32 99 91 90 93 91 93
 baz 00 00 00 00 00 00 90 91 99 96 97 95
 foo 10 10 11 12 13 14 00 00 00 00 00 00
 
 There is no technical problem in achieving this, the parameters would be:
 -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7
 
 But building the -o parameter is cumbersome, and error-prone (imagine files
 with dozens of columns, which is very common in my case).
 
 The --auto-format feature simply builds the -o format line automatically, 
 based on the number of columns from both input files.

Thanks for persisting with this and presenting a concise example.
I agree that this is useful and can't think of a simple workaround.
Perhaps the interface would be better as:

-o {all (default), padded, FORMAT}

where padded is the functionality you're suggesting?

cheers,
Pádraig.
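
Until such an option exists, the long -o list quoted above can be generated
rather than typed. The following is a rough sketch, not the attached patch:
the helper names build_format and first_line_fields are hypothetical, the
join key is assumed to be field 1 of both files, and taking the column count
from the first line of each (non-empty) file is an assumption made here for
simplicity.

```shell
#!/bin/sh
# Sketch only: hypothetical helpers that build an -o list for join.

# build_format N1 N2: print "0,1.2,...,1.N1,2.2,...,2.N2", assuming the
# join key is field 1 in both files.
build_format() {
  list=0
  i=2
  while [ "$i" -le "$1" ]; do list="$list,1.$i"; i=$((i + 1)); done
  i=2
  while [ "$i" -le "$2" ]; do list="$list,2.$i"; i=$((i + 1)); done
  printf '%s\n' "$list"
}

# first_line_fields FILE: number of whitespace-separated fields on line 1.
# Assumes FILE is not empty.
first_line_fields() {
  awk 'NR == 1 { print NF; exit }' "$1"
}

export LC_ALL=C
tmp=$(mktemp -d)                # illustrative scratch directory
printf '%s\n' 'bar 10 13 15 16 11 32' 'foo 10 10 11 12 13 14' > "$tmp/file1"
printf '%s\n' 'bar 99 91 90 93 91 93' 'baz 90 91 99 96 97 95' > "$tmp/file2"

fmt=$(build_format "$(first_line_fields "$tmp/file1")" \
                   "$(first_line_fields "$tmp/file2")")
printf '%s\n' "$fmt"
join -a1 -a2 -e 00 -o "$fmt" "$tmp/file1" "$tmp/file2"
```

For the two seven-column files this yields the same -o list as the one
written out by hand in the proposal, so the join output should match the
desired listing.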