Re: [coreutils] join feature: auto-format
Pádraig Brady wrote, On 01/11/2011 07:35 AM:
> Spending another few minutes on this, I realized that we should not be trying to homogenize the number of fields from each file, but rather the fields used for a particular file in each line. The only sensible basis for that is the first line as previously suggested. The interface would be a little different for that. I was thinking of:
>
>   -o 'header'  Infer the format from the first line of each file

I second the idea of using the first line as the basis for the auto-formatting, but have a reservation about the wording: '-o header' somewhat implies that the first line has to be an actual header line (with column names or similar), while it can just be the first line of actual data if the file doesn't have a header line.

Something like '-o auto' might be less confusing.

Just my 2 cents,
 -gordon
Re: [coreutils] join feature: auto-format
Pádraig Brady wrote:
> On 12/01/11 18:04, Assaf Gordon wrote:
>> Pádraig Brady wrote, On 01/11/2011 07:35 AM:
>>> Spending another few minutes on this, I realized that we should not be trying to homogenize the number of fields from each file, but rather the fields used for a particular file in each line. The only sensible basis for that is the first line as previously suggested. The interface would be a little different for that. I was thinking of:
>>>
>>>   -o 'header'  Infer the format from the first line of each file
>>
>> I second the idea of using the first line as the basis for the auto-formatting, but have a reservation about the wording: '-o header' somewhat implies that the first line has to be an actual header line (with column names or similar), while it can just be the first line of actual data if the file doesn't have a header line.
>>
>> Something like '-o auto' might be less confusing.
>
> But less descriptive. I'll go with auto as I can't think of anything better.

-o auto is indeed more apt. Thanks for the suggestion.
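For anyone reading the archive later, here is a rough sketch of how the '-o auto' spelling agreed on above might be used. It assumes a GNU join recent enough to accept '-o auto'; the temporary files and the shortened sample rows are made up for illustration and are not taken from the patch.

```shell
# Sketch only: assumes GNU join with '-o auto' support.
# The file names and sample rows are illustrative.
tmp=$(mktemp -d)
printf '%s\n' 'bar 10 13 15' 'foo 10 10 11' > "$tmp/file1"
printf '%s\n' 'bar 99 91 90' 'baz 90 91 99' > "$tmp/file2"

# -o auto infers the number of fields per file from each file's first line;
# -e 00 supplies the filler for fields that are missing on unpaired lines.
auto_out=$(LC_ALL=C join -a1 -a2 -e 00 -o auto "$tmp/file1" "$tmp/file2")
printf '%s\n' "$auto_out"
# bar 10 13 15 99 91 90
# baz 00 00 00 90 91 99
# foo 10 10 11 00 00 00

rm -rf "$tmp"
```

The first line of each file does not need to be a header here; it only fixes the field count, which is the reason 'auto' was preferred over 'header'.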
Re: [coreutils] join feature: auto-format
On 06/01/11 12:05, Pádraig Brady wrote:
> On 07/10/10 19:25, Pádraig Brady wrote:
>> On 07/10/10 18:43, Assaf Gordon wrote:
>>> Pádraig Brady wrote, On 10/07/2010 06:22 AM:
>>>> On 07/10/10 01:03, Pádraig Brady wrote:
>>>>> On 06/10/10 21:41, Assaf Gordon wrote:
>>>>>> The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files.
>>>>>
>>>>> Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as:
>>>>>
>>>>>   -o {all (default), padded, FORMAT}
>>>>>
>>>>> where padded is the functionality you're suggesting?
>>>>
>>>> Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration.
>>>
>>> My concern is the principle of least surprise - if there are existing scripts/programs that specify -e without -o (doesn't make sense, but still possible) - this change will alter their behavior.
>>>
>>> Also, implying/forcing 'auto-format' when -e is used without -o might be a bit confusing.
>>
>> Well seeing as -e without -o currently does nothing, I don't think we need to worry too much about changing that behavior. Also to me, specifying -e EMPTY implicitly means I want fields missing from one of the files replaced with EMPTY.
>>
>> Note POSIX is more explicit, and describes our current operation:
>>
>>   -e EMPTY
>>       Replace empty output fields in the list selected by -o with EMPTY
>>
>> So changing that would be an extension to POSIX. But I still think it makes sense. I'll prepare a patch soon, to do as I describe above, unless there are objections.
>
> The attached changes `join` (from what's done on other platforms) so that...
>
> `join -e` will automatically pad missing fields from one file so that the same number of fields are output from each file. Previously -e was only used for missing fields specified with -o or -j.
>
> With this change join now does:
>
>     $ cat file1
>     a 1 2
>     b 1
>     d 1 2
>
>     $ cat file2
>     a 3 4
>     b 3 4
>     c 3 4
>
>     $ join -a1 -a2 -1 1 -2 1 -e. file1 file2
>     a 1 2 3 4
>     b 1 . 3 4
>     c . . 3 4
>     d 1 2 . .
>
>     $ join -a1 -a2 -1 1 -2 4 -e. file1 file2
>     . . . . a 3 4 . . . . b 3 4 . . . . c 3 4 a 1 2 . . b 1 . d 1 2 . .
>
>     $ join -a1 -a2 -1 4 -2 1 -e. file1 file2
>     . a 1 2 . . . . b 1 . . . d 1 2 . . . a . . 3 4 b . . 3 4 c . . 3 4
>
>     $ join -a1 -a2 -1 4 -2 4 -e. file1 file2
>     . a 1 2 a 3 4
>     . a 1 2 b 3 4
>     . a 1 2 c 3 4
>     . b 1 . a 3 4
>     . b 1 . b 3 4
>     . b 1 . c 3 4
>     . d 1 2 a 3 4
>     . d 1 2 b 3 4
>     . d 1 2 c 3 4
>
> While -e without -o was previously a noop, and so could safely be extended IMHO, this will also change the behavior when -e and -j are specified. Previously if -j 1 was specified, and that field was missing, then -e would be used in its place, rather than the empty string. This still does that, but also does the padding.
>
> Without the -j issue I'd be 80:20 for just extending -e to auto pad, but given -j I'm 50:50. The alternative is to select this with say '-o padded', but that's less discoverable, and complicates the interface somewhat.

Considering this more, I think it's safer to auto pad only when '-o padded' is specified. I notice the plan9 `join` man page has an example that uses -e '' to explicitly specify the NUL string as filler, which would have triggered our auto pad if we left it as above.

cheers,
Pádraig.
Re: [coreutils] join feature: auto-format
Pádraig Brady wrote:
> ...
>> While -e without -o was previously a noop, and so could safely be extended IMHO, this will also change the behavior when -e and -j are specified. Previously if -j 1 was specified, and that field was missing, then -e would be used in its place, rather than the empty string. This still does that, but also does the padding.
>>
>> Without the -j issue I'd be 80:20 for just extending -e to auto pad, but given -j I'm 50:50. The alternative is to select this with say '-o padded', but that's less discoverable, and complicates the interface somewhat.
>
> Considering this more, I think it's safer to auto pad only when '-o padded' is specified. I notice the plan9 `join` man page has an example that uses -e '' to explicitly specify the NUL string as filler, which would have triggered our auto pad if we left it as above.

I'm glad you found that, confirming that the conservative approach is better.
Re: [coreutils] join feature: auto-format
On 07/10/10 01:03, Pádraig Brady wrote:
> On 06/10/10 21:41, Assaf Gordon wrote:
>> Hello,
>>
>> I'd like to (re)suggest a feature for the join program - the ability to automatically build an output format line (similar to, but easier than, using -o).
>>
>> I've previously mentioned it here (but got no favorable responses):
>> http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html
>>
>> Several people have been using this option for a year now (on our local servers), so I thought I might try to suggest it again.
>>
>> The full patch is attached, and also available here:
>> http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch
>>
>> Here's the common use case: Given two tabular files, with a common key at first column, and many numeric (or other) values on other columns, the user wants to join them together easily. One requirement is that empty/missing values should be populated with 00.
>>
>> File 1 ==
>> bar 10 13 15 16 11 32
>> foo 10 10 11 12 13 14
>>
>> File 2 ==
>> bar 99 91 90 93 91 93
>> baz 90 91 99 96 97 95
>>
>> Desired joined output ==
>> bar 10 13 15 16 11 32 99 91 90 93 91 93
>> baz 00 00 00 00 00 00 90 91 99 96 97 95
>> foo 10 10 11 12 13 14 00 00 00 00 00 00
>>
>> There is no technical problem in achieving this, the parameters would be:
>>
>>   -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7
>>
>> But building the -o parameter is cumbersome, and error-prone (imagine files with dozens of columns, which is very common in my case).
>>
>> The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files.
>
> Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as:
>
>   -o {all (default), padded, FORMAT}
>
> where padded is the functionality you're suggesting?

Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration.

cheers,
Pádraig.
Re: [coreutils] join feature: auto-format
Pádraig Brady wrote, On 10/07/2010 06:22 AM:
> On 07/10/10 01:03, Pádraig Brady wrote:
>> On 06/10/10 21:41, Assaf Gordon wrote:
>>> The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files.
>>
>> Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as:
>>
>>   -o {all (default), padded, FORMAT}
>>
>> where padded is the functionality you're suggesting?
>
> Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration.

My concern is the principle of least surprise - if there are existing scripts/programs that specify -e without -o (doesn't make sense, but still possible) - this change will alter their behavior.

Also, implying/forcing 'auto-format' when -e is used without -o might be a bit confusing. I prefer to have the user explicitly ask for auto-format - at least he/she will know what the output will look like.

That being said, I can send a new patch with one of the new methods (implicit autoformat or -o padded) - which one is preferred?

Thanks,
 -gordon
Re: [coreutils] join feature: auto-format
On 07/10/10 18:43, Assaf Gordon wrote:
> Pádraig Brady wrote, On 10/07/2010 06:22 AM:
>> On 07/10/10 01:03, Pádraig Brady wrote:
>>> On 06/10/10 21:41, Assaf Gordon wrote:
>>>> The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files.
>>>
>>> Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as:
>>>
>>>   -o {all (default), padded, FORMAT}
>>>
>>> where padded is the functionality you're suggesting?
>>
>> Thinking more about it, we mightn't need any new options at all. Currently -e is redundant if -o is not specified. So how about changing that so that if -e is specified we operate as above by auto inserting empty fields? Also I wouldn't base on the number of fields in the first line, instead auto padding to the biggest number of fields on the current lines under consideration.
>
> My concern is the principle of least surprise - if there are existing scripts/programs that specify -e without -o (doesn't make sense, but still possible) - this change will alter their behavior.
>
> Also, implying/forcing 'auto-format' when -e is used without -o might be a bit confusing.

Well seeing as -e without -o currently does nothing, I don't think we need to worry too much about changing that behavior. Also to me, specifying -e EMPTY implicitly means I want fields missing from one of the files replaced with EMPTY.

Note POSIX is more explicit, and describes our current operation:

  -e EMPTY
      Replace empty output fields in the list selected by -o with EMPTY

So changing that would be an extension to POSIX. But I still think it makes sense. I'll prepare a patch soon, to do as I describe above, unless there are objections.

cheers,
Pádraig.
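To make the "-e without -o currently does nothing" point concrete, here is a small sketch comparing the two invocations. The two-column sample files are invented for illustration; the behavior shown is what I would expect from join as described by the POSIX wording quoted above, with the join field present on every line.

```shell
# Sketch only: sample files are made up for illustration.
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' > "$tmp/file1"
printf '%s\n' 'a 3' 'c 4' > "$tmp/file2"

# Without -o, unpaired lines are printed as they are; -e has nothing to fill.
without_o=$(LC_ALL=C join -a1 -a2 -e . "$tmp/file1" "$tmp/file2")
printf '%s\n' "$without_o"
# a 1 3
# b 2
# c 4

# With an explicit -o list, -e fills the fields that are missing.
with_o=$(LC_ALL=C join -a1 -a2 -e . -o 0,1.2,2.2 "$tmp/file1" "$tmp/file2")
printf '%s\n' "$with_o"
# a 1 3
# b 2 .
# c . 4

rm -rf "$tmp"
```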
Re: [coreutils] join feature: auto-format
On 06/10/10 21:41, Assaf Gordon wrote:
> Hello,
>
> I'd like to (re)suggest a feature for the join program - the ability to automatically build an output format line (similar to, but easier than, using -o).
>
> I've previously mentioned it here (but got no favorable responses):
> http://lists.gnu.org/archive/html/bug-coreutils/2009-11/msg00151.html
>
> Several people have been using this option for a year now (on our local servers), so I thought I might try to suggest it again.
>
> The full patch is attached, and also available here:
> http://cancan.cshl.edu/labmembers/gordon/files/join_auto_format_2010_10_06.patch
>
> Here's the common use case: Given two tabular files, with a common key at first column, and many numeric (or other) values on other columns, the user wants to join them together easily. One requirement is that empty/missing values should be populated with 00.
>
> File 1 ==
> bar 10 13 15 16 11 32
> foo 10 10 11 12 13 14
>
> File 2 ==
> bar 99 91 90 93 91 93
> baz 90 91 99 96 97 95
>
> Desired joined output ==
> bar 10 13 15 16 11 32 99 91 90 93 91 93
> baz 00 00 00 00 00 00 90 91 99 96 97 95
> foo 10 10 11 12 13 14 00 00 00 00 00 00
>
> There is no technical problem in achieving this, the parameters would be:
>
>   -a1 -a2 -e 00 -o 0,1.2,1.3,1.4,1.5,1.6,1.7,2.2,2.3,2.4,2.5,2.6,2.7
>
> But building the -o parameter is cumbersome, and error-prone (imagine files with dozens of columns, which is very common in my case).
>
> The --auto-format feature simply builds the -o format line automatically, based on the number of columns from both input files.

Thanks for persisting with this and presenting a concise example. I agree that this is useful and can't think of a simple workaround. Perhaps the interface would be better as:

  -o {all (default), padded, FORMAT}

where padded is the functionality you're suggesting?

cheers,
Pádraig.
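As a rough illustration of what the proposal automates, the -o list from the example can be generated by a small wrapper. This is not the attached patch, only a sketch: count_fields and build_format are hypothetical helper names, the field count is taken from the first line of each file, and the join field is assumed to be field 1 in both files.

```shell
# Hypothetical helper (not part of join or of the attached patch):
# print the number of whitespace-separated fields on the first line of a file.
count_fields() {
  awk 'NR == 1 { print NF; found = 1; exit } END { if (!found) print 0 }' "$1"
}

# Hypothetical helper: build "0,1.2,...,1.N,2.2,...,2.M",
# assuming the join field is field 1 in both files.
build_format() {
  n1=$(count_fields "$1")
  n2=$(count_fields "$2")
  fmt=0
  i=2
  while [ "$i" -le "$n1" ]; do fmt="$fmt,1.$i"; i=$((i + 1)); done
  i=2
  while [ "$i" -le "$n2" ]; do fmt="$fmt,2.$i"; i=$((i + 1)); done
  printf '%s\n' "$fmt"
}

# The sample data from the example above, written to temporary files.
tmp=$(mktemp -d)
printf '%s\n' 'bar 10 13 15 16 11 32' 'foo 10 10 11 12 13 14' > "$tmp/file1"
printf '%s\n' 'bar 99 91 90 93 91 93' 'baz 90 91 99 96 97 95' > "$tmp/file2"

format=$(build_format "$tmp/file1" "$tmp/file2")
joined=$(LC_ALL=C join -a1 -a2 -e 00 -o "$format" "$tmp/file1" "$tmp/file2")
printf '%s\n' "$format" "$joined"
rm -rf "$tmp"
```

A wrapper like this has to read each file's first line before join runs, so it does not work on pipes, which is one reason doing it inside join is more convenient.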