On 28/01/11 18:10, Jim Meyering wrote:
> I would like to release coreutils-8.10 next week, including the changes
> on the fiemap-copy-2 branch (shortly to be extended).  I'll make a test
> release as soon as the FIEMAP changes have been merged to master.
>
> If anyone has additional changes that they would like to see included,
> please let us know.

I intend to apply the attached:

  join: ensure --header skips the order check with empty files
  join: don't report disorder against an empty file
  join: add -o 'auto' to output a constant number of fields per line
  doc: add alternatives for field processing not supported by cut

cheers,
Pádraig.

From c36b20e51f028d13d06c27d86c0cf009313bacd3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Fri, 14 Jan 2011 08:46:21 +0000
Subject: [PATCH] join: ensure --header skips the order check with empty files

* src/join.c: Skip the header even if one of the files is empty.
* tests/misc/join: Add a test case.
* NEWS: Mention the fix
---
 NEWS            |    3 +++
 src/join.c      |   12 ++++++++----
 tests/misc/join |    6 ++++++
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/NEWS b/NEWS
index 9ccad63..82565b0 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,9 @@ GNU coreutils NEWS -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+  join --header now skips the ordering check for the first line
+  even if the other file is empty.
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/src/join.c b/src/join.c
index afda5a1..07ac856 100644
--- a/src/join.c
+++ b/src/join.c
@@ -627,13 +627,17 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
-  if (join_header_lines && seq1.count && seq2.count)
+  if (join_header_lines && (seq1.count || seq2.count))
     {
-      prjoin (seq1.lines[0], seq2.lines[0]);
+      struct line const *hline1 = seq1.count ? seq1.lines[0] : &uni_blank;
+      struct line const *hline2 = seq2.count ? seq2.lines[0] : &uni_blank;
+      prjoin (hline1, hline2);
       prevline[0] = NULL;
       prevline[1] = NULL;
-      advance_seq (fp1, &seq1, true, 1);
-      advance_seq (fp2, &seq2, true, 2);
+      if (seq1.count)
+        advance_seq (fp1, &seq1, true, 1);
+      if (seq2.count)
+        advance_seq (fp2, &seq2, true, 2);
     }
 
   while (seq1.count && seq2.count)
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..0299427 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -219,6 +219,12 @@ my @tv = (
  [ "ID1 Name\n1 A\n2 B\n", "ID2 Color\n1 red\n"],
    "ID1 Name Color\n1 A red\n", 0],
 
+# '--header' doesn't check order of a header
+# even if there is no header in the second file
+['header-6', '--header -a1',
+ [ "ID1 Name\n1 A\n", ""],
+   "ID1 Name\n1 A\n", 0],
+
 );
 
 # Convert the above old-style test vectors to the newer
-- 
1.7.3.4

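The effect of the --header patch above can be checked by hand. The following is only a minimal sketch: it assumes a GNU join with this change applied (coreutils 8.10 or later), it mirrors the header-6 test vector, and the function name is made up for illustration.

```shell
# Illustrative helper, not part of the patch.
# File 1 (stdin) has a header and one record; file 2 is empty (/dev/null).
# Assumption: GNU join with the --header fix (coreutils >= 8.10).
header_with_empty_file () {
  printf 'ID1 Name\n1 A\n' | LC_ALL=C join --header -a1 - /dev/null
}

header_with_empty_file
# Expected, per the header-6 test vector:
# ID1 Name
# 1 A
```

Using /dev/null for the second input stands in for the empty string used in the test harness.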
From 1ebde7af7d1e122bcb2d0935d0b37af21156f357 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Thu, 27 Jan 2011 07:17:16 +0000
Subject: [PATCH] join: don't report disorder against an empty file

This allows one to use join as a field extractor like:
  join -a1 -o 1.3,1.1 - /dev/null

* src/join.c (join): Don't flag unpairable lines when
one of the files is empty.
* tests/misc/join: Add a new test for empty input,
and adjust a previous test that was only checking
against empty input.
* doc/coreutils.texi (join invocation): Document the change.
* NEWS: Likewise.
---
 NEWS               |    6 ++++++
 doc/coreutils.texi |   18 +++++++++++++-----
 src/join.c         |    8 +++++---
 tests/misc/join    |    6 +++++-
 4 files changed, 29 insertions(+), 9 deletions(-)

diff --git a/NEWS b/NEWS
index 9ccad63..422bbe6 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,12 @@ GNU coreutils NEWS -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+** Changes in behavior
+
+  join no longer reports disorder when one of the files is empty.
+  This allows one to use join as a field extractor like:
+  join -a1 -o 1.3,1.1 - /dev/null
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..c2a7580 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -4761,11 +4761,17 @@ If there is an error it exits with nonzero status.
 @macro checkOrderOption{cmd}
 If the @option{--check-order} option is given, unsorted inputs will
 cause a fatal error message.  If the option @option{--nocheck-order}
-is given, unsorted inputs will never cause an error message.  If
-neither of these options is given, wrongly sorted inputs are diagnosed
-only if an input file is found to contain unpairable lines.  If an
-input file is diagnosed as being unsorted, the @command{\cmd\} command
-will exit with a nonzero status (and the output should not be used).
+is given, unsorted inputs will never cause an error message.  If neither
+of these options is given, wrongly sorted inputs are diagnosed
+only if an input file is found to contain unpairable
+@ifset JOIN_COMMAND
+lines, and when both input files are non empty.
+@end ifset
+@ifclear JOIN_COMMAND
+lines.
+@end ifclear
+If an input file is diagnosed as being unsorted, the @command{\cmd\}
+command will exit with a nonzero status (and the output should not be used).
 
 Forcing @command{\cmd\} to process wrongly sorted input files
 containing unpairable lines by specifying @option{--nocheck-order} is
@@ -5646,7 +5652,9 @@ c c1 c2
 b b1 b2
 @end example
 
+@set JOIN_COMMAND
 @checkOrderOption{join}
+@clear JOIN_COMMAND
 
 The defaults are:
 @itemize
diff --git a/src/join.c b/src/join.c
index afda5a1..6e10f61 100644
--- a/src/join.c
+++ b/src/join.c
@@ -711,7 +711,7 @@ join (FILE *fp1, FILE *fp2)
       seq2.count = 0;
     }
 
-  /* If the user did not specify --check-order, then we read the
+  /* If the user did not specify --nocheck-order, then we read the
      tail ends of both inputs to verify that they are in order.  We
      skip the rest of the tail once we have issued a warning for that
      file, unless we actually need to print the unpairable lines.  */
@@ -726,7 +726,8 @@ join (FILE *fp1, FILE *fp2)
     {
       if (print_unpairables_1)
         prjoin (seq1.lines[0], &uni_blank);
-      seen_unpairable = true;
+      if (seq2.count)
+        seen_unpairable = true;
       while (get_line (fp1, &line, 1))
         {
           if (print_unpairables_1)
@@ -740,7 +741,8 @@ join (FILE *fp1, FILE *fp2)
     {
       if (print_unpairables_2)
         prjoin (&uni_blank, seq2.lines[0]);
-      seen_unpairable = true;
+      if (seq1.count)
+        seen_unpairable = true;
       while (get_line (fp2, &line, 2))
         {
           if (print_unpairables_2)
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..3ce267c 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -189,7 +189,11 @@ my @tv = (
 
 # Before 6.10.143, this would mistakenly fail with the diagnostic:
 # join: File 1 is not in sorted order
-['chkodr-7', '-12', ["2 a\n1 b\n", ""], "", 0],
+['chkodr-7', '-12', ["2 a\n1 b\n", "2 c\n1 d"], "", 0],
+
+# After 8.9, join doesn't report disorder by default
+# when comparing against an empty input file.
+['chkodr-8', '', ["2 a\n1 b\n", ""], "", 0],
 
 # Test '--header' feature
 ['header-1', '--header',
-- 
1.7.3.4

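The field-extractor idiom from the commit message above can be tried as follows. This is a sketch under stated assumptions: it needs a GNU join with this patch applied (coreutils 8.10 or later), and the helper name and sample data are invented for illustration.

```shell
# Illustrative helper, not part of the patch: print field 3, then field 1,
# of every line on stdin.  The second input is empty (/dev/null), so no line
# can pair and -a1 prints them all.  With the patch, unsorted input is not
# diagnosed in this case.  Assumption: GNU join >= 8.10.
extract_3_1 () {
  LC_ALL=C join -a1 -o 1.3,1.1 - /dev/null
}

# Deliberately unsorted on the join field (2 before 1).
printf '2 x y\n1 p q\n' | extract_3_1
# Expected:
# y 2
# q 1
```

Lines with fewer than three fields would print an empty first column unless a filler is given with -e.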
From 29746dc55e6176934388c772dbe70012859897ff Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 5 Jan 2011 11:52:54 +0000
Subject: [PATCH] join: add -o 'auto' to output a constant number of fields per line

Lines with a different number of fields than the first line,
will be truncated or padded.

* src/join.c (prfields): A new function refactored from prjoin(),
to output all but the join field.
(prjoin): Don't swap line1 and line2 when line1 is blank
so that the padding is applied to the right place.
(main): Handle the -o 'auto' option.
* tests/misc/join: Add 6 new cases to test the auto format.
* NEWS: Mention the change in behavior.
Suggestion from Assaf Gordon
---
 NEWS               |    6 +++
 doc/coreutils.texi |   19 ++++++++---
 src/join.c         |   88 +++++++++++++++++++++++++++++++++------------------
 tests/misc/join    |   20 ++++++++++++
 4 files changed, 96 insertions(+), 37 deletions(-)

diff --git a/NEWS b/NEWS
index 9ccad63..a9d329a 100644
--- a/NEWS
+++ b/NEWS
@@ -13,6 +13,12 @@ GNU coreutils NEWS -*- outline -*-
   rm -f no longer fails for EINVAL or EILSEQ on file systems that
   reject file names invalid for that file system.
 
+** New features
+
+  join now supports -o 'auto' which will automatically infer the
+  output format from the first line in each file, to ensure
+  the same number of fields are output for each line.
+
 
 * Noteworthy changes in release 8.9 (2011-01-04) [stable]
 
diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..9397ab3 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -5675,8 +5675,8 @@ Do not check that both input files are in sorted order. This is the default.
 
 @item -e @var{string}
 @opindex -e
-Replace those output fields that are missing in the input with
-@var{string}.
+Replace those output fields that are missing in the input with @var{string}.
+I.e., missing fields specified with the @option{-12jo} options.
 
 @item --header
 @opindex --header
@@ -5707,10 +5707,17 @@ Join on field @var{field} (a positive integer) of file 2.
 Equivalent to @option{-1 @var{field} -2 @var{field}}.
 
 @item -o @var{field-list}
-Construct each output line according to the format in @var{field-list}.
-Each element in @var{field-list} is either the single character @samp{0} or
-has the form @var{m.n} where the file number, @var{m}, is @samp{1} or
-@samp{2} and @var{n} is a positive field number.
+@itemx -o auto
+If the keyword @samp{auto} is specified, infer the output format from
+the first line in each file.  This is the same as the default output format
+but also ensures the same number of fields are output for each line.
+Missing fields are replaced with the @option{-e} option and extra fields
+are discarded.
+
+Otherwise, construct each output line according to the format in
+@var{field-list}.  Each element in @var{field-list} is either the single
+character @samp{0} or has the form @var{m.n} where the file number, @var{m},
+is @samp{1} or @samp{2} and @var{n} is a positive field number.
 
 A field specification of @samp{0} denotes the join field.
 In most cases, the functionality of the @samp{0} field spec
diff --git a/src/join.c b/src/join.c
index afda5a1..bf7e908 100644
--- a/src/join.c
+++ b/src/join.c
@@ -112,6 +112,13 @@ static bool issued_disorder_warning[2];
 /* Empty output field filler.  */
 static char const *empty_filler;
 
+/* Whether to ensure the same number of fields are output from each line.  */
+static bool autoformat;
+/* The number of fields to output for each line.
+   Only significant when autoformat is true.  */
+static size_t autocount_1;
+static size_t autocount_2;
+
 /* Field to join on; SIZE_MAX means they haven't been determined yet.  */
 static size_t join_field_1 = SIZE_MAX;
 static size_t join_field_2 = SIZE_MAX;
@@ -210,7 +217,8 @@ else fields are separated by CHAR. Any FIELD is a field number counted\n\
 from 1.  FORMAT is one or more comma or blank separated specifications,\n\
 each being `FILENUM.FIELD' or `0'.  Default FORMAT outputs the join field,\n\
 the remaining fields from FILE1, the remaining fields from FILE2, all\n\
-separated by CHAR.\n\
+separated by CHAR.  If FORMAT is the keyword 'auto', then the first\n\
+line of each file determines the number of fields output for each line.\n\
 \n\
 Important: FILE1 and FILE2 must be sorted on the join fields.\n\
 E.g., use ` sort -k 1b,1 ' if `join' has no options,\n\
@@ -527,6 +535,27 @@ prfield (size_t n, struct line const *line)
     fputs (empty_filler, stdout);
 }
 
+/* Output all the fields in line, other than the join field.  */
+
+static void
+prfields (struct line const *line, size_t join_field, size_t autocount)
+{
+  size_t i;
+  size_t nfields = autoformat ? autocount : line->nfields;
+  char output_separator = tab < 0 ? ' ' : tab;
+
+  for (i = 0; i < join_field && i < nfields; ++i)
+    {
+      putchar (output_separator);
+      prfield (i, line);
+    }
+  for (i = join_field + 1; i < nfields; ++i)
+    {
+      putchar (output_separator);
+      prfield (i, line);
+    }
+}
+
 /* Print the join of LINE1 and LINE2.  */
 
 static void
@@ -534,6 +563,8 @@ prjoin (struct line const *line1, struct line const *line2)
 {
   const struct outlist *outlist;
   char output_separator = tab < 0 ? ' ' : tab;
+  size_t field;
+  struct line const *line;
 
   outlist = outlist_head.next;
   if (outlist)
@@ -543,9 +574,6 @@ prjoin (struct line const *line1, struct line const *line2)
       o = outlist;
       while (1)
         {
-          size_t field;
-          struct line const *line;
-
           if (o->file == 0)
             {
               if (line1 == &uni_blank)
@@ -574,37 +602,24 @@ prjoin (struct line const *line1, struct line const *line2)
     }
   else
     {
-      size_t i;
-
       if (line1 == &uni_blank)
         {
-          struct line const *t;
-          t = line1;
-          line1 = line2;
-          line2 = t;
+          line = line2;
+          field = join_field_2;
         }
-      prfield (join_field_1, line1);
-      for (i = 0; i < join_field_1 && i < line1->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line1);
-        }
-      for (i = join_field_1 + 1; i < line1->nfields; ++i)
+      else
         {
-          putchar (output_separator);
-          prfield (i, line1);
+          line = line1;
+          field = join_field_1;
         }
 
-      for (i = 0; i < join_field_2 && i < line2->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line2);
-        }
-      for (i = join_field_2 + 1; i < line2->nfields; ++i)
-        {
-          putchar (output_separator);
-          prfield (i, line2);
-        }
+      /* Output the join field.  */
+      prfield (field, line);
+
+      /* Output other fields.  */
+      prfields (line1, join_field_1, autocount_1);
+      prfields (line2, join_field_2, autocount_2);
+
       putchar ('\n');
     }
 }
@@ -627,6 +642,12 @@ join (FILE *fp1, FILE *fp2)
   initseq (&seq2);
   getseq (fp2, &seq2, 2);
 
+  if (autoformat)
+    {
+      autocount_1 = seq1.count ? seq1.lines[0]->nfields : 0;
+      autocount_2 = seq2.count ? seq2.lines[0]->nfields : 0;
+    }
+
   if (join_header_lines && seq1.count && seq2.count)
     {
       prjoin (seq1.lines[0], seq2.lines[0]);
@@ -1037,8 +1058,13 @@ main (int argc, char **argv)
           break;
 
         case 'o':
-          add_field_list (optarg);
-          optc_status = MIGHT_BE_O_ARG;
+          if (STREQ (optarg, "auto"))
+            autoformat = true;
+          else
+            {
+              add_field_list (optarg);
+              optc_status = MIGHT_BE_O_ARG;
+            }
           break;
 
         case 't':
diff --git a/tests/misc/join b/tests/misc/join
index 3696a03..3cf278b 100755
--- a/tests/misc/join
+++ b/tests/misc/join
@@ -127,6 +127,26 @@ my @tv = (
 # From David Dyck
 ['9a', '', [" a 1\n b 2\n", " a Y\n b Z\n"], "a 1 Y\nb 2 Z\n", 0],
 
+# -o 'auto'
+['10a', '-a1 -a2 -e . -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "a 1 2 3 4\nb 1 . 3 4\nc . . 3 4\nd 1 2 . .\n", 0],
+['10b', '-a1 -a2 -j3 -e . -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "2 a 1 . .\n. b 1 . .\n2 d 1 . .\n4 . . a 3\n4 . . b 3\n4 . . c 3\n"],
+['10c', '-a1 -1 1 -2 4 -e. -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ "a 1 2 . . .\nb 1 . . . .\nd 1 2 . . .\n"],
+['10d', '-a2 -1 1 -2 4 -e. -o auto',
+ ["a 1 2\nb 1\nd 1 2\n", "a 3 4\nb 3 4\nc 3 4\n"],
+ ". . . a 3 4\n. . . b 3 4\n. . . c 3 4\n"],
+['10e', '-o auto',
+ ["a 1 2\nb 1 2 discard\n", "a 3 4\nb 3 4 discard\n"],
+ "a 1 2 3 4\nb 1 2 3 4\n"],
+['10f', '-t, -o auto',
+ ["a,1,,2\nb,1,2\n", "a,3,4\nb,3,4\n"],
+ "a,1,,2,3,4\nb,1,2,,3,4\n"],
+
 # From Tim Smithers: fixed in 1.22l
 ['trailing-sp', '-t: -1 1 -2 1', ["a:x \n", "a:y \n"], "a:x :y \n", 0],
 
-- 
1.7.3.4

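To show what -o 'auto' does from the command line, here is a small sketch built on the inputs of test case 10a above. It assumes a GNU join that includes this patch (coreutils 8.10 or later); the function name and temporary file names are illustrative only.

```shell
# Illustrative helper, not part of the patch.
# Assumption: GNU join with -o auto support (coreutils >= 8.10).
join_auto_demo () {
  tmpdir=$(mktemp -d) || return 1
  printf 'a 1 2\nb 1\nd 1 2\n'   > "$tmpdir/f1"   # line "b 1" is one field short
  printf 'a 3 4\nb 3 4\nc 3 4\n' > "$tmpdir/f2"

  # The first line of each file has 3 fields, so every output line gets
  # 5 fields: the join field, 2 from f1 and 2 from f2.  Missing fields
  # are filled with the -e string.
  LC_ALL=C join -a1 -a2 -e . -o auto "$tmpdir/f1" "$tmpdir/f2"

  rm -rf "$tmpdir"
}

join_auto_demo
# Expected, per test 10a:
# a 1 2 3 4
# b 1 . 3 4
# c . . 3 4
# d 1 2 . .
```

A constant field count keeps the output easy to feed into tools that expect a fixed number of columns.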
From 97ae252c17fdf913cad04438fa873e840c397977 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?P=C3=A1draig=20Brady?= <[email protected]>
Date: Wed, 24 Nov 2010 08:37:23 +0000
Subject: [PATCH] doc: add alternatives for field processing not supported by cut

* doc/coreutils.texi (cut invocation): Remove the tr -s '[:blank:]'
example, as it doesn't handle leading and trailing blanks.
Add `awk` examples for common field processing operations
often asked about.  Also document a `join` hack, to achieve the same thing.
---
 doc/coreutils.texi |   23 +++++++++++++++++++----
 1 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/doc/coreutils.texi b/doc/coreutils.texi
index 85d5201..b30b68f 100644
--- a/doc/coreutils.texi
+++ b/doc/coreutils.texi
@@ -5474,11 +5474,26 @@ Select for printing only the fields listed in @var{field-list}.
 Fields are separated by a TAB character by default.  Also print any
 line that contains no delimiter character, unless the
 @option{--only-delimited} (@option{-s}) option is specified.
-Note @command{cut} does not support specifying runs of whitespace as a
-delimiter, so to achieve that common functionality one can pre-process
-with @command{tr} like:
+
+Note @command{awk} supports more sophisticated field processing,
+and by default will use (and discard) runs of blank characters to
+separate fields, and ignore leading and trailing blanks.
+@example
+@verbatim
+awk '{print $2}'      # print the second field
+awk '{print $(NF-1)}' # print the penultimate field
+awk '{print $2,$1}'   # reorder the first two fields
+@end verbatim
+@end example
+
+In the unlikely event that @command{awk} is unavailable,
+one can use the @command{join} command, to process blank
+characters as @command{awk} does above.
 @example
-tr -s '[:blank:]' '\t' | cut -f@dots{}
+@verbatim
+join -a1 - /dev/null -o 1.2     # print the second field
+join -a1 - /dev/null -o 1.2,1.1 # reorder the first two fields
+@end verbatim
 @end example
 
 @item -d @var{input_delim_byte}
-- 
1.7.3.4
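
The awk and join idioms documented by the patch above can be compared on one sample line. This is a hedged sketch: the helper names and the sample data are invented here, and the join variant assumes GNU join with the empty-file change (coreutils 8.10 or later). Note that the penultimate field needs parentheses, `$(NF-1)`, because `$NF-1` subtracts one from the value of the last field.

```shell
# Sample line with leading, repeated and trailing blanks (made up).
sample='  a   b  c '

# Illustrative helpers, each reading lines on stdin.
second_field_awk ()    { awk '{print $2}'; }
penultimate_awk ()     { awk '{print $(NF-1)}'; }
swap_first_two_awk ()  { awk '{print $2,$1}'; }
# Assumption: GNU join >= 8.10, so the empty second input is not an error.
swap_first_two_join () { LC_ALL=C join -a1 -o 1.2,1.1 - /dev/null; }

printf '%s\n' "$sample" | second_field_awk      # b
printf '%s\n' "$sample" | penultimate_awk       # b
printf '%s\n' "$sample" | swap_first_two_awk    # b a
printf '%s\n' "$sample" | swap_first_two_join   # b a
```

Both tools split on runs of blanks and ignore leading and trailing blanks, which is what cut cannot do on its own.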
