RE: there is a bug with UNIX command join
Thanks for replying so quickly.

I tried

$ join -t \012 -v 2 j1 j2
$ join -t '\012' -v 2 j1 j2
$ join -t \012 -v 2 j1 j2

All three versions are doing the same wrong thing: they are including the 'eee' line, which is the last line of the file j1 and a middle line of j2, when it should not be included.

I also tried

$ comm -13 j1 j2

However, it does the same wrong thing as the previous three, again including the 'eee' line. I also suspect it might have something to do with the 'eee' line being the last line of the first file.

---

I tried your suggestion:

$ NL=$(printf \n)
$ echo $NL | od -o
000 12 001
$ join -t $NL -v 2 j1 j2

But it did something weird: it eliminated the 'eee' line, which is good, but then it replaced every long sequence of spaces with a single space. It looks like the program treated the space as the field separator character and ignored the line feed.

---

I tried one more thing: I created two other test files with much shorter lines and tried all four commands.

$ join -t \012 -v 2 s1 s2
$ join -t '\012' -v 2 s1 s2
$ join -t \012 -v 2 s1 s2
$ comm -13 s1 s2

And guess what? They all worked!

I believe the problem is that the lines in the files j1 and j2, which are almost 300 characters long, are too long for these commands to handle. It would be nice if these commands, i.e. join and comm, could handle much longer lines, say a default of 4096 characters, and a new option to specify a larger line size, say up to 65535 characters.

Another question: can these programs handle files that are 30 MB in size and have long lines?

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: June 20, 2003 10:49 PM
To: Robert Wolf
Cc: '[EMAIL PROTECTED]'
Subject: Re: there is a bug with UNIX command join

Robert Wolf wrote:
> $ join -t \012 -v 2 j1 j2
> j1 j2
> The output should be only the lines in j2 that do not exist in j1.

For one thing I am not convinced that the \012 will be doing what you think it will be doing here. Usually you need to handle quoted characters like that specially with the shell. Something like this.

NL=$(printf \n)
join -t $NL -v 2 j1 j2

> Essentially I have two sorted files, and I just want the lines from
> the 2nd file that are not in the 1st file.

Hmm... Perhaps you are really looking for 'comm -13 j1 j2' here?

Bob

___
Bug-textutils mailing list
[EMAIL PROTECTED]
http://mail.gnu.org/mailman/listinfo/bug-textutils
Re: there is a bug with UNIX command join
Robert Wolf wrote:
> Thanks for replying so quickly.

Thanks for submitting your bug report.

> I tried
>
> $ join -t \012 -v 2 j1 j2
> $ join -t '\012' -v 2 j1 j2
> $ join -t \012 -v 2 j1 j2
>
> All three versions are doing the same wrong thing, they are including
> the 'eee' line, which is the last line of the file j1 and a middle
> line of j2, when it should not include this 'eee' line.

I don't know what is in your data files. You did not share those with us. If you have a small, emphasis on small, test case perhaps you could share it with the list? I tried creating an example which would illustrate your problem but could not reproduce any trouble. I only rarely use join myself and then only in the most basic of ways. Therefore I am sorry but I am unable to guess further at what might be your trouble.

> I also tried
>
> $ comm -13 j1 j2
>
> However it does the same wrong thing as the previous three, again
> including the 'eee' line. I also suspect it might have something to
> do with the 'eee' line being the last line of the first file.

Since there are two different programs which are both behaving the same, I hazard a guess that it is probably a misunderstanding of their behavior and not an actual bug.

> I tried one thing, I created two other test files with much shorter
> lines and tried all four commands.
>
> $ join -t \012 -v 2 s1 s2
> $ join -t '\012' -v 2 s1 s2
> $ join -t \012 -v 2 s1 s2
> $ comm -13 s1 s2
>
> And guess what? They all worked! I believe the problem is the length
> of the lines in the files j1 and j2 which are almost 300 characters
> is too long for these commands to handle. It would be nice if these
> commands, i.e. join and comm, could handle much longer lines, say a
> default of 4096 characters, and a new option to specify a larger line
> size say up to 65535 characters. Another question is can these
> programs handle files that are 30 MB in size and long lines?

The join command should not have small limitations such as you are describing. At least not intentionally. Internally they malloc memory and should be able to handle very long lines. You may not be aware, but it is a GNU standards guideline that programs not have arbitrary limits such as this. As much as possible the coreutils follow that guideline. If there is a line length problem here then that would be a bug to be fixed in the program.

I am still not convinced yet, however. Especially since you report that 'comm' has the same result. I also browsed the code and could not locate anything that looked like a problem here.

Can you send a small test case that would allow us to recreate your problem?

Thanks
Bob
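For reference, a small test case of the kind requested above could look like the following sketch. The file names t1 and t2, their contents, and the 300-character padding are hypothetical stand-ins and are not the poster's j1 and j2; the layout only mirrors the description in the thread, with 'eee' as the last line of the first file and a middle line of the second.

```shell
# Hypothetical test data; t1 and t2 are stand-ins, not the poster's files.
tmp=$(mktemp -d)

# 300 zeros, used to make every line longer than 300 characters.
pad=$(printf '%0300d' 0)

# Both files are already in C-locale sort order, as comm requires.
printf '%s\n' "aaa$pad" "ccc$pad" "eee$pad" > "$tmp/t1"
printf '%s\n' "aaa$pad" "bbb$pad" "eee$pad" "fff$pad" > "$tmp/t2"

# -1 suppresses lines unique to t1, -3 suppresses lines common to both,
# leaving only the lines that appear in t2 alone.
only_in_t2=$(LC_ALL=C comm -13 "$tmp/t1" "$tmp/t2")

# Show just the first three characters of each surviving line.
printf '%s\n' "$only_in_t2" | cut -c1-3

rm -rf "$tmp"
```

If the 'eee' line shows up in output like this, the two input files themselves would be worth attaching to the report, since the result depends on the lines being byte-for-byte identical.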
there is a bug with UNIX command join
$ join -t \012 -v 2 j1 j2
j1 j2

The output should be only the lines in j2 that do not exist in j1. Essentially I have two sorted files, and I just want the lines from the 2nd file that are not in the 1st file.

$ join --version
join (textutils) 2.0.21
Written by Mike Haertel.
Copyright (C) 2002 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Running Cygwin on Windows 2000.

Robert Wolf.vcf
Re: there is a bug with UNIX command join
Robert Wolf wrote:
> $ join -t \012 -v 2 j1 j2
> j1 j2
> The output should be only the lines in j2 that do not exist in j1.

For one thing I am not convinced that the \012 will be doing what you think it will be doing here. Usually you need to handle quoted characters like that specially with the shell. Something like this.

NL=$(printf \n)
join -t $NL -v 2 j1 j2

> Essentially I have two sorted files, and I just want the lines from
> the 2nd file that are not in the 1st file.

Hmm... Perhaps you are really looking for 'comm -13 j1 j2' here?

Bob
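A side note on the NL=$(printf \n) suggestion above: whether or not the argument was quoted in the original mail, the assignment would not leave a newline in NL. Unquoted, the shell passes a plain n to printf; quoted, command substitution strips the trailing newline and NL ends up empty. The sketch below shows one way to capture a real newline. The variable name EMPTY and the sentinel character x are illustrative choices, not something taken from the thread.

```shell
# Command substitution removes trailing newlines, so this variable is empty.
EMPTY=$(printf '\n')
printf 'length of EMPTY: %s\n' "${#EMPTY}"   # length is 0

# One portable workaround: append a sentinel character so the newline is
# no longer trailing, then strip the sentinel afterwards.
NL=$(printf '\nx')
NL=${NL%x}
printf 'length of NL: %s\n' "${#NL}"         # length is 1

# Quote the variable when inspecting it; printf '%s' adds nothing of its
# own, unlike echo, which appends a newline even when its argument is empty.
printf '%s' "$NL" | od -c
```

Because echo appends its own newline, piping an empty variable through echo into od still shows a single line feed, so that check cannot distinguish an empty NL from one that holds a newline.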