bug#22001: Is it possible to tab separate concatenated files?
Hi,

On Thu, Nov 26, 2015 at 08:28:13PM -0700, Eric Blake wrote:
> On 11/26/2015 04:52 PM, Linda Walsh wrote:
>
> >> Because every plain
> >> text line in a file must be terminated with a newline.
> >
> > That's only a recent POSIX definition. It's not related to
> > real life. When I looked for a text file definition on google, nothing
> > was mentioned about needing a newline on the last line -- except on
> > 1 site -- and that site was clearly not talking about 'text' files, but
> > Unix-text-record files w/each record terminated by a NL char.
>
> Quit spreading FUD about POSIX. That definition of text file is NOT a
> recent invention; even back in POSIX 2001 the definition read:
>
> 3.392 Text File
>
> A file that contains characters organized into one or more lines. The
> lines do not contain NUL characters and none can exceed {LINE_MAX} bytes
> in length, including the <newline>. Although IEEE Std 1003.1-2001 does
> not distinguish between text files and binary files (see the ISO C
> standard), many utilities only produce predictable or meaningful output
> when operating on text files. The standard utilities that have such
> restrictions always specify "text files" in their STDIN or INPUT FILES
> sections.
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

At least the definition of a "line" is needed as well to understand the
above (from the same URL):

3.205 Line

A sequence of zero or more non-<newline>s plus a terminating <newline>.

[...]
> No, it has ALWAYS been a problem. Even 40 years ago, before POSIX was
> invented, the only PORTABLE way to use programs like sed was to use it
> on text files [...]

The sed of Solaris 10 ignores trailing text after the last line, that is
after the last newline. I am quite sure this behavior has been in older
Solaris and SunOS versions as well.

Best regards,
Erik
--
http://www.unix-ag.uni-kl.de/~auerswal/
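The effect of the "line" definition quoted above can be seen with standard tools. What follows is only a small sketch, not a statement about how Solaris sed behaves: the helper names count_lines and read_lines are made up for illustration. It shows two common cases where text after the last newline is not treated as a line.

```shell
# count_lines: report how many newline-terminated lines stdin holds.
# wc -l counts <newline> characters, so a final unterminated fragment
# is not counted. tr strips the padding some wc implementations add.
count_lines() {
    wc -l | tr -d ' '
}

# read_lines: echo each line that the shell's read builtin accepts.
# read returns non-zero when it reaches end-of-file before a newline,
# so the loop body never runs for an unterminated last fragment.
read_lines() {
    while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}

printf 'a\nb'   | count_lines   # prints 1
printf 'a\nb\n' | count_lines   # prints 2
printf 'a\nb'   | read_lines    # prints only "a"
```

In each case the trailing "b" is still in the byte stream. The tool just does not count or process it as a line.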
bug#22001: Is it possible to tab separate concatenated files?
Bob Proulx wrote:
> That example shows a completely different problem. It shows that your
> input plain text files have no terminating newline, making them
> officially[/sic/] not plain text files but binary files. Because every
> plain text line in a file must be terminated with a newline.

That's only a recent POSIX definition. It's not related to real life. When I looked for a text file definition on google, nothing was mentioned about needing a newline on the last line -- except on 1 site -- and that site was clearly not talking about 'text' files, but Unix-text-record files w/each record terminated by a NL char.

On a mac, txt files have records separated by 'CR', and on DOS/Win, txt files have txt records separated by CRLF.

Wikipedia quotes the Unicode definition of txt files -- which doesn't require the POSIX txt-record definition.

Also POSIX limits txt format to 'LINE_MAX' bytes -- notice it says 'bytes' and not characters. Yet a Unicode line of 256 characters can easily exceed 1024 bytes. Yet never in the history of the English language have lines been restricted to some number of bytes or characters.

But one could note that the POSIX definition ONLY refers to files -- not streams of TEXT (whatever the character set).

Specifically, note that 'TEXT COLUMNS' are described as measured in column widths -- yet that conflicts with the definition of Text File, in that text files use 'bytes' for a maximum line length, while text columns use 'characters' (which can be 1-4 bytes in Unicode, UTF-8 or UTF-16 encoded).

Of specific note -- "text" composed of characters MUST support 'NUL' (as well as 'the audio bell' (control-g), the backspace (control-h), vertical tabs (U+000B), and form-feed (U+000C)). No standard definition outside POSIX includes any of those characters -- because text characters are supposed to be readable and visible.
But POSIX compatibility claims that the Portable Character Set (http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap06.html#tag_06_01) must include those characters.

The 'text-files-must-have-NL' group ignores the POSIX 2008 definition of a portable character set -- but gloms onto the implied definition of a text line as part of a 'text file'. But as already noted, POSIX has conflicting definitions about what text is (Unicode, measured in chars/columns, or ASCII, measured in bytes). But POSIX 2008 (same url as above) clearly states:

    A null character, NUL, which has all bits set to zero, shall be in the set of [supported] characters.

In all plain-text definitions, it is mentioned that 'text' is a set of displayable characters that can be broken into lines with the text-line separator definition. The last line of the file Needs No separation character at the end of the line as it doesn't need to be separated from anything.

The GNU standard should not limit itself to an *arcane* (and not well known outside of POSIX-fans) definition of text, as it makes text files created before 2008 potentially incompatible.

POSIX was supposed to be about portability... it certainly doesn't follow the internet-design-meme of "Accept input liberally, and generate output conservatively."

> If they are not then it isn't a text line. Must be binary.
---
Whereas I maintain that Newlines are required to break plain-text into records -- but not at the end-of-file, since there is no record following.

> Why isn't there a newline at the end of the file? Fix that and all of
> your problems and many others go away.
---
Didn't used to be a requirement -- it was added because of a broken interpretation of the POSIX standard. Please remember that a posixified definition of 'X' (for any X), may not be the same as a real-life 'X'.

In this case, we have a file containing *text* by the POSIX def, which you claim doesn't meet the POSIX definition of "text file".
It's similar to Orwellian-speak -- redefining common terms to mean something else, so people don't notice the requirement change, then later telling others to clean up their old input code/data that doesn't meet the newly created definition.

Text files have been around a lot longer than 8 years. POSIX disqualifies most text files, for example, those created on the most widely used laptop/desktop/commercial computer OS in the world (Windows).

I think what may be true is that 'POSIX text files' describe a data format that may not be how it is stored on disk.

I find it very interesting how 'NUL' is defined to be part of any POSIX text character set definition where such apps claim to support or process 'text'.

It's sad to see the GNU utils becoming less flexible and more restricted over time -- much like the trend in computers to steer the public away from general purpose processing (and computers that can do such), to a tightly controlled, walled garden where consumers are only allowed to do what the manufacturer tells them to do.
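For readers who want to check the byte-versus-character point about LINE_MAX on their own system, here is a minimal sketch. It assumes getconf is installed. The helper name byte_len is made up for illustration, and the sample character is written as octal escapes so the result does not depend on the terminal encoding.

```shell
# Query the system's LINE_MAX, a limit expressed in bytes.
# POSIX requires it to be at least 2048.
getconf LINE_MAX 2>/dev/null || echo "getconf not available"

# byte_len: number of bytes on stdin (padding from wc removed).
byte_len() {
    wc -c | tr -d ' '
}

# "e with acute accent" in UTF-8 is the two bytes 0xC3 0xA9
# (octal 303 251): one character, two bytes.
printf '\303\251' | byte_len    # prints 2
```

wc -m counts characters instead of bytes, and its answer for the same input depends on the current locale.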
bug#22001: Is it possible to tab separate concatenated files?
On 11/26/2015 04:52 PM, Linda Walsh wrote:

>> Because every plain
>> text line in a file must be terminated with a newline.
>
> That's only a recent POSIX definition. It's not related to
> real life. When I looked for a text file definition on google, nothing
> was mentioned about needing a newline on the last line -- except on
> 1 site -- and that site was clearly not talking about 'text' files, but
> Unix-text-record files w/each record terminated by a NL char.

Quit spreading FUD about POSIX. That definition of text file is NOT a recent invention; even back in POSIX 2001 the definition read:

3.392 Text File

A file that contains characters organized into one or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline>. Although IEEE Std 1003.1-2001 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap03.html

That was POSIX Issue 6; the more recent POSIX Issue 7 corrected the definition to also allow a completely empty file to be considered as a text file. But the point is that POSIX has always required a text file to end in a newline.

> On a mac, txt files have records separated by 'CR', and on DOS/Win,
> txt files have txt records separated by CRLF.

And those systems aren't POSIX. So they aren't relevant to a discussion about POSIX.

>> Why isn't there a newline at the end of the file? Fix that and all of
>> your problems and many others go away.
>
> ---
> Didn't used to be a requirement -- it was added because of a broken
> interpretation of the posix standard. Please remember that a posixified
> definition of 'X' (for any X), may not be the same as a real-life 'X'.

No, it has ALWAYS been a problem.
Even 40 years ago, before POSIX was invented, the only PORTABLE way to use programs like sed was to use it on text files - namely, files where no line exceeded LINE_MAX bytes, where no lines contained NUL bytes, and where ALL lines ended in newline. Because there were vendor implementations of sed (not GNU coreutils, mind you, but other vendors) that really were hardcoded to some rather small limits, and understandably so in a day when computers did not have as much memory as they do today. POSIX just standardized existing practice on what formed a text file, when it came to existing Unix systems at that time.

--
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
bug#22001: Is it possible to tab separate concatenated files?
Thanks Assaf,

Sorry for the confusion - I wanted to add a tab (or even a new line) after each file that was concatenated. Actually a new line may be better.

For example, concatenate the files like so:

>gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>gi|452742846|ref|NZ_CAFD01002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
>gi|452742846|ref|NZ_CAFD01003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG

Right now, just using cat, they look like:

>gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT>gi|452742846|ref|NZ_CAFD01002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC>gi|452742846|ref|NZ_CAFD01003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG

Kim

-----Original Message-----
From: Assaf Gordon [mailto:assafgor...@gmail.com]
Sent: November 23, 2015 2:03 PM
To: Macdonald, Kim - BCCDC; 22...@debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?

tag 22001 notabug
close 22001
stop

Hello Kim,

On 11/23/2015 03:50 PM, Macdonald, Kim - BCCDC wrote:
> I'm just looking at the options for the cat command - I see there's a
> way to ignore tabs when they exist - but is there a way to tab
> separate the files you're concatenating with the cat command?

It is unclear (to me) what you're trying to achieve - could you provide a bit more details (perhaps a short example)?

If you have a file (one file) with spaces and you wish to convert them to tabs, consider the 'expand' command (then pipe to 'cat' if needed).
If you have multiple files and you wish to print them side-by-side, separated by tabs (as opposed to one-after-the-other, as with 'cat'), consider using 'paste':

    $ cat 1.txt
    a
    b
    c
    d
    $ cat 2.txt
    1
    2
    3
    4
    $ cat 3.txt
    w
    x
    y
    z
    $ paste 1.txt 2.txt 3.txt
    a       1       w
    b       2       x
    c       3       y
    d       4       z

regards,
 - assaf
bug#22001: Is it possible to tab separate concatenated files?
Correcting myself:

On 11/23/2015 05:02 PM, Assaf Gordon wrote:
> If you have a file (one file) with spaces and you wish to convert them
> to tabs, consider the 'expand' command (then pipe to 'cat' if needed).

"unexpand" will convert spaces to tabs,
"expand" will convert tabs to spaces.
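The direction of each tool can be checked with a quick sketch. The sample inputs are made up for illustration. The -t option used here sets the tab-stop width for expand.

```shell
# expand: tabs -> spaces. With tab stops every 4 columns, the tab after
# "a" (column 0) advances to column 4, so it becomes three spaces.
printf 'a\tb\n' | expand -t 4

# unexpand: spaces -> tabs. By default only leading blanks are converted,
# using tab stops every 8 columns, so eight leading spaces become one tab.
printf '        x\n' | unexpand
```

To convert runs of spaces that are not at the start of a line, unexpand has the -a option.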
bug#22001: Is it possible to tab separate concatenated files?
Macdonald, Kim - BCCDC wrote:
> Sorry for the confusion - I wanted to add a tab (or even a new line)
> after each file that was concatenated. Actually a new line may be
> better.
>
> For Example:
> Concatenate the files like so:
> >gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
> >gi|452742846|ref|NZ_CAFD01002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
> >gi|452742846|ref|NZ_CAFD01003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
>
> Right now - Just using cat, they look like:
> >gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT>gi|452742846|ref|NZ_CAFD01002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC>gi|452742846|ref|NZ_CAFD01003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG

That example shows a completely different problem. It shows that your input plain text files have no terminating newline, making them officially not plain text files but binary files. Because every plain text line in a file must be terminated with a newline. If they are not then it isn't a text line. Must be binary.

Why isn't there a newline at the end of the file? Fix that and all of your problems and many others go away.

Getting ahead of things 1...

If you just can't fix the lack of a newline at the end of those files then you must handle it explicitly.

  for f in *.txt; do
    cat "$f"
    echo
  done

Getting ahead of things 2...

Sometimes people just want a separator between files. Actually 'tail' will already do this rather well.

  tail -n+0 *.txt
  ==> 1.txt <==
  foo

  ==> 2.txt <==
  bar

Bob
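The loop above adds a newline after every file, including files that already end with one. A possible variant, given only as a sketch, adds the newline just where it is missing. The function name cat_with_newlines is made up for illustration, and the check assumes the files contain no NUL bytes.

```shell
# cat_with_newlines FILE...
# Concatenate files, appending a newline only after files whose last
# byte is not already a newline.
cat_with_newlines() {
    for f in "$@"; do
        cat -- "$f"
        # tail -c 1 prints the final byte. Command substitution strips a
        # trailing newline, so the result is non-empty only when the file
        # ends in some other character. Empty files produce nothing.
        if [ -n "$(tail -c 1 -- "$f")" ]; then
            echo
        fi
    done
}
```

It can be called as `cat_with_newlines *.txt > combined.txt`. The input files are left unchanged.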
bug#22001: Is it possible to tab separate concatenated files?
tag 22001 notabug
close 22001
stop

Hello Kim,

On 11/23/2015 03:50 PM, Macdonald, Kim - BCCDC wrote:
> I’m just looking at the options for the cat command – I see there’s a
> way to ignore tabs when they exist – but is there a way to tab
> separate the files you’re concatenating with the cat command?

It is unclear (to me) what you're trying to achieve - could you provide a bit more details (perhaps a short example)?

If you have a file (one file) with spaces and you wish to convert them to tabs, consider the 'expand' command (then pipe to 'cat' if needed).

If you have multiple files and you wish to print them side-by-side, separated by tabs (as opposed to one-after-the-other, as with 'cat'), consider using 'paste':

    $ cat 1.txt
    a
    b
    c
    d
    $ cat 2.txt
    1
    2
    3
    4
    $ cat 3.txt
    w
    x
    y
    z
    $ paste 1.txt 2.txt 3.txt
    a       1       w
    b       2       x
    c       3       y
    d       4       z

regards,
 - assaf
bug#22001: Is it possible to tab separate concatenated files?
Hello Kim,

On 11/23/2015 06:09 PM, Bob Proulx wrote:
> Macdonald, Kim - BCCDC wrote:
>> For Example:
>> Concatenate the files like so:
>> gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequenceTTTCAGCATATATATAGGCCATCATACATAGCCATATAT
>> gi|452742846|ref|NZ_CAFD01002.1| Salmonella enterica subsp., whole genome shotgun sequenceCATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC
>> gi|452742846|ref|NZ_CAFD01003.1| Salmonella enterica subsp., whole genome shotgun sequenceTATATAGATACATATATCGCGATATCAGACTGCATAGCGTCAG
>
> That example shows a completely different problem. It shows that your
> input plain text files have no terminating newline, making them
> officially not plain text files but binary files.

Based on the content of your files, I'm guessing that you are working with a mangled FASTA file. In that case, it is possible that fixing the original files might be more efficient than trying to amend them later on.

The original FASTA files likely looked like so:

    >gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequence
    TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

And I'm also guessing that with some script you've removed the ">" prefix and joined the two lines into one.

First,
I suggest ensuring the original files have unix-style new-lines (LF) and not windows style (CR-LF) or Mac-style (CR). The programs 'dos2unix' and 'mac2unix' would be able to fix it. Simply run the programs on each file, they will fix it in place. I would also recommend ensuring each file does end with a newline.

Second,
The FASTA id (the long text before your nucleotide sequence) contains spaces, and this will make downstream processing a bit of a pain. I would recommend trimming the FASTA identifier and keeping only the first part (since it contains your IDs, you should have no problem recovering the organism name later).
Example:

    $ cat 1.fa
    >gi|452742846|ref|NZ_CAFD01001.1| Salmonella enterica subsp., whole genome shotgun sequence
    TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

    $ sed '/^>/s/ .*$//' 1.fa > 2.fa

    $ cat 2.fa
    >gi|452742846|ref|NZ_CAFD01001.1|
    TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

Or do it in place for all your FA files (be sure to have a backup, though):

    for i in *.fa ; do sed -i '/^>/s/ .*$//' "$i" ; done

Third,
To combine and convert the files into a table (i.e. 1st column=ID, 2nd column=sequence), then, assuming all your sequences are short and contained on one line, the following would work:

    $ cat 2.fa
    >gi|452742846|ref|NZ_CAFD01001.1|
    TTTCAGCATATATATAGGCCATCATACATAGCCATATAT

    $ cat 3.fa
    >gi|452742846|ref|NZ_CAFD01002.1|
    CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

    $ cat *.fa | paste - - | sed 's/^>//' > final.txt

    $ cat final.txt
    gi|452742846|ref|NZ_CAFD01001.1|        TTTCAGCATATATATAGGCCATCATACATAGCCATATAT
    gi|452742846|ref|NZ_CAFD01002.1|        CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTACGTCGACTGACGTC

The 'final.txt' will be an easy-to-work-with tabular file.

Fourth,
If your FASTA files contain multi-lined long sequences, like so:

    >gi|452742846|ref|NZ_CAFD01002.1|
    CATAGCCATATATACTAGCTGACTGACGTCGCAGCTGGTCAGACTGACGTAC
    GTCGACTGACGTCTGTACACCACACGTTGTGACGAGCATCGACTAGCATCAG
    TTGAGCGACATCATCAGCGACGAGATCACGAGCACTAGCACTACGACTACGA

You might consider using a specialized tool to convert them to a table, such as:
http://manpages.ubuntu.com/manpages/trusty/man1/fasta_formatter.1.html (*)
or
http://kirill-kryukov.com/study/tools/fasta-formatter/ .

Hope this helps,
 - assaf

(* shameless plug: I wrote fasta_formatter long ago)
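Where installing a specialized tool is not an option, the multi-line case can also be roughly handled with awk. This is only a sketch and not a replacement for the tools linked above. The function name fasta_to_table is made up for illustration. It assumes LF line endings and that the ID is the first space-delimited word of each header line.

```shell
# fasta_to_table FILE...
# Print one "ID<TAB>sequence" row per FASTA record, joining sequence
# text that is wrapped across several lines.
fasta_to_table() {
    awk '
        /^>/ {                          # a header line starts a new record
            if (id != "") print id "\t" seq
            id = substr($1, 2)          # first word, without the leading ">"
            seq = ""
            next
        }
        { seq = seq $0 }                # append each sequence line
        END { if (id != "") print id "\t" seq }
    ' "$@"
}
```

A call such as `fasta_to_table *.fa > final.txt` would give the same two-column layout as the paste pipeline above. It works whether each sequence is on one line or wrapped across several.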
bug#22001: Is it possible to tab separate concatenated files?
Thanks so much!!! I'll try these out now

Kim

-----Original Message-----
From: Assaf Gordon [mailto:assafgor...@gmail.com]
Sent: November 23, 2015 3:48 PM
To: Bob Proulx; Macdonald, Kim - BCCDC
Cc: 22...@debbugs.gnu.org
Subject: Re: bug#22001: Is it possible to tab separate concatenated files?
bug#22001: Is it possible to tab separate concatenated files?
Hi! I'm just looking at the options for the cat command - I see there's a way to ignore tabs when they exist - but is there a way to tab separate the files you're concatenating with the cat command? Thanks, Kim