Hello

Thanks again for all the suggestions. The irony is that, for the datasets I'm using, fill=TRUE (as Ivan suggested in the first instance) seems to work fine. They're not particularly sophisticated datasets, and although I don't know what the extra Bs actually mean (the first of which, as Avi says, occurs quite late on), I don't really need to know. All I need is the date, time, station id and rainfall accumulation, and those are obvious once I've read the dataset in.

It has been interesting, and an education in itself, to see the takes of people who have a far deeper and wider understanding of R than I do.

Nick
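For anyone finding this thread later, here is a minimal sketch of the fill=TRUE approach on a made-up file. The values, the comma separator and the column names are invented for illustration and are not the real rainfall data. As far as I understand the read.table() documentation, the number of columns is guessed from the first five lines unless col.names is longer, so the sketch uses count.fields() to find the widest row first rather than relying on that guess.

```r
# Toy stand-in for the real file: comma-separated, with one late row
# carrying an extra trailing field (all values here are invented).
lines <- c(
  "2022-09-01,00:15,ST01,0.2",
  "2022-09-01,00:30,ST01,0.4",
  "2022-09-01,00:45,ST01,0.4",
  "2022-09-01,01:00,ST01,0.6",
  "2022-09-01,01:15,ST01,0.6",
  "2022-09-01,01:30,ST01,0.8,B"
)
path <- tempfile(fileext = ".csv")
writeLines(lines, path)

# First pass: count the fields on every line to find the widest row.
n_fields <- count.fields(path, sep = ",")
max_cols <- max(n_fields)

# Second pass: name max_cols columns up front and set fill = TRUE so
# that shorter rows are padded and every line maps to exactly one row.
rain <- read.table(path, sep = ",", header = FALSE, fill = TRUE,
                   col.names = paste0("V", seq_len(max_cols)),
                   stringsAsFactors = FALSE)

# Keep only the columns that are actually needed.
rain <- rain[, 1:4]
names(rain) <- c("date", "time", "station", "rainfall")
print(rain)
```

If the extra field turns out to matter after all, it is still available as the fifth column before the subsetting step.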
On Fri, 30 Sept 2022 at 20:16, <avi.e.gr...@gmail.com> wrote:

> Tim and others,
>
> A point to consider is that there are various algorithms in the
> functions used to read in formatted data into data.frame form and they
> vary. Some do a look-ahead of some size to determine things and if they
> find a column that LOOKS LIKE all integers for, say, the first thousand
> lines, they go and read in that column as integer. If the first
> floating point value is thousands of lines further along, things may go
> wrong.
>
> So asking for line/row 16 to have an extra 16th entry/column may work
> fine for an algorithm that looks ahead and concludes there are 16
> columns throughout. Yet a file where the first time a sixteenth entry
> is seen is at line/row 31,459 may well just set the algorithm to expect
> exactly 15 columns and then be surprised as noted above.
>
> I have stayed out of this discussion and others have supplied pretty
> much what I would have said. I also see the data as flawed and ask
> which rows are the valid ones. If a sixteenth column is allowed, it
> would be better if all other rows had an empty sixteenth column. If not
> allowed, none should have it.
>
> The approach I might take, again as others have noted, is to preprocess
> the data file using some form of stream editor such as AWK that
> automagically reads in a line at a time and parses lines into a
> collection of tokens based on what separates them, such as a comma. You
> can then either write out just the first 15 to the output stream if
> your choice is to ignore a spurious sixteenth, or write out all sixteen
> for every line, with the last being some form of null most of the time.
> And, of course, to be more general, you could make two passes through
> the file with the first one determining the maximum number of entries
> as well as what the most common number of entries is, and a second pass
> using that info to normalize the file the way you want.
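A minimal sketch of the two-pass idea just described, written in R rather than AWK. The helper name normalise_fields(), the comma separator and the toy lines are assumptions made for illustration. It also assumes that no field contains a quoted separator and that the whole file fits in memory.

```r
# Hypothetical helper: pad or truncate every line to the same number of
# fields. Assumes a simple separator with no quoted/embedded separators.
normalise_fields <- function(lines, sep = ",", n_keep = NULL) {
  # Pass 1: split each line into tokens and tabulate the field counts.
  # Note that strsplit() drops a trailing empty field, so rows ending in
  # the separator are counted one short; the padding below restores them.
  tokens <- strsplit(lines, sep, fixed = TRUE)
  counts <- lengths(tokens)
  if (is.null(n_keep)) n_keep <- max(counts)

  # Pass 2: truncate long rows and pad short rows with empty fields.
  fixed <- vapply(tokens, function(tok) {
    tok <- tok[seq_len(min(length(tok), n_keep))]
    tok <- c(tok, rep("", n_keep - length(tok)))
    paste(tok, collapse = sep)
  }, character(1), USE.NAMES = FALSE)

  list(lines = fixed,
       max_fields = max(counts),
       most_common = as.integer(names(which.max(table(counts)))))
}

# Invented example: three fields on most rows, a spurious fourth on one.
raw <- c("a,1,x", "b,2,y,B", "c,3,z")

padded  <- normalise_fields(raw)              # keep all four, pad the rest
trimmed <- normalise_fields(raw, n_keep = 3)  # drop the spurious fourth

# Every line now has the same number of fields, so read.table() no
# longer has to guess the column count.
dat <- read.table(text = padded$lines, sep = ",", header = FALSE,
                  stringsAsFactors = FALSE)
print(dat)
```

For a real file, the lines would come from readLines() and the normalised lines could be written back out with writeLines() before reading them in the usual way.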
> And note some of what was mentioned could often be done in this
> preprocessing, such as removing any columns you do not want to read
> into R later. Do note such filters may need to handle edge cases like
> skipping comment lines or treating the row of headers differently.
>
> As some have shown, you can create your own filters within a language
> like R too and either read in lines and pre-process them as discussed
> or continue on to making your own data.frame and skip the read.table()
> type of functionality. For very large files, though, having multiple
> variations in memory at once may be an issue, especially if they are
> not removed and further processing and analysis continues.
>
> Perhaps it might be sensible to contact those maintaining the data and
> point out the anomaly and ask if their files might be saved alternately
> in a format that can be used without anomalies.
>
> Avi
>
> -----Original Message-----
> From: R-help <r-help-boun...@r-project.org> On Behalf Of Ebert,Timothy Aaron
> Sent: Friday, September 30, 2022 7:27 AM
> To: Richard O'Keefe <rao...@gmail.com>; Nick Wray <nickmw...@gmail.com>
> Cc: r-help@r-project.org
> Subject: Re: [R] Reading very large text files into R
>
> Hi Nick,
> Can you post one line of data with 15 entries followed by the next line
> of data with 16 entries?
>
> Tim
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.