Re: [R] R tools for large files

Murray Jorgensen Tue, 26 Aug 2003 08:53:43 +0000

Hi Martin,

I don't know much about the concept of "connection" but I had supposed it to at least include the concept of "file" and perhaps also "input device" and "output device". I guess the important point that you are making is that it is sequential in the sense that you describe. I suppose at the time that I wrote my emails I didn't *know* that this was the case but rather assumed that it must be so, since it would be tedious in the extreme to work with the access functions if they kept going back to the beginning of the connection.

It may help to explain the application. The large files that I am working with are themselves statistical summaries of internet traffic flows (you will appreciate why they can be almost arbitrarily large!). I am interested in clustering these flows into different classes of traffic. I am using a model-based approach, so that the end-point will be statistical models for each cluster. Once these have been estimated they may be used in the classification of future traffic [including a residual class of traffic that does not fit any cluster well].

Based on experience with my clustering software (Multimix) I believe that it should work well on data sets of, say, 3000 observations. I plan to select a small number of random subsets of this size. The replication of these subsets should help me with model selection questions (How many clusters? How complex should each cluster model be?).
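
A minimal sketch of how the line numbers for such subsets might be drawn, assuming one observation per line and no header line. The function names count_lines and draw_subsets, the chunk size, and the default of 5 subsets are hypothetical choices for illustration, not anything fixed by the application:

```r
# Count the lines of a file by reading it sequentially in chunks,
# so that the whole file never has to be held in memory at once.
count_lines <- function(path, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0  # kept as a double so very large files do not overflow an integer
  repeat {
    k <- length(readLines(con, n = chunk_size))
    if (k == 0L) break
    n <- n + k
  }
  n
}

# Draw several random subsets of line numbers, each sampled without
# replacement and sorted so the lines can later be picked up in one pass.
# Requires subset_size <= n_lines.
draw_subsets <- function(n_lines, subset_size = 3000, n_subsets = 5) {
  replicate(n_subsets,
            sort(sample.int(n_lines, subset_size)),
            simplify = FALSE)
}
```

Each element of the returned list is one sorted vector of line numbers. The subsets are drawn independently of each other, so they may overlap.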

Tom Mulholland makes a good point when he notes that many R users (and other users) have very little control over their computing environment owing to somewhat arbitrary IT management decisions. For this reason it will be advantageous to have several solutions to large file problems.

I'm pleased that you think that efficient R functions for manipulating numbered lines from files may be written. I'm going to have a go at it just as soon as I finish a big item of paperwork!
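
As a first rough attempt, here is a sketch of what such a function might look like. It relies only on the fact that readLines() on an open connection continues from where the previous call stopped. The name read_lines_at and the chunk size are my own hypothetical choices, and I have not tried it on really large files:

```r
# Return the lines with the given line numbers (1-based) as a character
# vector, in increasing order of line number, using a single sequential
# pass over the file. Line numbers beyond the end of the file are dropped.
read_lines_at <- function(path, line_nums, chunk_size = 10000L) {
  wanted <- sort(unique(line_nums))
  out <- character(length(wanted))
  con <- file(path, open = "r")
  on.exit(close(con))
  offset <- 0  # number of lines consumed so far (double, to avoid overflow)
  found <- 0   # number of wanted lines already stored in 'out'
  while (found < length(wanted)) {
    chunk <- readLines(con, n = chunk_size)
    if (length(chunk) == 0L) break  # end of file reached
    # Wanted line numbers that fall inside the chunk just read
    idx <- wanted[wanted > offset & wanted <= offset + length(chunk)]
    if (length(idx) > 0L) {
      out[found + seq_along(idx)] <- chunk[idx - offset]
      found <- found + length(idx)
    }
    offset <- offset + length(chunk)
  }
  out[seq_len(found)]
}
```

For example, read_lines_at("flows.txt", c(20, 3, 11)) would return lines 3, 11 and 20 of a (hypothetical) file flows.txt. Reading stops as soon as the last wanted line has been found, so the cost depends on the largest line number requested rather than on the size of the file.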

BTW, I will be out of town and with much reduced email access over the next week or so, so if I don't reply to the list or individuals this should not be put down to laziness or rudeness!

Cheers,

Murray Jorgensen

PS Give my regards to Chris Hennig.

Martin Maechler wrote:

Hi Murray,
from reading your summarizing reply, I wonder if you missed the
most important point about "connection"s (connection := generalization of file):
Once you open() one, you can read it **sequentially**, e.g., in bunches of a "few" lines, i.e., you don't re-start from the beginning each time. I think this will allow one to devise a pretty efficient R function for reading (and returning as a vector of strings) line numbers (n1, n2,..., nm).
Did you know this?  If not, maybe you forward this answer (and
your reaction to it) to R-help as well.
Regards,
Martin Maechler <[EMAIL PROTECTED]>       http://stat.ethz.ch/~maechler/
Seminar fuer Statistik, ETH-Zentrum  LEO C16    Leonhardstr. 27
ETH (Federal Inst. Technology)  8092 Zurich     SWITZERLAND
phone: x-41-1-632-3408          fax: ...-1228                   <><


--
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: [EMAIL PROTECTED]                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862

______________________________________________
[EMAIL PROTECTED] mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
