If you have enough memory for the intermediate results, you would have no problems with a file that large. You need an order of magnitude more memory for intermediate results than the raw data, though.
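For the whole-file route, here is a minimal sketch, assuming the data looks like the sample quoted below (a header row, LF line ends, ticker in the first column), that the target directory already exists, and that everything fits in memory. The name splitcsv and the paths in the usage line are made up for illustration; boxing every line is the memory-hungry part, so treat this as something to try on the 25-line sample first.

```j
NB. splitcsv (hypothetical name): x is the target directory, with a
NB. trailing path separator; y is the name of the master csv file.
NB. Writes one <ticker>.csv per ticker; result is the number of files.
splitcsv=: 4 : 0
  txt=. (1!:1 < y) -. CR                NB. whole file, CRs removed
  txt=. txt , LF -. {: txt              NB. make sure the text ends with LF
  lines=. }. <;._2 txt                  NB. box each line, drop the header row
  keys=. (i.&',' {. ])&.> lines         NB. text before the first comma
  rest=. (>:@(i.&',') }. ])&.> lines    NB. text after the first comma
  names=. ~. keys                       NB. distinct tickers, first-seen order
  groups=. keys </. rest                NB. boxed rows per ticker, same order
  for_j. i. # names do.
    out=. ; (,&LF)&.> > j { groups      NB. rejoin one ticker's rows with LF
    out 1!:2 < x , (> j { names) , '.csv'
  end.
  # names
)
```

Usage would be something like '/tmp/out/' splitcsv '/tmp/prices.csv'. Existing files with the same names are overwritten, since 1!:2 replaces the file contents.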
Me, if I was working with something that big, I'd probably break it into pieces first, textually, before trying to process it numerically. Like, maybe use the first column as a file name (discarding that column from the intermediate files -- or, better, replacing each acronym with an index value, and also removing the '-' characters from the date column, perhaps using a big string replace based on the month and year... something like ,2017-01- becomes ,201701 and so on...). Basically: putting each intermediate file in some target directory.

If you didn't have even a gigabyte of memory on your machine, you could use indexed reads. For example, 1!:11 -- see https://www.jsoftware.com/help/dictionary/dx001.htm -- with a starting offset of 0 and a length of 1e7, then find how many bytes you'd have to drop to get back to the last line feed ((#txt) - 1+ txt i: LF), drop those extra bytes, and make the next offset be the number of bytes you kept further into the file. Then iterate...

That said, EC2 instances go up to 3904 gigabytes of RAM, which would be more than adequate to plow through that much data, if you wanted to throw money at Amazon. A 64MB machine should be big enough, though, I expect.

Thanks,

-- 
Raul

On Tue, Mar 31, 2020 at 12:58 AM HH PackRat wrote:
>
> Finishing up with function #4......
>
> I have a very large file consisting of multiple sets of historical
> stock prices that I would like to split into individual files for each
> stock. (I'll probably first have to write out all the files to a USB
> flash drive [I have limited hard drive space, but it might work as a
> very tight fit] and then, when finished, burn them to a DVD-ROM for
> more permanent storage.) Since I thought that J was capable of
> handling very large files, I figured that this might be a challenge to
> try.
>
> Unfortunately, I don't know how to handle file reading where you might
> only be able to read a part of the file at a time.
> (I don't know how large a file J can read--maybe it can read the whole
> file.) This file has 14,937,606 lines and is 1.63 GB (1,759,801,721
> bytes) in size.
>
> Additionally (and probably most importantly), I don't know how to
> collect a subset of the contents of a file to output to a file, and
> then resume where J left off and collect the next subset of data to
> output, and so on.
>
> I'm going to need a LOT of help with this J programming!
>
> Below is a sample of the data--5 days' worth of data for 5 different
> stocks. The master file is a csv file, and the individual outputs (5
> in this case) should also be csv files. (Obviously, row 0 needs to be
> ignored.) The output files should use the ticker symbol as the name
> for each file (e.g., AA.csv). The ticker symbol (column 0) should be
> stripped off of each line of data, with only the remainder of each row
> (date onward being retained) being cumulated for output.
>
> Please correct me if I'm wrong, but my assumption is that if code
> works for these 25 lines of data, the code ought to work as well for
> 14,937,606 lines!
>
> DATA SET D:
> __________________________________________________
>
> ticker,date,open,high,low,close,volume
> AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0
> AA,2017-06-28,32.1,33.0,31.93,32.95,3764296.0
> AA,2017-06-29,33.11,33.34,32.61,33.18,3730077.0
> AA,2017-06-30,33.16,33.45,32.535,32.65,3014777.0
> AA,2017-07-03,32.94,34.3,32.915,34.02,3112086.0
> AAPL,2017-06-28,144.49,146.11,143.1601,145.83,21915939.0
> AAPL,2017-06-29,144.71,145.13,142.28,143.68,31116980.0
> AAPL,2017-06-30,144.45,144.96,143.78,144.02,22328979.0
> AAPL,2017-07-03,144.88,145.3001,143.1,143.5,14276812.0
> AAPL,2017-07-05,143.69,144.79,142.7237,144.09,20758795.0
> GE,2017-06-28,27.26,27.4,27.05,27.08,30759065.0
> GE,2017-06-29,27.16,27.41,26.79,27.02,36443559.0
> GE,2017-06-30,27.09,27.19,26.91,27.01,25849199.0
> GE,2017-07-03,27.16,27.59,27.06,27.45,20664966.0
> GE,2017-07-05,27.54,27.56,27.23,27.35,21082332.0
> IBM,2017-06-28,155.15,155.55,154.78,155.32,2203062.0
> IBM,2017-06-29,155.35,155.74,153.62,154.13,3245649.0
> IBM,2017-06-30,154.28,154.5,153.14,153.83,3501395.0
> IBM,2017-07-03,153.58,156.025,153.52,155.58,2822499.0
> IBM,2017-07-05,155.77,155.89,153.63,153.67,3558639.0
> T,2017-06-28,37.88,38.065,37.78,37.94,20312146.0
> T,2017-06-29,37.87,37.98,37.62,37.62,23508452.0
> T,2017-06-30,37.73,37.87,37.54,37.73,22303282.0
> T,2017-07-03,37.84,38.13,37.785,38.11,11123146.0
> T,2017-07-05,38.11,38.21,37.85,38.12,19644726.0
> __________________________________________________
>
> SUPER thanks in advance for any and all help with this one!
>
> Harvey
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
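
Going back to the indexed-read idea at the top of this reply, here is a rough sketch of how the iteration might look. It assumes every line is shorter than one chunk, that the target directory exists, and that it holds no ticker files from an earlier run (1!:3 appends, so leftovers would be extended). The names CHUNK, appendgroups and splitbig are made up for illustration, and I compute the number of bytes to keep rather than the number to drop, which comes to the same thing.

```j
CHUNK=: 10000000    NB. bytes per indexed read (the 1e7 mentioned above)

NB. appendgroups (hypothetical): y is LF-terminated text made of whole
NB. data lines; x is the target directory with a trailing separator.
NB. Appends each ticker's rows, minus the ticker column, to <ticker>.csv.
appendgroups=: 4 : 0
  lines=. <;._2 y
  keys=. (i.&',' {. ])&.> lines         NB. text before the first comma
  rest=. (>:@(i.&',') }. ])&.> lines    NB. text after the first comma
  names=. ~. keys
  groups=. keys </. rest
  for_j. i. # names do.
    (; (,&LF)&.> > j { groups) 1!:3 < x , (> j { names) , '.csv'
  end.
  # lines
)

NB. splitbig (hypothetical): x is the target directory, y the master file.
NB. Result is the number of bytes consumed (the file size when done).
splitbig=: 4 : 0
  size=. 1!:4 < y                       NB. file size in bytes
  pos=. 0
  while. pos < size do.
    txt=. 1!:11 (< y) , < pos , CHUNK <. size - pos
    keep=. (# txt) <. 1 + txt i: LF     NB. bytes up to and including the last LF
    blk=. (keep {. txt) -. CR
    blk=. blk , LF -. {: blk            NB. covers a final line with no LF
    if. pos = 0 do. blk=. (1 + blk i. LF) }. blk end.   NB. skip the header row
    x appendgroups blk
    pos=. pos + keep                    NB. next read starts after the last LF
  end.
  pos
)
```

Usage would be something like '/tmp/out/' splitbig '/tmp/prices.csv'. A ticker whose rows straddle two chunks is still handled, because the second chunk's rows are appended to the same file.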
