If I may, and having heard no reply to my defense of component files, I'd
suggest jfiles as a potential solution: each record can hold whatever span
you are comfortable with (a month, a year, or a decade). A rough sketch
follows, and a sketch of Raul's chunked-read recipe appears after the
quoted thread. Or am I missing something (as usual)?
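Something like this is what I have in mind; a minimal, untested sketch,
assuming the jfiles script's jcreate/jappend/jread verbs as documented in
the J component-files guide, with prices.ijf as a stand-in file name:

   require 'jfiles'

   jcreate 'prices.ijf'     NB. new component file (.ijf is the usual
                            NB. extension convention; check your install)

   NB. one component per ticker per month (or year, or decade); any J
   NB. array can be a component, e.g. a block of csv text
   aa =. 'AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0'
   aa jappend 'prices.ijf'  NB. should return the new component's index

   jread 'prices.ijf';0     NB. read component 0 back

Appending a month of rows at a time keeps each component small, and the
component file grows without ever needing the whole data set in memory.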
> On Mar 31, 2020, at 8:05 PM, Raul Miller <[email protected]> wrote:
>
> If you have enough memory for the intermediate results, you would have
> no problems with a file that large. You need an order of magnitude
> more memory for the intermediate results than for the raw data, though.
>
> Me, if I were working with something that big, I'd probably break it
> into pieces first, textually, before trying to process it numerically.
> Like, maybe use the first column as a file name (discarding that
> column from the intermediate files -- or, better, replacing each
> acronym with an index value, and also removing the '-' characters
> from the date column, perhaps using a big string replace based on the
> month and year... something like ,2017-01- becomes ,201701 and so
> on...). Basically: putting each intermediate file in some target
> directory.
>
> If you didn't have even a gigabyte of memory on your machine, you
> could use indexed reads. For example, 1!:11 -- see
> https://www.jsoftware.com/help/dictionary/dx001.htm -- with a starting
> offset of 0 and a length of 1e7, then find how many bytes you'd have
> to drop to get back to the last line feed ((#txt) - 1 + txt i: LF),
> drop those extra bytes, and make the next offset begin that many
> bytes further into the file. Then iterate...
>
> That said, ec2 instances go up to 3904 gigabytes of RAM, which would
> be more than adequate to plow through that much data, if you wanted
> to throw money at Amazon. A 64MB machine should be big enough for the
> indexed-read approach, though, I expect.
>
> Thanks,
>
> --
> Raul
>
>> On Tue, Mar 31, 2020 at 12:58 AM HH PackRat <[email protected]> wrote:
>>
>> Finishing up with function #4......
>>
>> I have a very large file consisting of multiple sets of historical
>> stock prices that I would like to split into individual files for
>> each stock. (I'll probably first have to write all the files out to
>> a USB flash drive [I have limited hard drive space, but it might
>> work as a very tight fit] and then, when finished, burn them to a
>> DVD-ROM for more permanent storage.) Since I thought that J was
>> capable of handling very large files, I figured this might be a good
>> challenge to try.
>>
>> Unfortunately, I don't know how to handle file reading where you
>> might only be able to read part of the file at a time. (I don't know
>> how large a file J can read -- maybe it can read the whole file.)
>> This file has 14,937,606 lines and is 1.63 GB (1,759,801,721 bytes)
>> in size.
>>
>> Additionally (and probably most importantly), I don't know how to
>> collect a subset of the contents of the file, output it to a file,
>> then resume where J left off and collect the next subset of data for
>> output, and so on.
>>
>> I'm going to need a LOT of help with this J programming!
>>
>> Below is a sample of the data: 5 days' worth of prices for 5
>> different stocks. The master file is a csv file, and the individual
>> outputs (5 in this case) should also be csv files. (Obviously, row 0
>> needs to be ignored.) The output files should use the ticker symbol
>> as the name for each file (e.g., AA.csv). The ticker symbol (column
>> 0) should be stripped off each line of data, with only the remainder
>> of each row (date onward) being retained and accumulated for output.
>>
>> Please correct me if I'm wrong, but my assumption is that if code
>> works for these 25 lines of data, it ought to work just as well for
>> all 14,937,606 lines!
>>
>> DATA SET D:
>> __________________________________________________
>>
>> ticker,date,open,high,low,close,volume
>> AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0
>> AA,2017-06-28,32.1,33.0,31.93,32.95,3764296.0
>> AA,2017-06-29,33.11,33.34,32.61,33.18,3730077.0
>> AA,2017-06-30,33.16,33.45,32.535,32.65,3014777.0
>> AA,2017-07-03,32.94,34.3,32.915,34.02,3112086.0
>> AAPL,2017-06-28,144.49,146.11,143.1601,145.83,21915939.0
>> AAPL,2017-06-29,144.71,145.13,142.28,143.68,31116980.0
>> AAPL,2017-06-30,144.45,144.96,143.78,144.02,22328979.0
>> AAPL,2017-07-03,144.88,145.3001,143.1,143.5,14276812.0
>> AAPL,2017-07-05,143.69,144.79,142.7237,144.09,20758795.0
>> GE,2017-06-28,27.26,27.4,27.05,27.08,30759065.0
>> GE,2017-06-29,27.16,27.41,26.79,27.02,36443559.0
>> GE,2017-06-30,27.09,27.19,26.91,27.01,25849199.0
>> GE,2017-07-03,27.16,27.59,27.06,27.45,20664966.0
>> GE,2017-07-05,27.54,27.56,27.23,27.35,21082332.0
>> IBM,2017-06-28,155.15,155.55,154.78,155.32,2203062.0
>> IBM,2017-06-29,155.35,155.74,153.62,154.13,3245649.0
>> IBM,2017-06-30,154.28,154.5,153.14,153.83,3501395.0
>> IBM,2017-07-03,153.58,156.025,153.52,155.58,2822499.0
>> IBM,2017-07-05,155.77,155.89,153.63,153.67,3558639.0
>> T,2017-06-28,37.88,38.065,37.78,37.94,20312146.0
>> T,2017-06-29,37.87,37.98,37.62,37.62,23508452.0
>> T,2017-06-30,37.73,37.87,37.54,37.73,22303282.0
>> T,2017-07-03,37.84,38.13,37.785,38.11,11123146.0
>> T,2017-07-05,38.11,38.21,37.85,38.12,19644726.0
>> __________________________________________________
>>
>> SUPER thanks in advance for any and all help with this one!
>>
>> Harvey
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
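P.S. For the archive, here is a rough, untested sketch of Raul's
chunked-read-and-split recipe, in case it helps Harvey get started. The
file name master.csv, the out/ directory, and the 1e7-byte chunk size
are stand-ins; fsize/fappend come from the files script, joinstring from
strings, and 1!:11 is the indexed read described at the dx001 page
above. It assumes the file ends with a line feed; the header row will
land in out/ticker.csv, which can simply be deleted afterwards.

   require 'files strings'        NB. fsize, fappend, joinstring

   SRC   =: 'master.csv'          NB. stand-in name for the big file
   OUT   =: 'out/'                NB. stand-in target directory
   SIZE  =: fsize SRC             NB. total bytes in the source
   CHUNK =: 1e7                   NB. about 10MB per indexed read

   NB. split one block of complete LF-terminated lines by ticker and
   NB. append each group, ticker column dropped, to OUT,ticker,'.csv'
   emit =: 3 : 0
     lines =. <;._2 y                    NB. box each line (LF dropped)
     tick  =. ({.~ i.&',')&.> lines      NB. column 0: the ticker
     rest  =. (}.~ >:@i.&',')&.> lines   NB. text after the first comma
     for_t. ~. tick do.
       rows =. (t = tick) # rest
       (LF ,~ LF joinstring rows) fappend OUT , (>t) , '.csv'
     end.
     i.0 0
   )

   NB. iterate over the file: read CHUNK bytes at offset, keep only
   NB. the complete lines (up to the last LF), and carry the partial
   NB. tail line into the next read by restarting the offset there
   run =: 3 : 0
     offset =. 0
     while. offset < SIZE do.
       txt  =. 1!:11 SRC ; offset , CHUNK <. SIZE - offset
       keep =. >: txt i: LF              NB. bytes up to the last LF
       emit keep {. txt
       offset =. offset + keep
     end.
     i.0 0
   )

   run ''

Only one ~10MB slice is ever in memory at a time, so this should stay
well within even a small machine, as Raul suggests.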
