I don't know if this helps, but I wrote an adverb called "doSomething" which applies a verb to a file in pieces: https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/WorkOnLargeFiles . There is an example of using it to work on large tab-delimited files here: https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/largeFileVet .
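The full adverb is on the wiki page above; just to convey the idea, here is a minimal sketch of the same approach (the name chunkDo and the 1e6-byte piece size are only illustrative, and it assumes every line is shorter than a piece):

require 'files'   NB. for fsize and indexed fread
NB. chunkDo: apply u to a file in pieces of about 1e6 bytes,
NB. cutting each piece back to its last linefeed so that no
NB. line straddles two pieces; piece results are appended with ,
chunkDo=: adverb define
sz=. fsize y
pos=. 0
acc=. ''
while. pos < sz do.
  len=. 1e6 <. sz - pos
  buf=. fread y ; pos , len
  if. (pos + len) < sz do.
    keep=. >: buf i: LF   NB. index just past the last LF in this piece
    buf=. keep {. buf
    len=. keep
  end.
  acc=. acc , u buf
  pos=. pos + len
end.
acc
)
NB. e.g. total line count of a large file, one piece at a time:
NB.   +/ #@(];._2) chunkDo 'myfile_20MB.csv'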
On Tue, May 5, 2020 at 3:46 AM John Baker <[email protected]> wrote:

I've hit this problem many times over the years. One way to tackle the memory consumption is to load the CSV as symbols. Many CSV/TAB-delimited files have highly repetitive columns. A symbol is stored only once, so repeated columns do not consume vast amounts of memory.

Here are two verbs I use to load TAB-delimited files (basically CSVs using TABs as the delimiter):

NB. read TAB delimited table files - faster than (readtd) - see long document
readtd2=: [: <;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))

NB. read TAB delimited table files as symbols - see long document
readtd2s=: [: s:@<;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))

Enjoy

On Mon, May 4, 2020 at 11:10 PM Ric Sherlock <[email protected]> wrote:

The current parsers in tables/dsv deal with the worst-case scenario (e.g. type is mixed within and between columns, fields can contain field and/or record delimiters). The more you can assume that these edge cases are not present, the more you can reduce memory usage. If, for example, you know that there is no quoting of text and no field delimiters within fields, then rather than using the tables/dsv addon,

   (1024^2) %~ 7!:2 'readcsv ''myfile_20MB.csv'' '
487.617

you could just do:

   dat=: (','&cut);._2 freads 'myfile_20MB.csv'

This decreases the memory required for processing.

   (1024^2) %~ 7!:2 ' ('',''&cut);._2 freads ''myfile_20MB.csv'' '
359.379

If you have well-formed, simple numeric files (no header), then much of the processing done by the addon is superfluous. For example, the same file (but space-delimited) could be read into a numeric table (not boxed text) like this:

   dat=: _999 ". ];._2 freads 'myfile_20MB.txt'

   (1024^2) %~ 7!:2 '_999 ". ];._2 freads ''myfile_20MB.txt'' '
160.002

Doing the conversion to numeric a line at a time can further decrease the total memory used.

   (1024^2) %~ 7!:2 '_999&".;._2 freads ''myfile_20MB.txt'' '
96.2676

In the end, to cope with very large files, probably the best way of reducing memory usage (other than switching to another parser, e.g. the Jd csv reader) would be to process chunks of the file at a time. This technique is used in other languages (e.g. python/pandas) as well. If someone was looking for a project, providing the option to automatically read (and write) files in chunks would be a nice extension to the tables/dsv addon.

Cheers,

On Tue, May 5, 2020 at 4:31 PM bill lam <[email protected]> wrote:

I repeated your test with a much larger size (dat1=: 1e6 # dat); the memory usage is 36x the byte size of the csv. I think this is reasonable for J, because it uses several integer arrays of the same length as the csv character data, and each integer is 8 bytes long, so the total byte size of 4 such integer arrays is already 32x the byte size of the csv.

I don't think this is a bug in J. If you are concerned about memory efficiency, you should do it in C. Putting it the other way: if efficient csv parsing could be done in a J script, then the special csv code in Jd would not be needed.
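To make John's earlier point about symbols concrete, a small illustration (the column contents here are invented): a symbol array stores each distinct string once in a global table, so a highly repetitive column costs little more than one machine word per field.

   col=: 1e6 $ 'NY';'CA';'NY';'TX'   NB. a million highly repetitive fields
   scol=: s: col                     NB. intern them as symbols
   (7!:5 <'col') , 7!:5 <'scol'      NB. compare the space used by each name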
On Tue, May 5, 2020 at 11:55 AM Aaron Ash <[email protected]> wrote:

Hi,

I've noticed that the tables/dsv addon seems to have an extremely high memory growth factor when processing csv data:

   load 'tables/csv'
   dat=: (34;'45';'hello';_5.34),: 12;'32';'goodbye';1.23
   d=: makecsv dat
   # d                     NB. 45 chars long
   timespacex 'fixcsv d'   NB. 2.28e_5 4864
   4864 % 45               NB. 108.089 factor of memory growth

This makes loading many datasets effectively impossible even on reasonably specced machines.

A 1GB csv file would require 108GB of memory to load, which seems fairly extreme, to the point where I would consider this a bug.

Someone on irc mentioned that larger datasets should generally be loaded into Jd, and that's fair enough, but I would still expect to be able to load csv data reasonably quickly and memory-efficiently.

Is this a bug? Is there a better library to use for csv data?

Cheers,

Aaron.
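A small helper along these lines (hypothetical, not part of any addon) makes the growth-factor measurement in this thread repeatable; 7!:2 reports the space a sentence needs to run, and fsize comes from the files utilities:

   require 'files'
   NB. growthof: bytes needed to run sentence y, per byte of file x
   growthof=: 4 : '(7!:2 y) % fsize x'
   NB. e.g.  'myfile_20MB.csv' growthof 'fixcsv fread ''myfile_20MB.csv'''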
