I don't know if this helps, but I wrote an adverb called "doSomething"
which applies a verb to a file in pieces:
https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/WorkOnLargeFiles .
There is an example of using it to work on large tab-delimited files here:
https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/largeFileVet .
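
The core of the idea, in case the wiki is unreachable, is roughly the
following (a minimal sketch only - the adverb name, chunk size, and details
below are mine, not the wiki code, and it splits on raw byte counts rather
than record boundaries, which the wiki version takes care of):

CHUNK=: 10e6                    NB. assumed piece size in bytes
inPieces=: adverb define
  NB. apply verb u to successive CHUNK-sized pieces of file y, boxing each result
  sz=. fsize y
  r=. ''
  for_off. CHUNK * i. >. sz % CHUNK do.
    r=. r , < u fread y ; off , CHUNK <. sz - off   NB. fread file;start,len
  end.
  r
)

NB. e.g. count lines by counting LFs piece by piece:
NB.    +/ ; ([: +/ LF = ]) inPieces 'myfile_20MB.csv'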

On Tue, May 5, 2020 at 3:46 AM John Baker <[email protected]> wrote:

> I've hit this problem many times over the years. One way to tackle the
> memory consumption is to load the CSV as symbols. Many CSV/TAB-delimited
> files have highly repetitive columns. A symbol is stored only once, so
> repeated columns do not consume vast amounts of memory.
>
> Here are two verbs I use to load TAB-delimited files (basically CSVs using
> TABs as the delimiter):
>
> NB. read TAB-delimited table files - faster than (readtd) - see long document
>
> readtd2=:[: <;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
>
>
> NB. read TAB-delimited table files as symbols - see long document
>
> readtd2s=:[: s:@<;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
>
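> A quick way to see the effect on a real file (the file name here is just a
> placeholder) is to load it both ways and compare the space the results take:
>
>    t=: readtd2  'myfile.tab'   NB. fields as boxed text
>    s=: readtd2s 'myfile.tab'   NB. same fields as symbols
>    (7!:5 <'t') , 7!:5 <'s'     NB. bytes used by each result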
>
> Enjoy
>
>
>
>
> On Mon, May 4, 2020 at 11:10 PM Ric Sherlock <[email protected]> wrote:
>
> > The current parsers in tables/dsv deal with the worst-case scenario (e.g.
> > type is mixed within and between columns, fields can contain field and/or
> > record delimiters). The more you can assume that these edge cases are not
> > present, the more you can reduce memory usage. If, for example, you know
> > that there is no quoting of text and no field delimiters appear within
> > fields, then rather than using the tables/dsv addon,
> >
> > (1024^2) %~ 7!:2 'readcsv ''myfile_20MB.csv'' '
> > 487.617
> >
> > you could just do:
> >
> > dat=: (','&cut);._2 freads 'myfile_20MB.csv'
> >
> > This decreases the memory required for processing.
> >
> > (1024^2) %~ 7!:2 ' ('',''&cut);._2 freads ''myfile_20MB.csv'' '
> > 359.379
> >
> > If you have well-formed, simple numeric files (no header), then much of
> > the processing done by the addon is superfluous. For example the same
> > file (but space-delimited) could be read into a numeric table (not boxed
> > text) like this:
> >
> > dat=: _999 ". ];._2 freads 'myfile_20MB.txt'
> >
> > (1024^2) %~ 7!:2 '_999 ". ];._2 freads ''myfile_20MB.txt'' '
> > 160.002
> >
> > Doing the conversion to numeric a line at a time can further decrease
> > total memory used.
> >
> > (1024^2) %~ 7!:2 '_999&".;._2 freads ''myfile_20MB.txt'' '
> > 96.2676
> >
> >
> > In the end, to cope with very large files, probably the best way of
> > reducing memory usage (other than switching to another parser, e.g. the
> > Jd csv reader) would be to process the file in chunks. This technique is
> > used in other languages (e.g. python/pandas) as well. If someone were
> > looking for a project, then providing the option to automatically read
> > (and write) files in chunks would be a nice extension to the tables/dsv
> > addon.
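> >
> > As a very rough sketch of the chunked approach (the verb name, chunk size,
> > and details are mine, not anything in the addon; it assumes a
> > space-delimited numeric file like the one above and that every chunk
> > contains at least one LF): read a fixed-size block, parse only up to the
> > last LF, and carry the partial trailing record into the next block so no
> > record is ever split.
> >
> >    NB. sum all the numbers in a large numeric file, about 10MB at a time
> >    sumchunked=: 3 : 0
> >      CH=. 10e6
> >      sz=. fsize y
> >      tot=. 0 [ rem=. ''
> >      for_off. CH * i. >. sz % CH do.
> >        blk=. rem , fread y ; off , CH <. sz - off
> >        cut=. 1 + blk i: LF              NB. just past the last complete record
> >        rem=. cut }. blk                 NB. partial record carried forward
> >        tot=. tot + +/ , _999&".;._2 cut {. blk
> >      end.
> >      tot + +/ _999 ". rem               NB. final record with no trailing LF
> >    )
> >
> >    sumchunked 'myfile_20MB.txt'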
> >
> > Cheers,
> >
> > On Tue, May 5, 2020 at 4:31 PM bill lam <[email protected]> wrote:
> >
> > > I repeated your test with a much larger size, dat1=: 1e6 # dat, and
> > > the memory usage is 36x the byte size of the csv. I think this is
> > > reasonable for J, because it uses several integer arrays of the same
> > > length as the csv character data. Each integer is 8 bytes long, so the
> > > total byte size of four such integer arrays is already 32x the byte
> > > size of the csv.
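> > >
> > > (A quick way to sanity-check that ratio - the text below is made up for
> > > illustration: an 8-byte integer array as long as the text costs about 8x
> > > the text itself, so four such arrays are already about 32x.)
> > >
> > >    txt=: 1e6 $ 'ab,cd,ef' , LF     NB. stand-in for the csv text
> > >    ix=: i. # txt                   NB. one integer array of the same length
> > >    (7!:5 <'ix') % 7!:5 <'txt'      NB. roughly 8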
> > >
> > > I don't think this is a bug in J. If you are concerned about memory
> > > efficiency, you should do it in C. Putting it the other way, if
> > > efficient csv parsing could be done in a J script, then the special csv
> > > code in Jd would not be needed.
> > >
> > >
> > > On Tue, May 5, 2020 at 11:55 AM Aaron Ash <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I've noticed that the tables/dsv addon seems to have an extremely high
> > > > memory growth factor when processing csv data:
> > > >
> > > > load 'tables/csv'
> > > > dat=: (34;'45';'hello';_5.34),: 12;'32';'goodbye';1.23
> > > > d=: makecsv dat
> > > > # d                     NB. 45 chars long
> > > > timespacex 'fixcsv d'   NB. 2.28e_5 4864
> > > > 4864 % 45               NB. 108.089 factor of memory growth
> > > >
> > > > This makes loading many datasets effectively impossible even on
> > > > reasonably specced machines.
> > > >
> > > > A 1GB csv file would require 108GB of memory to load, which seems
> > > > extreme enough that I would consider it a bug.
> > > >
> > > > Someone on IRC mentioned that larger datasets should generally be
> > > > loaded into Jd, and that's fair enough, but I would still expect to be
> > > > able to load csv data reasonably quickly and memory-efficiently.
> > > >
> > > > Is this a bug? Is there a better library to use for csv data?
> > > >
> > > > Cheers,
> > > >
> > > > Aaron.
> > > >
> >
>
>
> --
> John D. Baker
> [email protected]
>


-- 

Devon McCormick, CFA

Quantitative Consultant
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
