I've hit this problem many times over the years. One way to tackle the
memory consumption is to load the CSV as symbols. Many CSV/TAB-delimited
files have highly repetitive columns, and a symbol is stored only once, so
repeated values do not consume vast amounts of memory.
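For instance (a toy comparison, the column contents are made up), here is the
same highly repetitive column stored as boxed text and as symbols:

col=: 100000 $ 'alpha';'beta';'gamma'   NB. made-up repetitive boxed column
cols=: s: col                           NB. same column converted to symbols
(7!:5 <'col') , 7!:5 <'cols'            NB. bytes used by each representation

Each distinct string is stored once in the symbol table and each cell is just
an index into it, which is where the savings come from.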
Here are two verbs I use to load TAB-delimited files (basically CSVs using
TAB as the delimiter).
NB. read TAB delimited table files - faster than (readtd) - see long document
readtd2=:[: <;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
NB. read TAB delimited table files as symbols - see long document
readtd2s=:[: s:@<;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
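Hypothetical usage (the file name is made up):

t=: readtd2  'mydata.tab'    NB. boxed literal table
s=: readtd2s 'mydata.tab'    NB. same table with every cell as a symbol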
Enjoy
On Mon, May 4, 2020 at 11:10 PM Ric Sherlock <[email protected]> wrote:
> The current parsers in tables/dsv deal with the worst-case scenario (e.g.
> type is mixed within and between columns, fields can contain field and/or
> record delimiters). The more you can assume that these edge cases are not
> present, the more you can reduce memory usage. If, for example, you know
> that there is no quoting of text and no field delimiters occur within
> fields, then rather than using the tables/dsv addon,
>
> (1024^2) %~ 7!:2 'readcsv ''myfile_20MB.csv'' '
> 487.617
>
> you could just do:
>
> dat=: (','&cut);._2 freads 'myfile_20MB.csv'
>
> This decreases the memory required for processing.
>
> (1024^2) %~ 7!:2 ' ('',''&cut);._2 freads ''myfile_20MB.csv'' '
> 359.379
>
> If you have well-formed, simple numeric files (no header), then much of the
> processing done by the addon is superfluous. For example the same file (but
> space-delimited) could be read into a numeric table (not boxed text) like
> this:
>
> dat=: _999 ". ];._2 freads 'myfile_20MB.txt'
>
> (1024^2) %~ 7!:2 '_999 ". ];._2 freads ''myfile_20MB.txt'' '
> 160.002
>
> Doing the conversion to numeric a line-at-a-time can further decrease total
> memory used.
>
> (1024^2) %~ 7!:2 '_999&".;._2 freads ''myfile_20MB.txt'' '
> 96.2676
>
>
> In the end, to cope with very large files, probably the best way of
> reducing memory usage (other than switching to another parser, e.g. the Jd
> csv reader) would be to process the file in chunks. This
> technique is used in other languages (e.g. python/pandas) as well. If
> someone were looking for a project, providing the option to
> automatically read (and write) files in chunks would be a nice extension to
> the tables/dsv addon.
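A rough sketch of that chunked approach using indexed reads (1!:11); the verb
name, file name, chunk size, and per-line numeric parse are only placeholders,
and it assumes every chunk is longer than the longest line:

NB. sketch: accumulate a numeric table from an LF-terminated file in chunks
readnumchunks=: 3 : 0
  size=. fsize y
  chunk=. 1e7                      NB. placeholder: bytes read per pass
  carry=. ''
  dat=. 0 0 $ 0
  pos=. 0
  while. pos < size do.
    len=. chunk <. size - pos
    blk=. carry , 1!:11 y ; pos , len
    pos=. pos + len
    keep=. >: blk i: 10{a.         NB. keep up to and including the last LF
    carry=. keep }. blk            NB. partial last line waits for next pass
    dat=. dat , _999&".;._2 keep {. blk
  end.
  dat
)
NB. e.g.  dat=: readnumchunks 'myfile_20MB.txt'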
>
> Cheers,
>
> On Tue, May 5, 2020 at 4:31 PM bill lam <[email protected]> wrote:
>
> > I repeated your test with a much larger size, dat1=: 1e6#dat, and
> > the memory usage is 36x the byte size of the csv.
> > I think this is reasonable for J, because it uses several integer
> > arrays of the same length as the csv text. But each integer is
> > 8 bytes long, and the total byte size of 4 such integer arrays is
> > already 32x the byte size of the csv.
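(Spelling out the arithmetic behind that 32x estimate:)

8 * 4   NB. 4 integer arrays, 8 bytes per integer, per byte of csv
32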
> >
> > I don't think this is a bug in J. If you are concerned about memory usage
> > efficiency, you should do it in C. Putting it the other way, if efficient
> > csv parsing can be done in J script, then the special csv code in Jd is
> > not needed.
> >
> >
> > On Tue, May 5, 2020 at 11:55 AM Aaron Ash <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I've noticed that the tables/dsv addon seems to have an extremely high
> > > memory growth factor when processing csv data:
> > >
> > > load 'tables/csv'
> > > dat=: (34;'45';'hello';_5.34),: 12;'32';'goodbye';1.23
> > > d=: makecsv dat
> > > # d                     NB. 45 chars long
> > > timespacex 'fixcsv d'   NB. 2.28e_5 4864
> > > 4864 % 45               NB. 108.089 factor of memory growth
> > >
> > > This makes loading many datasets effectively impossible even on
> > > reasonably specced machines.
> > >
> > > A 1GB csv file would require 108GB of memory to load, which seems
> > > extreme enough that I would consider it a bug.
> > >
> > > Someone on IRC mentioned that larger datasets should generally be
> > > loaded into Jd, and that's fair enough, but I would still expect to be
> > > able to load csv data reasonably quickly and memory-efficiently.
> > >
> > > Is this a bug? Is there a better library to use for csv data?
> > >
> > > Cheers,
> > >
> > > Aaron.
--
John D. Baker
[email protected]