The current parsers in tables/dsv deal with the worst-case scenario (e.g.
type is mixed within and between columns, fields can contain field and/or
record delimiters). The more of these edge cases you can assume away, the
more you can reduce memory usage. The figures below come from 7!:2, which
reports the space in bytes needed to execute a sentence; dividing by 1024^2
converts that to MB. If, for example, you know that there is no quoting of
text and that no field delimiters occur within fields, then rather than
using the tables/dsv addon,
(1024^2) %~ 7!:2 'readcsv ''myfile_20MB.csv'' '
487.617
you could just do:
dat=: (','&cut);._2 freads 'myfile_20MB.csv'
This decreases the memory required for processing.
(1024^2) %~ 7!:2 ' ('',''&cut);._2 freads ''myfile_20MB.csv'' '
359.379
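To see what that phrase is doing, here is a tiny made-up example (the data
is invented purely for illustration; cut is the standard library verb that
splits a string at a delimiter):

NB. toy data, for illustration only
s=: 'a,1,x',LF,'b,2,y',LF
(','&cut);._2 s   NB. 2 3 boxed table: ('a';'1';'x') ,: 'b';'2';'y'

Each LF-terminated line is handed to ','&cut, which splits it into a row of
boxed fields, so you get one boxed row per record without any of the
quote-handling machinery.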
If you have well-formed, simple numeric files (no header), then much of the
processing done by the addon is superfluous. For example, the same file (but
space-delimited) could be read into a numeric table (rather than boxed text)
like this:
dat=: _999 ". ];._2 freads 'myfile_20MB.txt'
(1024^2) %~ 7!:2 '_999 ". ];._2 freads ''myfile_20MB.txt'' '
160.002
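Here ];._2 builds a (space-padded) character matrix with one row per line,
and dyadic ". converts each row to numbers, substituting the left argument
(_999) for anything it cannot parse. A made-up miniature version:

NB. toy data, for illustration only
t=: '1 2 3',LF,'4 5 6',LF
_999 ". ];._2 t   NB. 2 3 numeric table: 1 2 3 ,: 4 5 6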
Doing the conversion to numeric a line at a time can further decrease the
total memory used.
(1024^2) %~ 7!:2 '_999&".;._2 freads ''myfile_20MB.txt'' '
96.2676
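The saving comes from applying _999&". to each line inside the cut, so the
padded character matrix of the whole file is never materialised; only the
current line and the growing numeric result are held at once. On the same
made-up data as above the result is identical:

NB. toy data again, for illustration only
t=: '1 2 3',LF,'4 5 6',LF
_999&".;._2 t     NB. same 2 3 numeric table as _999 ". ];._2 t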
In the end, to cope with very large files, probably the best way of reducing
memory usage (other than switching to another parser, e.g. the Jd csv
reader) would be to process the file a chunk at a time; a rough sketch of
the idea follows below. This technique is used in other languages (e.g.
Python/pandas) as well. If someone were looking for a project, providing the
option to automatically read (and write) files in chunks would be a nice
extension to the tables/dsv addon.
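As a starting point, here is a minimal sketch of a chunked numeric reader.
It is only an illustration: readchunked is a made-up name, it assumes a
space-delimited, header-free numeric file with a consistent number of
columns, and it relies on the standard library fread supporting indexed
reads (fread 'file';start,len) and on fsize for the file size.

NB. sketch only: read a numeric file a chunk at a time
NB. y is filename ; chunk size in bytes, e.g. readchunked 'big.txt';16e6
readchunked=: 3 : 0
  'file csize'=. y
  total=. fsize file
  pos=. 0
  dat=. 0 0 $ 0
  while. pos < total do.
    len=. csize <. total - pos
    buf=. fread file ; pos , len
    if. (pos + len) < total do.
      keep=. >: buf i: LF        NB. keep only complete (LF-terminated) lines
      buf=. keep {. buf
      len=. keep
    end.
    dat=. dat , _999&".;._2 buf  NB. convert this chunk's lines to numbers
    pos=. pos + len
  end.
  dat
)

A chunked writer could work the same way in reverse, formatting a block of
rows at a time and appending it with fappend.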
Cheers,
On Tue, May 5, 2020 at 4:31 PM bill lam <[email protected]> wrote:
> I repeated your test with a much larger size, dat1=: 1e6#dat, and
> the memory usage is 36x the byte size of the csv.
> I think this is reasonable for J, because it uses several integer
> arrays of the same length as the csv character data. Each integer is
> 8 bytes long, so the total byte size of 4 such integer arrays is already
> 32x the byte size of the csv.
>
> I don't think this is a bug in J. If you are concerned about memory usage
> efficiency, you should do it in C. Putting it the other way, if efficient
> csv parsing could be done in J script, then the special csv code in Jd
> would not be needed.
>
>
> On Tue, May 5, 2020 at 11:55 AM Aaron Ash <[email protected]> wrote:
>
> > Hi,
> >
> > I've noticed that the tables/dsv addon seems to have an extremely high
> > memory growth factor when processing csv data:
> >
> > load 'tables/csv'
> > dat=: (34;'45';'hello';_5.34),:12;'32';'goodbye';1.23
> > d=: makecsv dat
> > # d                      NB. 45 chars long
> > timespacex 'fixcsv d'    NB. 2.28e_5 4864
> > 4864 % 45                NB. 108.089 factor of memory growth
> >
> > This makes loading many datasets effectively impossible even on
> > reasonably specced machines.
> >
> > A 1GB csv file would require 108GB of memory to load which seems
> > fairly extreme to the point where I would consider this a bug.
> >
> > Someone on irc mentioned that larger datasets should generally be
> > loaded into Jd, and that's fair enough, but I would still expect to be
> > able to load csv data reasonably quickly and memory-efficiently.
> >
> > Is this a bug? Is there a better library to use for csv data?
> >
> > Cheers,
> >
> > Aaron.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm