Aaron,

Jd (addon data/jd) has a fast, memory-efficient csv loader. This may be
overkill for your requirements, but it might be worth a look.

On Tue, May 5, 2020 at 9:09 AM Devon McCormick <[email protected]> wrote:

> I don't know if this helps, but I wrote an adverb called "doSomething"
> which applies a verb to a file in pieces:
> https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/WorkOnLargeFiles
> There is an example of using it to work on large tab-delimited files here:
> https://code.jsoftware.com/wiki/User:Devon_McCormick/Code/largeFileVet
>
> On Tue, May 5, 2020 at 3:46 AM John Baker <[email protected]> wrote:
>
> > I've hit this problem many times over the years. One way to tackle the
> > memory consumption is to load the CSV as symbols. Many CSV/TAB-delimited
> > files have highly repetitive columns. A symbol is stored only once, so
> > repeated values in a column do not consume vast amounts of memory.
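
As a rough illustration of the point about symbols (a minimal sketch; the names col and sym are made up for the example), a repetitive column can be held both ways and the space compared with 7!:5:

```j
NB. hypothetical example data: a highly repetitive column as boxed strings
col=: 100000 $ 'red';'green';'blue'

NB. the same column as symbols: each distinct string is stored once
sym=: s: col

NB. bytes used by each noun, boxed first, then symbols
7!:5 'col';'sym'
```

When the text itself is needed again, 5 s: sym converts the symbols back to boxed strings.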
> >
> > Here are two verbs I use to load TAB-delimited files (basically CSVs
> > using TABs as the delimiter).
> >
> > NB. read TAB delimited table files - faster than (readtd) - see long document
> >
> > readtd2=:[: <;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
> >
> >
> > NB. read TAB delimited table files as symbols - see long document
> >
> > readtd2s=:[: s:@<;._2&> (9{a.) ,&.>~ [: <;._2 [: (] , ((10{a.)"_ = {:) }. (10{a.)"_) (13{a.) -.~ 1!:1&(]`<@.(32&>@(3!:0)))
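
For anyone who finds the tacit definitions hard to follow, here is my reading of readtd2 written as an explicit verb. This is only a sketch: the name readtd_sketch is hypothetical, it accepts an unboxed file name only, and it relies on CR, LF and TAB from the standard library.

```j
NB. hypothetical explicit equivalent of readtd2 (unboxed file name only)
readtd_sketch=: 3 : 0
 txt=. 1!:1 < y                           NB. read the whole file as characters
 txt=. txt -. CR                          NB. drop carriage returns
 if. LF ~: {: txt do. txt=. txt , LF end. NB. make sure the last line ends in LF
 lines=. <;._2 txt                        NB. one box per line
 <;._2@(,&TAB)&> lines                    NB. append TAB to each line, cut on TAB
)
```

Replacing the final <;._2 with s:@<;._2 should give the symbol-valued table that readtd2s produces.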
> >
> >
> > Enjoy
> >
> >
> >
> >
> > On Mon, May 4, 2020 at 11:10 PM Ric Sherlock <[email protected]> wrote:
> >
> > > The current parsers in tables/dsv deal with the worst-case scenario
> > > (e.g. type is mixed within and between columns, fields can contain field
> > > and/or record delimiters). The more you can assume that these edge cases
> > > are not present, the more you can reduce memory usage. If, for example,
> > > you know that there is no quoting of text and no field delimiters within
> > > fields, then rather than using the tables/dsv addon,
> > >
> > > (1024^2) %~ 7!:2 'readcsv ''myfile_20MB.csv'' '
> > > 487.617
> > >
> > > you could just do:
> > >
> > > dat=: (','&cut);._2 freads 'myfile_20MB.csv'
> > >
> > > This decreases the memory required for processing.
> > >
> > > (1024^2) %~ 7!:2 ' ('',''&cut);._2 freads ''myfile_20MB.csv'' '
> > > 359.379
> > >
> > > If you have well-formed, simple numeric files (no header), then much of
> > > the processing done by the addon is superfluous. For example, the same
> > > file (but space-delimited) could be read into a numeric table (not boxed
> > > text) like this:
> > >
> > > dat=: _999 ". ];._2 freads 'myfile_20MB.txt'
> > >
> > > (1024^2) %~ 7!:2 '_999 ". ];._2 freads ''myfile_20MB.txt'' '
> > > 160.002
> > >
> > > Doing the conversion to numeric a line at a time can further decrease
> > > total memory used.
> > >
> > > (1024^2) %~ 7!:2 '_999&".;._2 freads ''myfile_20MB.txt'' '
> > > 96.2676
> > >
> > >
> > > In the end, to cope with very large files, probably the best way of
> > > reducing memory usage (other than switching to another parser, e.g. the
> > > Jd csv reader) would be to process the file in chunks. This technique
> > > is used in other languages (e.g. python/pandas) as well. If someone is
> > > looking for a project, then providing the option to automatically read
> > > (and write) files in chunks would be a nice extension to the tables/dsv
> > > addon.
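
To make the chunking idea concrete, here is a minimal sketch that uses the indexed file read 1!:11 and carries any partial last line over to the next chunk. It is not part of tables/dsv; the names CHUNK and sumchunks are hypothetical, and it assumes a space-delimited numeric file with LF line endings, like the .txt file above. It only sums the numbers, but the same loop shape would apply to any per-chunk processing.

```j
NB. hypothetical chunk size in bytes
CHUNK=: 1000000

NB. sum every number in a large space-delimited numeric file,
NB. holding only about CHUNK bytes of text in memory at a time
sumchunks=: 3 : 0
 fn=. < y                          NB. boxed file name
 size=. 1!:4 fn                    NB. file size in bytes
 pos=. 0
 rest=. ''                         NB. partial last line carried between chunks
 total=. 0
 while. pos < size do.
  len=. CHUNK <. size - pos
  buf=. rest , 1!:11 fn , < pos , len   NB. indexed read of the next chunk
  pos=. pos + len
  last=. buf i: LF                 NB. index of the last LF, or #buf if none
  if. last = # buf do.
   rest=. buf                      NB. no complete line yet, keep reading
  else.
   rest=. (>: last) }. buf
   total=. total + +/ , _999&".;._2 (>: last) {. buf
  end.
 end.
 NB. a final line without a trailing LF is still sitting in rest
 if. # rest do. total=. total + +/ _999 ". rest end.
 total
)
```

Note that fields which fail to convert are counted as _999, as in the one-liners above, so this is only sensible for clean numeric data.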
> > >
> > > Cheers,
> > >
> > > On Tue, May 5, 2020 at 4:31 PM bill lam <[email protected]> wrote:
> > >
> > > > I repeated your test with a much larger size, dat1=: 1e6#dat, and
> > > > the memory usage is 36 times the byte size of the csv.
> > > > I think this is reasonable for J, because it uses several integer
> > > > arrays of the same length as the csv character data. Each integer is
> > > > 8 bytes long, so the total byte size of 4 such integer arrays is
> > > > already 32 times the byte size of the csv.
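
The arithmetic above can be spelled out for a concrete file size. This is only a back-of-envelope sketch; the 20 MB figure and the names csvbytes and intarrays are assumptions for illustration, not measurements.

```j
csvbytes=: 20 * 1024^2     NB. hypothetical 20 MB csv, one byte per character
intarrays=: 4              NB. assumed number of integer work arrays
NB. work space in MB: 4 arrays of 8-byte integers, one integer per csv byte
(intarrays * 8 * csvbytes) % 1024^2
```

That works out to 32 times the file size in integer work space alone, before counting the boxed result.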
> > > >
> > > > I don't think this is a bug in J. If you are concerned about memory
> > > > efficiency, you should do it in C. Put the other way, if efficient csv
> > > > parsing could be done in a J script, then the special csv code in Jd
> > > > would not be needed.
> > > >
> > > >
> > > > On Tue, May 5, 2020 at 11:55 AM Aaron Ash <[email protected]> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I've noticed that the tables/dsv addon seems to have an extremely
> > > > > high memory growth factor when processing csv data:
> > > > >
> > > > > load 'tables/csv'
> > > > > dat=: (34;'45';'hello';_5.34),: 12;'32';'goodbye';1.23
> > > > > d=: makecsv dat
> > > > > # d
> > > > > NB. 45 chars long
> > > > > timespacex 'fixcsv d'
> > > > > NB. 2.28e_5 4864
> > > > > 4864 % 45
> > > > > NB. 108.089 factor of memory growth
> > > > >
> > > > > This makes loading many datasets effectively impossible even on
> > > > > reasonably specced machines.
> > > > >
> > > > > A 1GB csv file would require 108GB of memory to load, which seems
> > > > > fairly extreme, to the point where I would consider this a bug.
> > > > >
> > > > > Someone on irc mentioned that generally larger datasets should be
> > > > > loaded into jd, and that's fair enough, but I still would expect to
> > > > > be able to load csv data reasonably quickly and memory efficiently.
> > > > >
> > > > > Is this a bug? Is there a better library to use for csv data?
> > > > >
> > > > > Cheers,
> > > > >
> > > > > Aaron.
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > >
> > > >
> > > >
> > >
> >
> >
> > --
> > John D. Baker
> > [email protected]
> >
>
>
> --
>
> Devon McCormick, CFA
>
> Quantitative Consultant
>
