It is still worthwhile to use mapped files for this
problem. However, not every J operation makes use of
the mapped space provided, so it should be done with
caution.
Here is a process that does it: a very precise sequence
of steps with mapped files. Note the memory peaks, as read from Task Manager.
Temporary memory is also used in many steps (even though, seemingly,
the provided mapped space could be used). So "out of memory"
can occur if more than one large temporary is allocated.
To prevent this, each operation here is broken into small
assigned steps.
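The underlying index arithmetic is easiest to see on the short string
from the question, without any mapped files. This is only a small-scale
sketch; the names t, e, p, ix and out are chosen for illustration and
correspond roughly to f1, the ',,' mask, P, f2 and f3 in the script below.

```j
NB. small-scale check of the approach, no mapped files involved
t   =: '0,,34567,,abcd,,efg'     NB. sample input from the question
e   =: ',,' E. t                 NB. 1 where a ",," pair starts
p   =: +/ e                      NB. number of zeros to insert
ix  =: }: +/\ 0 , 1 + e          NB. target position of each input character
out =: t ix} (p + #t) $ '0'      NB. scatter the input over a field of '0'
out                              NB. displays 0,0,34567,0,abcd,0,efg
```

Every character that starts a ",," pair advances the target position by
two instead of one, so the slot it skips keeps the '0' fill.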
NB. =========================================================
load'jmf files'
nsizes=: (,. <@(7!:5) ,. <@(4!:0))@nl
Note 'interactive test' NB. replace ",," with ",0,"
NB. create dummy data
N=. 134125010
f1=. N$'qq,,zz,,125',LF NB. peak: 139 Mb
f1 1!:2 <jpath'~temp/f1'
erase <'f1'
NB. actual process
JCHAR map_jmf_ 'f1';jpath'~temp/f1'
$f1
]P=. ',,' +/@E. f1
createjmf_jmf_ (jpath'~temp/f2');4*1+N
(fsize jpath'~temp/f2'),N
map_jmf_ 'f2';jpath'~temp/f2'
f2=: ',,' E. f1 NB. peak: 402 Mb
f2=: f2+1 NB. peak: 1,190 Mb
f2=: 0,f2 NB. ]
f2=: +/\f2 NB. ] here large tmp mem
f2=: }:f2 NB. ]
createjmf_jmf_ (jpath'~temp/f3');N+P
map_jmf_ 'f3';jpath'~temp/f3'
f3=. (N+P)$'0'
f3=. f1 f2}f3 NB. peak: 1,343 Mb
f3 1!:2 <jpath'~temp/f4'
nsizes''
unmapall_jmf_''
)
NB. =========================================================
nsizes''
+------+---------+-+
|N     |64       |0|
+------+---------+-+
|P     |64       |0|
+------+---------+-+
|f1    |1.34218e8|0|
+------+---------+-+
|f2    |5.36871e8|0|
+------+---------+-+
|f3    |2.68435e8|0|
+------+---------+-+
|nsizes|1088     |3|
+------+---------+-+
load'dir'
dir jpath'~temp/*.'
f1 134125010 29-Jul-08 20:08:22
f2 536500328 29-Jul-08 20:08:23
f3 156479462 29-Jul-08 20:08:28
f4 156479178 29-Jul-08 20:08:34
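The file sizes in this listing can be cross-checked against the sizes
requested in the script. A rough sketch, using only the numbers shown
above; reading the constant remainder as the mapped-file header is an
assumption on my part, and h2 and h3 are illustrative names.

```j
N  =: 134125010               NB. size of f1 in bytes, from the listing
P  =: 156479178 - N           NB. f4 holds N+P characters, so P is 22354168
h2 =: 536500328 - 4 * 1 + N   NB. f2 file size minus the 4*(1+N) requested
h3 =: 156479462 - N + P       NB. f3 file size minus the N+P requested
h2 , h3                       NB. displays 284 284
```

Both mapped files come out 284 bytes larger than the data size passed
to createjmf_jmf_, while f4, written with 1!:2, is exactly N+P bytes.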
> From: Matthew Brand <[EMAIL PROTECTED]>
>
> Hi All,
>
> I have a very large data file which contains comma separated values (some
> columns are non numeric data). It is so big that doing anything with it
> apart from very basic things in 32-bit seems to run out of memory:
>
> JCHAR map_jmf_ 'text';textfile
> $text
> 134125010
>
> It contains many "missing values" that are denoted by two commas with
> nothing in between. I want to read in the file and output a new one with
> zeroes in between the two commas.
>
> E.g. if the input file is:
> text =.'0,,34567,,abcd,,efg'
>
> then the output should be:
> '0,0,34567,0,abcd,0,efg'
>
> I am really struggling with this. I have managed to get on a 64 bit XP
> machine to overcome the out-of-memory errors, but everything I come up with
> is so slow that I just kill the process and have a rethink.
>
> This is what I have come up with so far:
> text =.'0,,34567,,abcd,,efg'
> commas =. ( [: I. ',,'&E. ) text
> start =. 0, commas + 1
> end =. (<: # text) _1} 1 |. <: start
> grab =. ( [: < {. + [: i. [: >: }.-{. )"_ 1 start,.end
> output =. ; ( '0' ,~ [: ] {&text )&.> grab
>
> But if I run it on the real file it is too slow.
>
> I also came up with this:
> ; 2 ([: < ((0&{)`(',0'"_))@.(([: ','&= 0&{ ) *. ([: ','&= _1&{ )))\
> '0,0,34567,0,abcd,0,efg'
> 0,0,34567,0,abcd,0,ef
>
> but again it is too slow on the massive file (last character missed, but
> easy to fix that) and runs out of memory on a 32-bit machine.
>
> Does anyone know a faster algorithm to do this on such a large file? Can it
> be done in a 32-bit address space? The problem can be solved by streaming
> through the data in C++, but I want to know how to do it in J efficiently
> without using explicit loops.
>
> Thanks,
> Matthew.
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm