Hi All,
I have a very large data file which contains comma-separated values (some
columns are non-numeric data). It is so big that doing anything with it
beyond very basic operations on a 32-bit machine seems to run out of memory:
JCHAR map_jmf_ 'text';textfile
$text
134125010
It contains many "missing values", which are denoted by two commas with
nothing in between. I want to read in the file and output a new one with
a zero in between each such pair of commas.
E.g. if the input file is:
text =.'0,,34567,,abcd,,efg'
then the output should be:
'0,0,34567,0,abcd,0,efg'
I am really struggling with this. I have managed to get access to a 64-bit XP
machine to overcome the out-of-memory errors, but everything I come up with
is so slow that I just kill the process and have a rethink.
This is what I have come up with so far:
text =.'0,,34567,,abcd,,efg'
commas =. ( [: I. ',,'&E. ) text
start =. 0, commas + 1
end =. (<: # text) _1} 1 |. <: start
grab =. ( [: < {. + [: i. [: >: }.-{. )"_ 1 start,.end
output =. ; ( '0' ,~ [: ] {&text )&.> grab
But if I run it on the real file it is too slow.
I also came up with this:
; 2 ([: < ((0&{)`(',0'"_))@.(([: ','&= 0&{ ) *. ([: ','&= _1&{ )))\
'0,0,34567,0,abcd,0,efg'
0,0,34567,0,abcd,0,ef
but again it is too slow on the massive file (last character missed, but
easy to fix that) and runs out of memory on a 32-bit machine.
Does anyone know a faster algorithm to do this on such a large file? Can it
be done in a 32-bit address space? The problem can be solved by streaming
through the data in C++, but I want to know how to do it in J efficiently
without using explicit loops.
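For reference, the C++ streaming approach I have in mind is roughly the
sketch below. The function name fill_empty_fields is just something I made
up for illustration, and it assumes a missing value is always exactly two
adjacent commas (it does not treat an empty field at the start or end of a
line as missing). It keeps only one character of state, so memory use does
not depend on the file size:

```cpp
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Copy `in` to `out`, writing a '0' between any two adjacent commas.
// Assumption: a missing value is exactly ",," as described above.
void fill_empty_fields(std::istream& in, std::ostream& out) {
    char c;
    bool prev_comma = false;  // was the previous character a comma?
    while (in.get(c)) {
        if (c == ',' && prev_comma) {
            out.put('0');     // fill the empty field before this comma
        }
        out.put(c);
        prev_comma = (c == ',');
    }
}

// Convenience wrapper for trying the idea on a small in-memory string.
std::string fill_empty_fields(const std::string& text) {
    std::istringstream in(text);
    std::ostringstream out;
    fill_empty_fields(in, out);
    return out.str();
}
```

On the real file the same function could be handed a std::ifstream and a
std::ofstream (opened in binary mode) instead of string streams. Runs of
three or more commas are handled too, since each comma that follows another
comma gets a zero written in front of it.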
Thanks,
Matthew.
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm