Hello Matthew,
Okay, one minute (60 seconds) will suffice then.
The conjunction idea is essentially correct. A 'filterFile' conjunction
could be written like so:
TargetFile ChunkSize filterFile FilterVerb SourceFile
This would be the modern form of a flat APL utility I recall developing,
back when 10MB and 512KB were the hard drive and RAM sizes
respectively. There will always be a bigger data source. I recall
manually adjusting the chunk size as time passed, benchmarking all the
way. FilterVerb was a direct definition string, a pre-dynamic function
if you will.
This tool was also tweaked at some point to chunk on newline
boundaries. Generalizing to a 'big cut' looked feasible enough that it
never got implemented.
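A minimal sketch of that conjunction in current J might look like the following (all names are illustrative; the foreigns used are 1!:4 file size, 1!:11 indexed read, 1!:2 write and 1!:3 append; chunks are cut on raw byte boundaries, so the newline tweak would still be wanted):
filterFile =: 2 : 0
:
NB. x: target file name     y: source file name
NB. m: chunk size in bytes  v: filter verb applied to each chunk
sz =. 1!:4 <y                      NB. size of the source file
'' 1!:2 <x                         NB. start the target file empty
for_start. m * i. >. sz % m do.    NB. chunk start offsets
  chunk =. 1!:11 y ; start , m <. sz - start
  (v chunk) 1!:3 <x                NB. filter the chunk and append the result
end.
i.0 0
)
Used as:  'C:\output.csv' (10000 filterFile someFilter) 'C:\input.csv'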
Matthew Brand wrote:
it can be done in 43 seconds using C# for comparison:
DateTime t = DateTime.Now;
StreamReader sr = new StreamReader(@"C:\input.csv");
StreamWriter sw = new StreamWriter(@"C:\output.csv");
string s = "";
string[] slt = null;
string news = "";
while (!sr.EndOfStream)
{
    s = sr.ReadLine();
    news = "";
    slt = s.Split(',');
    foreach (string p in slt)
    {
        if (p == "")            // empty field: substitute a zero
            news += "0";
        else
            news += p;
        news += ",";
    }
    news = news.Remove(news.Length - 1);   // drop the trailing comma
    sw.WriteLine(news);
}
sw.Close();
sr.Close();
MessageBox.Show("Took " + DateTime.Now.Subtract(t).Seconds + " seconds");
I think the main problem in J is running out of RAM during the process. I
tried the (#!.'0'~ 1 j. ',,' E. ]) solution on a 4 GB 64-bit XP machine and
it was very slow; Task Manager showed it starting to use the page file, at
which point I stopped the process. I think that, given enough RAM, J would
do it quickly with (#!.'0'~ 1 j. ',,' E. ]) ... but what about a 1 GB file,
or a 10 GB file?
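To see what that expression is doing on a toy row: ',,' E. y marks the start of each ',,' pair, 1 j. turns the marks into complex copy counts, and #!.'0' then appends a '0' fill after the first comma of every marked pair (worked by hand here, not a timing run):
   (#!.'0'~ 1 j. ',,' E. ]) '1,,2,,,3'
1,0,2,0,0,3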
I don't think that 138MB is considered very large these days. Intraday
trading data or output of global climate models can easily be larger than
this.
I was wondering whether it is possible to write an adverb that automatically
splits whatever is coming into it on the basis of available memory, or to
make some kind of chunkification happen automatically... but I don't think it
would be easy to do. One might supply a set of asserts to guide the splitting
process and use information about the RAM size to determine whether splitting
is necessary and how many splits to do.
The chunkify method takes 86 seconds, which is good enough for me at the
moment...
require'jmf' NB. map file utilities loaded in jmf
require'files dir'
textfile =. 'C:\input.csv'
JCHAR map_jmf_ 'bigtext';textfile
chunk_idx =. (i.@:<.&.(%&chunk_size =: 10000))@:#
chunkify_mask =. (($@:[ $ 0"_) (1"_)`]`[} _1 , ] + ',,' -:"1 ({~ (,. >:))) chunk_idx
null2zero =. #!.'0'~ 1 j.',,' E. ]
ts =: 6!:2 , 7!:2@]   NB. time (seconds) and space (bytes)
ts ' ((;@:(<@:null2zero;.2)~ chunkify_mask) bigtext ) 1!:2 <''C:\output.csv'' '
85.6266 5.38447e8
It would be nice if this chunkification just happened somehow in the
background, though, and all you needed to write was:
require'jmf' NB. map file utilities loaded in jmf
require'files dir'
textfile =. 'C:\input.csv'
JCHAR map_jmf_ 'bigtext';textfile
( #!.'0'~ 1 j.',,' E. ]) MEMHANDLE (assert1`assert2`...) bigtext
Where MEMHANDLE is a conjunction that manages the splitting, and assert1, ...
is a list of conditions that have to hold true in each split.
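For what it's worth, something much cruder than MEMHANDLE can be sketched already: an adverb that applies its verb operand to fixed-size pieces of y under a caller-chosen byte budget and razes the results. MAXBYTES and chunked are made-up names, no asserts are consulted, and pieces are cut at arbitrary character positions, so a ',,' straddling a piece boundary would be missed (the very thing chunkify_mask guards against):
MAXBYTES =: 50e6                            NB. assumed per-piece budget, in bytes
chunked  =: 1 : '; u&.> (- MAXBYTES) <\ y'  NB. apply u to non-overlapping pieces, raze
With those definitions one could write ( #!.'0'~ 1 j.',,' E. ]) chunked bigtext.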
--
------------------------------------------------------------------------
|\/| Randy A MacDonald | APL: If you can say it, it's done.. (ram)
|/\| ramacd <at> nbnet.nb.ca |
|\ | | The only real problem with APL is that
BSc(Math) UNBF'83 | it is "still ahead of its time."
Sapere Aude | - Morten Kromberg
Natural Born APL'er |
-----------------------------------------------------(INTP)----{ gnat }-
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm