Hello Matthew,
Okay, one minute (60 seconds) will suffice then.
The conjunction idea is essentially correct. A 'filterFile' conjunction
could be written like so:
TargetFile ChunkSize filterFile FilterVerb SourceFile
This would be the modern form of a flat APL utility I recall developing,
back when 10MB and 512KB were the hard drive and RAM sizes
respectively. There will always be a bigger data source. I recall
manually adjusting the chunk size as time passed, benchmarking all the
way. FilterVerb was a direct definition string, a pre-dynamic function
if you will.
This tool was also tweaked at some point to chunk on newline
boundaries. Generalizing to a 'big cut' looked feasible enough that it
never got implemented.
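A minimal sketch of that conjunction in current J might look like the following (all names are illustrative; the foreigns used are 1!:4 file size, 1!:11 indexed read, 1!:2 write and 1!:3 append; chunks are cut on raw byte boundaries, so the newline tweak would still be wanted):
filterFile =: 2 : 0
:
NB. x: target file name     y: source file name
NB. m: chunk size in bytes  v: filter verb applied to each chunk
sz =. 1!:4 <y                      NB. size of the source file
'' 1!:2 <x                         NB. start the target file empty
for_start. m * i. >. sz % m do.    NB. chunk start offsets
  chunk =. 1!:11 y ; start , m <. sz - start
  (v chunk) 1!:3 <x                NB. filter the chunk and append the result
end.
i.0 0
)
Used as:  'C:\output.csv' (10000 filterFile someFilter) 'C:\input.csv'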
Matthew Brand wrote:
it can be done in 43 seconds using C# for comparison:
DateTime t = DateTime.Now;
StreamReader sr = new StreamReader(@"C:\input.csv");
StreamWriter sw = new StreamWriter(@"C:\output.csv");
string s = "";
string[] slt = null;
string news = "";
while (!sr.EndOfStream)
{
    s = sr.ReadLine();
    news = "";
    slt = s.Split(',');
    foreach (string p in slt)
    {
        if (p == "")            // empty field: substitute a zero
            news += "0";
        else
            news += p;
        news += ",";
    }
    news = news.Remove(news.Length - 1);   // drop the trailing comma
    sw.WriteLine(news);
}
sw.Close();
sr.Close();
MessageBox.Show("Took " + DateTime.Now.Subtract(t).Seconds + " seconds");
I think the main problem in J is running out of RAM during the process. I
tried the (#!.'0'~ 1 j. ',,' E. ]) solution on a 4 GB 64-bit XP machine and
it was very slow; Task Manager showed it starting to use the page file, at
which point I stopped the process. I think that, given enough RAM, J would
do it quickly with (#!.'0'~ 1 j. ',,' E. ]) ... but what about a 1 GB file,
or a 10 GB file?
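To see what that expression is doing on a toy row: ',,' E. y marks the start of each ',,' pair, 1 j. turns the marks into complex copy counts, and #!.'0' then appends a '0' fill after the first comma of every marked pair (worked by hand here, not a timing run):
   (#!.'0'~ 1 j. ',,' E. ]) '1,,2,,,3'
1,0,2,0,0,3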
I don't think that 138MB is considered very large these days. Intraday
trading data or output of global climate models can easily be larger than
this.
I was wondering whether it is possible to write an adverb that automatically
splits whatever is coming into it on the basis of available memory, or to
make some kind of chunkification happen automatically... but I don't think it
would be easy to do. One might supply a set of asserts to guide the splitting
process and use information about the RAM size to determine whether splitting
is necessary and how many splits to do.
The chunkify method takes 86 seconds, which is good enough for me at the
moment...
require'jmf' NB. map file utilities loaded in jmf
require'files dir'
textfile =. 'C:\input.csv'
JCHAR map_jmf_ 'bigtext';textfile
chunk_idx =. (i.@:<.&.(%&chunk_size =: 10000))@:#
chunkify_mask =. (($@:[ $ 0"_) (1"_)`]`[} _1 , ] + ',,' -:"1 ({~ (,. >:))) chunk_idx
null2zero =. #!.'0'~ 1 j.',,' E. ]
ts =: 6!:2 , 7!:2@]   NB. time (seconds) and space (bytes)
ts ' ((;@:(<@:null2zero;.2)~ chunkify_mask) bigtext ) 1!:2 <''C:\output.csv'' '
85.6266 5.38447e8
It would be nice if this chunkification just happened somehow in the
background, though, and all you needed to write was:
require'jmf' NB. map file utilities loaded in jmf
require'files dir'
textfile =. 'C:\input.csv'
JCHAR map_jmf_ 'bigtext';textfile
( #!.'0'~ 1 j.',,' E. ]) MEMHANDLE (assert1`assert2`...) bigtext
Where MEMHANDLE is a conjunction that manages the splitting, and assert1, ...
is a list of conditions that have to hold true in each split.
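For what it's worth, something much cruder than MEMHANDLE can be sketched already: an adverb that applies its verb operand to fixed-size pieces of y under a caller-chosen byte budget and razes the results. MAXBYTES and chunked are made-up names, no asserts are consulted, and pieces are cut at arbitrary character positions, so a ',,' straddling a piece boundary would be missed (the very thing chunkify_mask guards against):
MAXBYTES =: 50e6                            NB. assumed per-piece budget, in bytes
chunked  =: 1 : '; u&.> (- MAXBYTES) <\ y'  NB. apply u to non-overlapping pieces, raze
With those definitions one could write ( #!.'0'~ 1 j.',,' E. ]) chunked bigtext.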
--
------------------------------------------------------------------------
|\/| Randy A MacDonald | APL: If you can say it, it's done.. (ram)
|/\| ramacd <at> nbnet.nb.ca |
|\ | | The only real problem with APL is that
BSc(Math) UNBF'83 | it is "still ahead of its time."
Sapere Aude | - Morten Kromberg
Natural Born APL'er |
-----------------------------------------------------(INTP)----{ gnat }-
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm