So... I'm stuck at home and explicitly not allowed to work on
production systems, so... anyway... here's a partially tested
implementation of something close to request #4.  (I tested with
blocksize set to 150 instead of 1e7, and with the test data provided):

example use:
   (jpath '~user/output') rds '/tmp/infile'
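
(To reproduce the small test: save the header plus 25 sample data
lines below as /tmp/infile, change blocksize=. 1e7 to blocksize=. 150
inside rds, and then something like

   (jpath '~temp/stockout') rds '/tmp/infile'
   fread jpath '~temp/stockout/AA.csv'

should leave five .csv files in the output directory.  The file and
directory names here are just for illustration.)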

rds=:dyad define
  NB. x: output directory, y: input file name
  NB. guarantee clean slate for output directory:
  1!:5 ::0: <x NB. create it; ignore file name error if it exists
  1!:55 <x NB. remove it; interface error if not empty
  1!:5 <x
  blocksize=. 1e7
  size=. 1!:4 <y NB. input file size in bytes
  limit=. size-1
  'offset length'=. 0,blocksize <.limit NB. first read: start of file
  while. offset < size do.
    txt=. 1!:11 y;offset,length NB. indexed read of one block
    end=. 1+txt i: LF NB. bytes through the last complete line
    if. end <: #txt do.
      offset=. offset+end NB. advance past the lines we consumed
      length=. length <. size-offset NB. never read past end of file
      x emits fixs end{.txt NB. process only complete lines
    else.
      echo 'this is bad';offset,length,end NB. no LF in this block
      throw.
    end.
  end.
)
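
The heavy lifting is the indexed read, 1!:11, which pulls length
bytes starting at offset without slurping the whole file.  To show
the mechanics on a throwaway file:

   'abcdefgh' fwrite '/tmp/bytes'
8
   1!:11 '/tmp/bytes';2 3
cde

Each pass trims its block back to the last complete line (the i: LF
search) and advances offset by only the bytes actually consumed, so
no line ever gets split across blocks.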

require 'csv' NB. for makecsv

emits=:dyad define
  9!:11]15 NB. print precision 15, so prices are not rounded
  for_row. y do.
    'fnm dat'=. row NB. row is ticker;numeric table
    file=. <x,'/',fnm,'.csv'
    file 1!:3~ makecsv dat NB. append, so blocks accumulate per ticker
  end.
)
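
makecsv comes from the csv addon, hence the require.  With print
precision bumped to 15, it should turn a numeric table into one
comma-separated line per row without mangling the prices:

   9!:11]15
   makecsv 2 3$20170627 31.6 32.5 20170628 32.1 33
20170627,31.6,32.5
20170628,32.1,33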

header=: 'ticker,date,open,high,low,close,volume',LF

fixs=:verb define
   assert. LF={:y NB. block must end on a line boundary
   if. header-:(#header){. y do.
     y=. (#header)}.y NB. drop the header (first block only)
   end.
   boxes=. ','&cut;._2 y NB. one boxed row of fields per line
   labels=. ~.{."1 boxes NB. tickers, in order of first appearance
   dates=. _".(>1{"1 boxes)-."1 '-' NB. dates as yyyymmdd numbers
   trading=. fixoldtrades _ ".&> 2}."1 boxes NB. numeric trade columns
   labels ,. ({."1 boxes) </. dates,.trading NB. group rows by ticker
)
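
Feeding fixs a one-line block (no header) should give a two-column
boxed result, each ticker alongside its numeric rows.  That is,

   fixs 'AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0',LF

ought to match (<'AA') ,. <1 6$20170627 31.6 32.5 31.49 31.63 5463485
with the ticker stripped from the data and the date dashes gone.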

fixoldtrades=:verb define
   NB. old records carry a single price; replicate it across
   NB. all five columns (open,high,low,close,volume)
   if. 1={:$y do.
     5#"1 y NB. whole block is old-style
   else.
     old=. I.1 0 0 0 0 -:"1 0~:y NB. rows with only the first value
     if.#old do.
       (5#"0(old,each 0){y) old} y NB. untested
     else.
       y
     end.
   end.
)
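
For instance:

   fixoldtrades 2 1$7.5 9.25

should come back as a 2-by-5 table with each lone price copied
across all five columns, and rows with a 1 0 0 0 0 nonzero pattern
inside a mixed 5-column block should get the same row-by-row
treatment (again, untested).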

cleandates=:verb define
  NB. remove '-' only from dates (string-replacement alternative;
  NB. not called by rds, and assumes the header line is gone)
  NB. assume chrono order
  dt0=. 1{::<;._1 ',',(y i.LF) {. y NB. date field of first line
  dtn=. 1{::<;._1 ',',LF-.~y}.~(}:y) i:LF NB. date field of last line
  dates=. dt0 dtrange&(0&".)&(-.&'-') dtn NB. every date in the span
  after=. ',',.":,.0 100#.~.}:"1 dates NB. e.g. ',201706'
  before=. 1 1 1 1 1j1 1 1j1 #!.'-'"1 after NB. e.g. ',2017-06-'
  (before;"1 after) stringreplace y
)
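
This builds one (before;after) pair per year-month in the file's
date span and lets stringreplace do the rest, so -- if the dates
addon cooperates --

   cleandates 'AA,2017-06-27,31.6',LF,'AA,2017-06-28,32.1',LF

should come back with the dates as AA,20170627,31.6 and so on.
Only date fields can match, since the pattern includes the leading
comma and both hyphens.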

require 'dates'
dtrange=:dyad define
  NB. all dates from x through y, inclusive
  todate x range&(1&todayno) y
)

range=:dyad define
  x + i.1+y-x NB. integers from x through y, inclusive
)
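
Here, 3 range 6 is 3 4 5 6 -- endpoints included.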

What this does is read an arbitrarily large input file of trades and
generate an output directory with one csv file per ticker. In the
input, dates have minus signs embedded in them; those are removed
from the output files, for downstream performance reasons.
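
Concretely, the input line

   AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0

should land in AA.csv as

   20170627,31.6,32.5,31.49,31.63,5463485

with the ticker stripped off and the date packed into a plain number.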

But I've not tested this on a large dataset, and there may be
performance issues at that scale which would need to be remedied. So
call this a prototype rather than production code.

FYI,

-- 
Raul



On Tue, Mar 31, 2020 at 12:58 AM HH PackRat <[email protected]> wrote:
>
> Finishing up with function #4......
>
> I have a very large file consisting of multiple sets of historical
> stock prices that I would like to split into individual files for each
> stock.  (I'll probably first have to write out all the files to a USB
> flash drive [I have limited hard drive space, but it might work as a
> very tight fit] and then, when finished, burn them to a DVD-ROM for
> more permanent storage.)  Since I thought that J was capable of
> handling very large files, I figured that this might be a challenge to
> try.
>
> Unfortunately, I don't know how to handle file reading where you might
> only be able to read a part of the file at a time.  (I don't know how
> large a file J can read--maybe it can read the whole file.)  This file
> has 14,937,606 lines and is 1.63 GB (1,759,801,721 bytes) in size.
>
> Additionally (and probably most importantly), I don't know how to
> collect a subset of the contents of a file to output to a file, and
> then resume where J left off and collect the next subset of data to
> output, and so on.
>
> I'm going to need a LOT of help with this J programming!
>
> Below is a sample of the data--5 days' worth of data for 5 different
> stocks.  The master file is a csv file, and the individual outputs (5
> in this case) should also be csv files.  (Obviously, row 0 needs to be
> ignored.)  The output files should use the ticker symbol as the name
> for each file (e.g., AA.csv).  The ticker symbol (column 0) should be
> stripped off each line of data, with only the remainder of each row
> (date onward) retained and accumulated for output.
>
> Please correct me if I'm wrong, but my assumption is that if code
> works for these 25 lines of data, the code ought to work as well for
> 14,937,606 lines!
>
> DATA SET D:
> __________________________________________________
>
> ticker,date,open,high,low,close,volume
> AA,2017-06-27,31.6,32.5,31.49,31.63,5463485.0
> AA,2017-06-28,32.1,33.0,31.93,32.95,3764296.0
> AA,2017-06-29,33.11,33.34,32.61,33.18,3730077.0
> AA,2017-06-30,33.16,33.45,32.535,32.65,3014777.0
> AA,2017-07-03,32.94,34.3,32.915,34.02,3112086.0
> AAPL,2017-06-28,144.49,146.11,143.1601,145.83,21915939.0
> AAPL,2017-06-29,144.71,145.13,142.28,143.68,31116980.0
> AAPL,2017-06-30,144.45,144.96,143.78,144.02,22328979.0
> AAPL,2017-07-03,144.88,145.3001,143.1,143.5,14276812.0
> AAPL,2017-07-05,143.69,144.79,142.7237,144.09,20758795.0
> GE,2017-06-28,27.26,27.4,27.05,27.08,30759065.0
> GE,2017-06-29,27.16,27.41,26.79,27.02,36443559.0
> GE,2017-06-30,27.09,27.19,26.91,27.01,25849199.0
> GE,2017-07-03,27.16,27.59,27.06,27.45,20664966.0
> GE,2017-07-05,27.54,27.56,27.23,27.35,21082332.0
> IBM,2017-06-28,155.15,155.55,154.78,155.32,2203062.0
> IBM,2017-06-29,155.35,155.74,153.62,154.13,3245649.0
> IBM,2017-06-30,154.28,154.5,153.14,153.83,3501395.0
> IBM,2017-07-03,153.58,156.025,153.52,155.58,2822499.0
> IBM,2017-07-05,155.77,155.89,153.63,153.67,3558639.0
> T,2017-06-28,37.88,38.065,37.78,37.94,20312146.0
> T,2017-06-29,37.87,37.98,37.62,37.62,23508452.0
> T,2017-06-30,37.73,37.87,37.54,37.73,22303282.0
> T,2017-07-03,37.84,38.13,37.785,38.11,11123146.0
> T,2017-07-05,38.11,38.21,37.85,38.12,19644726.0
> __________________________________________________
>
> SUPER thanks in advance for any and all help with this one!
>
> Harvey
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
