Hi Mariusz,
A while back I wrote some adverbs to apply a verb across a file in pieces:
https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs .
The simplest, most general one is "doSomethingSimple". It applies the
supplied verb to successive chunks of the file and allows work already done
to be passed on to the next iteration.
NB.* doSomethingSimple: apply verb to file making minimal assumptions about file structure.
doSomethingSimple=: 1 : 0
'curptr chsz max flnm passedOn'=. 5{.y
if. curptr>:max do. ch=. curptr;chsz;max;flnm
else. ch=. readChunk curptr;chsz;max;flnm
passedOn=. u (_1{ch),<passedOn NB. Allow u's work to be passed on to next invocation
end.
(4{.ch),<passedOn
NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<'' NB. Return unique characters in file.
)
The sub-function "readChunk" looks like this:
readChunk=: 3 : 0
'curptr chsz max flnm'=. 4{.y
if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
else. chunk=. '' end.
(curptr+chsz2);chsz2;max;flnm;chunk
NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
)
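
As a rough sketch of how the passed-on value can carry a running result, here is a line counter. The verb name "countLF", the demo file name and the tiny chunk size are made up for illustration; it assumes doSomethingSimple and readChunk are already defined as above and that the standard file utilities (fwrite, fsize, fread) are loaded.

```j
NB. Hypothetical demo file (9 bytes, 3 lines); name and contents are illustrative only.
fl=: jpath '~temp/chunkdemo.txt'
('a',LF,'bb',LF,'ccc',LF) fwrite fl

NB. countLF: y is chunk;passedOn. Adds this chunk's LF count to the running total.
countLF=: 3 : 0
'chunk sofar'=. y
sofar + +/ LF = chunk
)

NB. Start at byte 0, read 4 bytes at a time, initial total 0.
res=: countLF doSomethingSimple ^:_ ] 0;4;(fsize fl);fl;<0
>_1{res   NB. 3 for the demo file above
```

The ^:_ stops because once curptr reaches max the adverb returns its argument unchanged, so the result matches the input.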
Another adverb "doSomething" is similar but assumes you have something like
line delimiters and you only want to process complete lines each time
through.
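
This is not the code of "doSomething" itself, only a sketch of the same idea built on doSomethingSimple: the passed-on value holds the unfinished last line of each chunk so that only complete lines get processed. The verb name "longestLine", the demo file and the chunk size are assumptions for illustration, and doSomethingSimple and readChunk are assumed to be loaded as defined above.

```j
NB. Hypothetical demo file (9 bytes, 3 lines); name and contents are illustrative only.
fl=: jpath '~temp/linedemo.txt'
('a',LF,'bb',LF,'ccc',LF) fwrite fl

NB. longestLine: y is chunk;state where state is (unfinished line);(longest length so far).
longestLine=: 3 : 0
'chunk state'=. y
'tail best'=. state
txt=. tail,chunk
n=. +/ +./\. LF = txt                 NB. bytes up to and including the last LF (0 if none)
best=. best >. >./ 0 , #;._2 n{.txt   NB. lengths of the complete lines only
(n}.txt);best                         NB. carry the unfinished part forward
)

NB. 4-byte chunks split lines mid-way on purpose; initial state is '';0.
res=: longestLine doSomethingSimple ^:_ ] 0;4;(fsize fl);fl;<'';0
'tail best'=: >_1{res
best >. #tail   NB. 3 for the demo file; #tail covers a final line with no LF
```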
If you get a chance to take a look at these, please let me know what you
think.
Good luck,
Devon
On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:
> Mariusz,
>
> I've used the following adverb (see below) to process 4gig CSVs. Basically
> it works through the file in byte chunks. As the j forum email tends to wreck
> embedded code you can see how this adv is used in the database ETL system that
> uses it here:
>
> https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
>
> You might also find this amusing:
>
>
> https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
>
> ireadapply=:1 : 0
>
> NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
> NB.
> NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
> NB.
> NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
> NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
> NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
>
> NB. file in, file out, line delimiter, block size, (u) verb data
> 'fi fo d k ud'=. y
>
> p=. 0 NB. file pointer
> c=. 0 NB. block count
> s=. fsize fi NB. file bytes
> k=. k<.s NB. first block size
> NB.debug. b=. i.0 NB. block sizes (chk)
>
> while. p < s do.
>   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
>   c=. >:c NB. block count
>   NB. complete lines
>   if. 0 = #l=. d beforelaststr r do.
>     NB. final shard
>     NB.debug. b=. b,#r
>     u c;1;d;fo;r;<ud break.
>   end.
>   p=. p + #l NB. inc file pointer
>   k=. k <. s - p NB. next block size
>   NB.debug. b=. b,#l NB. block sizes list
>   NB. block number, shard, delimiter, file out, line bytes, (u) data
>   u c;0;d;fo;l;<ud
> end.
>
> NB.debug. 'byte mismatch' assert s = +/b
> c NB. blocks processed
> )
>
> On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:
>
> > 1. As you have noticed, certainly. There are details, of course (what
> > block size to use? Are files guaranteed to be well formed? If not,
> > what are error conditions? (are certain characters illegal? Are lines
> > longer than the block size allowed?) Do you want a callback interface
> > for each block? If so, do you need an "end of file" indication? If so,
> > is that a separate callback or a distinct argument to the block
> > callback? etc.)
> >
> > 2. Again, as you have noticed: yes. And, there are analogous details
> > here...
> >
> > 3. The expat API should only require J knowledge. There are a couple
> > examples in the addons/api/expat/test/ directory named test0.ijs and
> > test1.ijs
> >
> > I hope this helps,
> >
> > --
> > Raul
> >
> > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
> > <[email protected]> wrote:
> > >
> > > Thank you for some ideas on using external parser.
> > > Okay now I have 3 questions:
> > > 1. Is it possible to read a CSV file streaming-style (for example record by
> > > record) without loading everything in memory? Even if I use some external
> > > parsing solution like XSLT or just write something myself in some other
> > > language than J, I will end up with a large CSV instead of a large XML. It
> > > makes no difference. The reason that I need to parse it like this is that
> > > there are some rows that I won't need; those would be discarded depending
> > > on their field values.
> > > If it is not possible I would do more work outside of J in this first
> > > parser XML -> CSV.
> > > 2. Is there a way to call an external program from a J script? If so, is it
> > > possible to wait for it to finish?
> > > If it is not possible, there are definitely ways to run J from other
> > > programs.
> > > 3. Can someone give a little bit of a pointer on how to use the api/expat
> > > library? Do I need to familiarize myself with expat (C library), or should
> > > a good understanding of J and reading the small tests in the package
> > > directory be enough?
> > > I could send some example file like Devon McCormick suggested.
> > >
> > > Right now I am working through the book "J: The natural language for analytic
> > > computing" and playing around with problems like Project Euler, but I could
> > > really see myself using J in serious work.
> > >
> > > Best regards,
> > > MG
> > >
> > >
> > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
> > >
> > > > In similar situations -but my files are not huge- I extract what I want
> > > > into flattened CSV using one or more XQuery scripts, and then load the CSV
> > > > files with J. The code is clean, compact and easy to maintain. For
> > > > recurrent XQuery patterns, m4 occasionally comes to the rescue. Expect
> > > > minor portability issues when using different XQuery processors
> > > > (extensions, language level...).
> > > >
> > > >
> > > >
> > > > Never got round to SAX parsing beyond tutorials, so I cannot compare.
> > > >
> > > >
> > > > From: Mariusz Grasko <[email protected]>
> > > > To: [email protected]
> > > > Subject: [Jprogramming] Is is good idea to use J for reading large XML files ?
> > > > Date: 10/08/2021 18:05:45 Europe/Paris
> > > >
> > > > Hi,
> > > >
> > > > We are an ecommerce company and have a lot of integrations with suppliers;
> > > > product info is nearly always in XML files. I am thinking about using J as
> > > > an analysis tool. Do you think that working with large files that need to
> > > > be parsed SAX-style, without reading everything at once, is a good idea in J?
> > > > Also, is this even advantageous (as in, would the code be terse)? Right now
> > > > XML parsing is done in Golang, so if parsing in J is not very good we could
> > > > try to rely more on CSV exports. CSV is definitely very good in J.
> > > > I am hoping that maybe XML parsing is very good in J and the code would
> > > > become much smaller. If this is the case, then I would think about using J
> > > > for XMLs with new suppliers.
> > > >
> > > > Best Regards
> > > > M.G.
> > > >
> > > > ----------------------------------------------------------------------
> > > > For information about J forums see http://www.jsoftware.com/forums.htm
> >
>
>
> --
> John D. Baker
> [email protected]
>
--
Devon McCormick, CFA
Quantitative Consultant