Just in case it has not been mentioned before: memory-mapped files also
work very well for big data files.


See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files), or run
"doc_jmf_ [ load'jmf'" in a J session for a short summary.

You can map any file as character data with

JCHAR map_jmf_ 'var';'filename.csv'

Now the variable 'var' will look as if it contains all your data.
Two warnings, though:
1) Mind the mode: the default mode for mapping a file is read/write, so
changes to the variable are immediately written to disk. You can pass
MTRO_jmf_ as the mode to avoid this kind of mistake.
2) Watch out for operations that might copy a big chunk of the variable
(e.g. var2=: var ; var2=: }. var ; ...), which could make your program, or
others, crash due to memory exhaustion.
  Things like this do work well, though:
  LF +/@:= var  NB. number of LFs, i.e. line count
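For readers outside J, the same no-copy counting idea can be sketched with
Python's standard mmap module (the filename and function name here are
illustrative, not from the J code above):

```python
import mmap

def count_lines(path):
    """Count LF bytes in a file without reading it all into memory."""
    # Assumes a non-empty file; mmap length 0 maps the whole file, and
    # ACCESS_READ guards against accidental writes (cf. the RW warning).
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            total = 0
            window = 1 << 20  # scan in 1 MB windows; no large copy is made
            for off in range(0, len(mm), window):
                total += mm[off:off + window].count(b"\n")
            return total
```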

I once used them to work through some 4 GB CSVs to split them into records,
which went quite satisfactorily. To make matters easier for myself
afterwards, I wrote the records out sequentially, recording in a normal
array the (starting point, length) pair of each record, for easy indexing
into the files when mapped again using map_jmf_ . E.g. get record n using:
recordsJMF {~ (+i.)/ n { start_len ; or even: recordsJMF ({:@] {. {.@] }.
[) n { start_len. The latter is far more efficient for long records in
recent versions of J, because it uses virtual nouns (see
https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns
)
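The start/length bookkeeping described above can also be sketched in Python
(build_index and get_record are hypothetical names, assuming LF-delimited
records):

```python
import mmap

def build_index(path, delim=b"\n"):
    """Return a (start, length) pair for each delimited record in a file."""
    pairs = []
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        start = 0
        while start < len(mm):
            end = mm.find(delim, start)
            if end < 0:
                end = len(mm)  # last record may lack a trailing delimiter
            pairs.append((start, end - start))
            start = end + 1
    return pairs

def get_record(mm, pair):
    """Slice one record out of a mapped file via its (start, length) pair."""
    start, length = pair
    return mm[start:start + length]
```

Usage mirrors the J expression recordsJMF {~ (+i.)/ n { start_len: look up
record n's pair in the index, then slice the mapped file with it.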

Of course, what makes sense to you probably depends on your situation, e.g.
whether the data is going to change, how many times you intend to use the
same data, ...

Best regards,
Jan-Pieter

On Tue, Aug 17, 2021 at 6:56 PM Devon McCormick <[email protected]> wrote:

> Hi Mariuz,
> A while back I wrote some adverbs to apply a verb across a file in pieces:
> https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs .
>
> The simplest, most general one is "doSomethingSimple".  It applies the
> supplied verb to successive chunks of the file and allows work already done
> to be passed on to the next iteration.
>
> NB.* doSomethingSimple: apply verb to file making minimal assumptions about file structure.
> doSomethingSimple=: 1 : 0
>    'curptr chsz max flnm passedOn'=. 5{.y
>    if. curptr>:max do. ch=. curptr;chsz;max;flnm
>    else. ch=. readChunk curptr;chsz;max;flnm
>        passedOn=. u (_1{ch),<passedOn  NB. Allow u's work to be passed on to next invocation
>    end.
>    (4{.ch),<passedOn
> NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<''  NB. Return unique characters in file.
> )
>
> The sub-function "readChunk" looks like this:
>
> readChunk=: 3 : 0
>    'curptr chsz max flnm'=. 4{.y
>    if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
>    else. chunk=. '' end.
>    (curptr+chsz2);chsz2;max;flnm;chunk
> NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
> )
>
> Another adverb "doSomething" is similar but assumes you have something like
> line delimiters and you only want to process complete lines each time
> through.
>
> If you get a chance to take a look at these, please let me know what you
> think.
>
> Good luck,
>
> Devon
>
>
>
> On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:
>
> > Mariuz,
> >
> > I've used the following adverb (see below) to process 4 GB CSVs.
> > Basically it works through the file in byte chunks.  As the J forum
> > email tends to wreck embedded code, you can see how this adverb is used
> > in the database ETL system that uses it
> > here:
> >
> > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
> >
> > You might also find this amusing:
> >
> > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
> >
> > ireadapply=:1 : 0
> > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
> > NB.
> > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
> > NB.
> > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
> > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
> > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
> >
> > NB. file in, file out, line delimiter, block size, (u) verb data
> > 'fi fo d k ud'=. y
> >
> > p=. 0          NB. file pointer
> > c=. 0          NB. block count
> > s=. fsize fi   NB. file bytes
> > k=. k<.s       NB. first block size
> > NB.debug. b=. i.0 NB. block sizes (chk)
> >
> > while. p < s do.
> >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
> >   c=. >:c NB. block count
> >   NB. complete lines
> >   if. 0 = #l=. d beforelaststr r do.
> >     NB. final shard
> >     NB.debug. b=. b,#r
> >     u c;1;d;fo;r;<ud
> >     break.
> >   end.
> >   p=. p + #l     NB. inc file pointer
> >   k=. k <. s - p NB. next block size
> >   NB.debug. b=. b,#l NB. block sizes list
> >   NB. block number, shard, delimiter, file out, line bytes, (u) data
> >   u c;0;d;fo;l;<ud
> > end.
> >
> > NB.debug. 'byte mismatch' assert s = +/b
> > c NB. blocks processed
> > )
> >
> > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:
> >
> > > 1. As you have noticed, certainly. There are details, of course (what
> > > block size to use? Are files guaranteed to be well formed? If not,
> > > what are the error conditions? (Are certain characters illegal? Are
> > > lines longer than the block size allowed?) Do you want a callback
> > > interface for each block? If so, do you need an "end of file"
> > > indication? If so, is that a separate callback or a distinct argument
> > > to the block callback? etc.)
> > >
> > > 2. Again, as you have noticed: yes. And, there are analogous details
> > > here...
> > >
> > > 3. The expat API should only require J knowledge. There are a couple
> > > of examples in the addons/api/expat/test/ directory, named test0.ijs
> > > and test1.ijs.
> > >
> > > I hope this helps,
> > >
> > > --
> > > Raul
> > >
> > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
> > > <[email protected]> wrote:
> > > >
> > > > Thank you for some ideas on using an external parser.
> > > > Okay, now I have 3 questions:
> > > > 1. Is it possible to read a CSV file streaming-style (for example,
> > > > record by record) without loading everything into memory? Even if I
> > > > use some external parsing solution like XSLT, or just write something
> > > > myself in some language other than J, I will end up with a large CSV
> > > > instead of a large XML. It makes no difference. The reason I need to
> > > > parse it like this is that there are some rows I won't need; those
> > > > would be discarded depending on their field values.
> > > > If it is not possible, I would do more work outside of J in this
> > > > first parser, XML -> CSV.
> > > > 2. Is there a way to call an external program from a J script? If so,
> > > > is it possible to wait for it to finish?
> > > > If it is not possible, there are definitely ways to run J from other
> > > > programs.
> > > > 3. Can someone give a little bit of a pointer on how to use the
> > > > api/expat library? Do I need to familiarize myself with expat (the C
> > > > library), or should a good understanding of J and reading the small
> > > > tests in the package directory be enough?
> > > > I could send some example file, as Devon McCormick suggested.
> > > >
> > > > Right now I am working through the book "J: The natural language for
> > > > analytic computing" and playing around with problems like Project
> > > > Euler, but I could really see myself using J in serious work.
> > > >
> > > > Best regards,
> > > > MG
> > > >
> > > >
> > > > On Wed, Aug 11, 2021 at 9:51 AM <[email protected]> wrote:
> > > >
> > > > > In similar situations -but my files are not huge- I extract what I
> > > > > want into flattened CSV using one or more XQuery scripts, and then
> > > > > load the CSV files with J.  The code is clean, compact, and easy to
> > > > > maintain. For recurrent XQuery patterns, m4 occasionally comes to
> > > > > the rescue.  Expect minor portability issues when using different
> > > > > XQuery processors (extensions, language level...).
> > > > >
> > > > >
> > > > >
> > > > > Never got round to SAX parsing beyond tutorials, so I cannot
> > > > > compare.
> > > > >
> > > > >
> > > > > From: Mariusz Grasko <[email protected]>
> > > > > To: [email protected]
> > > > > Subject: [Jprogramming] Is it a good idea to use J for reading
> > > > > large XML files ?
> > > > > Date: 10/08/2021 18:05:45 Europe/Paris
> > > > >
> > > > > Hi,
> > > > >
> > > > > We are an ecommerce company and have a lot of integrations with
> > > > > suppliers; product info is nearly always in XML files. I am
> > > > > thinking about using J as an analysis tool. Do you think that
> > > > > working with large files that need to be parsed SAX-style, without
> > > > > reading everything at once, is a good idea in J? Also, is this even
> > > > > advantageous (as in, would the code be terse)? Right now XML
> > > > > parsing is done in Golang, so if parsing in J is not very good we
> > > > > could try to rely more on CSV exports. CSV is definitely very good
> > > > > in J.
> > > > > I am hoping that maybe XML parsing is very good in J and the code
> > > > > would become much smaller; if this is the case, then I would think
> > > > > about using J for XMLs with new suppliers.
> > > > >
> > > > > Best Regards
> > > > > M.G.
> > > > >
> > > > > ----------------------------------------------------------------------
> > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > >
> > > > >
> >
> >
> > --
> > John D. Baker
> > [email protected]
> >
>
>
> --
>
> Devon McCormick, CFA
>
> Quantitative Consultant
>
