IIRC expat uses the push model to get elements and should be memory efficient. Please try the approach suggested by Raul to see if there will be any improvement.
On Tue, Sep 7, 2021, 7:57 PM Mariusz Grasko <[email protected]> wrote:

So after some break and experimenting with other stuff, I have revisited api/expat. It seems that expat, the C library for stream-oriented XML parsing, should not load elements into RAM but only process as you go, yielding tokens such as start elements, character data, attributes, etc. In this J library that does not seem to be the case, unless I am misunderstanding how to use it (which is the most likely explanation).

This is my version of a dumb parser that should do nothing, just pass through a file without capturing any data in variables:

    NB. PROGRAM START
    require 'api/expat'
    coinsert 'jexpat'

    expat_initx=: 3 : 0
    id_offset=: y
    elements=: 0 0$''
    idnames=: 0$<''
    parents=: 0$0
    )

    expat_start_elementx=: 4 : 0
    'elm att val'=. x
    smoutput 7!:0''
    EMPTY
    )

    expat_end_elementx=: 3 : 0
    EMPTY
    )

    expat_parse_xmlx=: 3 : 0
    EMPTY
    )

    1 expat_parse_xml 1!:1 <'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'

    smoutput 'FINISH'
    NB. PROGRAM END

I added smoutput 7!:0'' to confirm the memory usage increase. I already knew about it, because my large file would eventually crash Windows. Small files take a small amount of memory and larger files take more (not sure if it is a linear relationship; I haven't tested). Enough to say that a 296MB file would eventually reach over 6GB of RAM usage.

My guesses, as someone who is just starting J:

1) The line 'elm att val'=. x is responsible for the memory spike; maybe there is a way to free up those variables before parsing the next start element?
2) There is some accumulation happening behind the curtains, maybe for buffering and speed-up purposes; is there a way to limit it or reclaim the memory?
3) My dumb parser is actually not just running through the file; it does something which I don't understand that causes it to capture data.
4) This implementation of J expat is not a stream parser; it just utilizes Expat but loads the whole document into memory beforehand. (But then why does the memory increase take time as each element is read, which is confirmed by smoutput 7!:0''?)

Best regards,
M.G.

On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs <[email protected]> wrote:

Since no one has mentioned it before: memory-mapped files also work very well for big data files.

See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files) and "doc_jmf_ [ load'jmf'" in a J session for a short summary.

You can map any file as character data with

    JCHAR map_jmf_ 'var';'filename.csv'

Now the variable 'var' will look as if it contains all your data. Two warnings, though:
1) Mind the mode: the default mode to map a file is RW, so changes to the variable are immediately written to disk. You can use MTRO_jmf_ as the mode to avoid this kind of mistake.
2) Watch out with operations that might copy a big chunk of the variable (e.g. var2=: var, var2=: }. var, ...), which could make your program, or others, crash due to memory exhaustion.
Things like this do work well, though:

    LF +/@:= var  NB. number of LFs, i.e. line count

I once used them to grok through some 4GB CSVs to split them into records, which went quite satisfactorily. To make matters easier for myself afterwards, I wrote them out sequentially, recording in a normal array a list of the starting-point and length pairs of each record, for easy indexing into the files when mapped again using map_jmf_. E.g. get record n using:

    recordsJMF {~ (+i.)/ n { start_len

or even:

    recordsJMF ({:@] {. {.@] }. [) n { start_len
The last is, in recent versions of J, far more efficient for long lengths due to using virtual nouns (see https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns).

Of course, what makes sense for you probably depends on your situation, e.g. whether the data is going to change, how many times you intend to use the same data, ...

Best regards,
Jan-Pieter

On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]> wrote:

Hi Mariusz,
A while back I wrote some adverbs to apply a verb across a file in pieces:
https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs

The simplest, most general one is "doSomethingSimple". It applies the supplied verb to successive chunks of the file and allows work already done to be passed on to the next iteration.

    NB.* doSomethingSimple: apply verb to file making minimal assumptions about file structure.
    doSomethingSimple=: 1 : 0
    'curptr chsz max flnm passedOn'=. 5{.y
    if. curptr>:max do. ch=. curptr;chsz;max;flnm
    else. ch=. readChunk curptr;chsz;max;flnm
      passedOn=. u (_1{ch),<passedOn  NB. Allow u's work to be passed on to next invocation
    end.
    (4{.ch),<passedOn
    NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<''  NB. Return unique characters in file.
    )

The sub-function "readChunk" looks like this:

    readChunk=: 3 : 0
    'curptr chsz max flnm'=. 4{.y
    if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
    else. chunk=. '' end.
    (curptr+chsz2);chsz2;max;flnm;chunk
    NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
    )

Another adverb, "doSomething", is similar but assumes you have something like line delimiters and you only want to process complete lines each time through.

If you get a chance to take a look at these, please let me know what you think.

Good luck,

Devon

On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:

Mariusz,

I've used the following adverb (see below) to process 4-gig CSVs. Basically it works through the file in byte chunks. As the J forum email tends to wreck embedded code, you can see how this adverb is used in the database ETL system that uses it here:

https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf

You might also find this amusing:

https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/

    ireadapply=:1 : 0
    NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
    NB.
    NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
    NB.
    NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
    NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
    NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''

    NB. file in, file out, line delimiter, block size, (u) verb data
    'fi fo d k ud'=. y

    p=. 0         NB. file pointer
    c=. 0         NB. block count
    s=. fsize fi  NB. file bytes
    k=. k<.s      NB. first block size
    NB.debug. b=. i.0  NB. block sizes (chk)

    while. p < s do.
      'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
      c=. >:c  NB. block count
      NB. complete lines
      if. 0 = #l=. d beforelaststr r do.
        NB. final shard
        NB.debug. b=. b,#r
        u c;1;d;fo;r;<ud
        break.
      end.
      p=. p + #l      NB. inc file pointer
      k=. k <. s - p  NB. next block size
      NB.debug. b=. b,#l  NB. block sizes list
      NB. block number, shard, delimiter, file out, line bytes, (u) data
      u c;0;d;fo;l;<ud
    end.

    NB.debug. 'byte mismatch' assert s = +/b
    c  NB. blocks processed
    )

On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:

1. As you have noticed, certainly. There are details, of course (what block size to use? Are files guaranteed to be well formed? If not, what are the error conditions? Are certain characters illegal? Are lines longer than the block size allowed? Do you want a callback interface for each block? If so, do you need an "end of file" indication? If so, is that a separate callback or a distinct argument to the block callback? etc.)

2. Again, as you have noticed: yes. And there are analogous details here...

3. The expat API should only require J knowledge. There are a couple of examples in the addons/api/expat/test/ directory, named test0.ijs and test1.ijs.

I hope this helps,

--
Raul

On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko <[email protected]> wrote:

Thank you for some ideas on using an external parser.
Okay, now I have 3 questions:
1. Is it possible to read a CSV file streaming-style (for example, record by record) without loading everything into memory? Even if I use some external parsing solution like XSLT, or just write something myself in some language other than J, I will end up with a large CSV instead of a large XML. It makes no difference. The reason I need to parse it like this is that there are some rows that I won't need; those would be discarded depending on their field values. If it is not possible, I would do more work outside of J in this first XML -> CSV parser.
2. Is there a way to call an external program from a J script? If so, is it possible to wait for it to finish? If it is not possible, there are definitely ways to run J from other programs.
3. Can someone give a little bit of a pointer on how to use the api/expat library? Do I need to familiarize myself with expat (the C library), or should a good understanding of J and reading the small tests in the package directory be enough?
I could send some example file, as Devon McCormick suggested.

Right now I am working through the book "J: The Natural Language for Analytic Computing" and playing around with problems like Project Euler, but I could really see myself using J in serious work.

Best regards,
MG

On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:

In similar situations (but my files are not huge) I extract what I want into flattened CSV using one or more XQuery scripts, and then load the CSV files with J.
The code is clean, compact and easy to maintain. For recurrent XQuery patterns, m4 occasionally comes to the rescue. Expect minor portability issues when using different XQuery processors (extensions, language level...).

I never got round to SAX parsing beyond tutorials, so I cannot compare.

From: Mariusz Grasko <[email protected]>
To: [email protected]
Subject: [Jprogramming] Is it a good idea to use J for reading large XML files?
Date: 10/08/2021 18:05:45 Europe/Paris

Hi,

We are an ecommerce company and have a lot of integrations with suppliers; product info is nearly always in XML files. I am thinking about using J as an analysis tool. Do you think that working with large files that need to be parsed SAX-style, without reading everything at once, is a good idea in J? Also, is this even advantageous (as in, would the code be terse)? Right now XML parsing is done in Golang, so if parsing in J is not very good, we could try to rely more on CSV exports. CSV handling is definitely very good in J.
I am hoping that XML parsing is also very good in J and the code would become much smaller; if that is the case, then I would think about using J for XMLs with new suppliers.

Best Regards,
M.G.
--
John D. Baker
[email protected]

--
Devon McCormick, CFA
Quantitative Consultant

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
