Raul,

This large RAM usage happened for all files that I tested; there was
nothing particularly special about them. They mostly contain product
details from warehouses:
ID SKU EAN NAME DESC QNT PRICE TAX IMAGE and similar elements. Some of
the fields are attributes (price, tax, suggested price as attributes of
the price element), but most of it is text data inside elements. Product
descriptions usually take the most space.

I will do some more experiments on this tomorrow and try to measure when
exactly RAM usage increases. I am pretty sure it is in 'elm att val'=. x

Spike was probably the wrong word; there is a constant increase in RAM
usage, as if no memory is freed before parsing finishes. Once it
finishes, the memory is freed, or my system crashes if the file is large
enough (or I kill the process before the crash). The only assignment I
have is the line I mentioned; it seems to be required to run
expat_parse_xml
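For comparison, the underlying C library does stream: the same expat, driven through Python's stdlib xml.parsers.expat binding, can be fed a document in small chunks and only hands each element to the callbacks as it arrives. This is just a minimal sketch; the document and element names are invented for illustration, not taken from my files.

```python
import xml.parsers.expat

starts = []

def start_element(name, attrs):
    # expat delivers each element as it is parsed; nothing is kept
    # unless the callback itself stores it
    starts.append(name)

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element

doc = b"<products><p id='1'>a</p><p id='2'>b</p></products>"
for i in range(0, len(doc), 8):
    parser.Parse(doc[i:i + 8], False)   # isfinal=False: more input to come
parser.Parse(b"", True)                 # isfinal=True: end of document

print(starts)
```

So only one buffer's worth of input needs to exist at a time; any growth beyond that would come from what the callbacks (or the binding around them) retain.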

Best regards,
M.G.

On Tuesday, 7 September 2021, Raul Miller <[email protected]>
wrote:

> Looking at the implementation, it's literally setting up the expat
> parser to parse your file and then responding to your callbacks.
>
> If 'elm att val'=. x is creating a memory spike for you, that suggests
> that your test file contains extremely long xml elements. (Not text
> data, not nested xml elements, but probably extremely long lists of
> attributes.)
>
> That memory is freeable when your verb execution instance exits.
>
> That said, I haven't studied expat closely, and it looks like the
> implementation expects that expat will free memory provided as
> callback arguments. If this is not the case, that would cause some
> issues.
>
> Thanks,
>
> --
> Raul
>
>
> On Tue, Sep 7, 2021 at 7:57 AM Mariusz Grasko <[email protected]> wrote:
> >
> > So after some break and experimenting with other stuff, I have
> > revisited api/expat. It seems that expat, the C library for
> > stream-oriented XML parsing, should not load elements into RAM but
> > only process them as it goes, yielding tokens like StartElement,
> > chardata, attributes, etc. In this J library that seems not to be the
> > case, unless I am misunderstanding how to use it (which is the most
> > likely explanation).
> >
> > This is my version of a dumb parser that should do nothing, just pass
> > through a file without capturing any data in variables:
> >
> > NB. PROGRAM START
> >
> > require 'api/expat'
> > coinsert 'jexpat'
> >
> > expat_initx=: 3 : 0
> > id_offset=: y
> > elements=: 0 0$''
> > idnames=: 0$<''
> > parents=: 0$0
> > )
> >
> > expat_start_elementx=: 4 : 0
> > 'elm att val'=. x
> > smoutput 7!:0''
> > EMPTY
> > )
> >
> > expat_end_elementx=: 3 : 0
> > EMPTY
> > )
> >
> > expat_parse_xmlx=: 3 : 0
> > EMPTY
> > )
> >
> > 1 expat_parse_xml 1!:1 <'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'
> >
> > smoutput 'FINISH'
> >
> > NB. PROGRAM END
> >
> >
> > I have added smoutput 7!:0'' to confirm the memory usage increase. I
> > already knew about it because my large file would eventually crash
> > Windows. Small files take a small amount of memory and larger files
> > take more (not sure if it is a linear relationship; I haven't tested
> > it). Suffice it to say that a 296MB file would eventually push RAM
> > usage over 6GB.
> >
> > My guesses, as someone who is just starting with J:
> >
> > 1). The line 'elm att val'=. x is responsible for this memory spike;
> > maybe there is a way to free up those variables before parsing the
> > next start element?
> >
> > 2). There is some accumulation happening behind the scenes, maybe for
> > buffering and speed-up purposes; is there a way to limit it or reclaim
> > the memory?
> >
> > 3). My dumb parser is actually not just running through the file; it
> > does something I don't understand that causes it to capture data.
> >
> > 4). This implementation of J expat is not a stream parser; it just
> > uses Expat but loads the whole document into memory beforehand. (But
> > then why does the memory increase happen as each element is read,
> > which is confirmed by smoutput 7!:0'' ?)
> >
> >
> > Best regards,
> >
> > M.G.
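Guess (4) above can be probed against expat itself. In this sketch (again using Python's stdlib binding of the same C library; the element names and count are purely illustrative), the document is generated piece by piece and never exists in memory as a whole, yet every start-element callback still fires:

```python
import xml.parsers.expat

count = 0

def on_start(name, attrs):
    # count elements as expat encounters them
    global count
    count += 1

parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = on_start

# the full XML text never exists at once; it is fed one fragment at a time
parser.Parse("<root>", False)
for i in range(10000):
    parser.Parse("<item n='%d'/>" % i, False)
parser.Parse("</root>", True)

print(count)   # root plus 10000 items
```

If memory still grows steadily under a wrapper, the accumulation is in the wrapper or the callbacks, not in expat's core parsing loop.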
> >
> > On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs
> > <[email protected]> wrote:
> >
> > > Just in case it has not been mentioned before: memory-mapped files
> > > also work very well for big data files.
> > >
> > >
> > > See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files) and
> > > "doc_jmf_ [ load'jmf' " in a J session for a short summary.
> > >
> > > You can map any file as character data with
> > >
> > > JCHAR map_jmf_ 'var';'filename.csv'
> > >
> > > Now the variable 'var' will look as if it contains all your data.
> > > 2 warnings though:
> > > 1) Mind the mode: the default mode to map a file is RW, so changes
> > > to the variable are immediately written to disk. You can use
> > > MTRO_jmf_ as the mode to avoid this kind of mistake.
> > > 2) Watch out for operations that might copy a big chunk of the
> > > variable (e.g. var2=:var ; var2=: }. var ; ...), which could make
> > > your program, or others, crash due to memory exhaustion.
> > >   Things like this do work well though:
> > >   LF +/@:= var NB. number of LF's, i.e. line count
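The same scan-without-copying idea can be sketched in Python with the stdlib mmap module (the file path here is a throwaway created just for the demo). The mapped region is searched in place with find, so no full copy of the file is ever materialized:

```python
import mmap
import os
import tempfile

# throwaway demo file standing in for a real data file
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(path, "wb") as f:
    f.write(b"a,b\n1,2\n3,4\n")

with open(path, "rb") as f, \
        mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
    # ACCESS_READ plays the role of the MTRO_jmf_ warning above:
    # no accidental writes back to disk
    line_count = 0
    pos = m.find(b"\n")
    while pos != -1:             # scan in place; never copy the mapping
        line_count += 1
        pos = m.find(b"\n", pos + 1)

print(line_count)
```

The OS pages data in on demand, so this works the same way on files far larger than RAM.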
> > >
> > > I once used them to grok through some 4GB CSVs to split them into
> > > records, which went quite satisfactorily. To make matters easier for
> > > myself afterwards, I wrote them out sequentially, recording in a
> > > normal array a list of starting-point and length pairs for each
> > > record, for easy indexing into the files when mapped again using
> > > map_jmf_ . E.g. get record n using: recordsJMF {~ (+i.)/ n {
> > > start_len ; or even: recordsJMF ({:@] {. {.@] }. [) n { start_len .
> > > The last is, in recent versions of J, far more efficient for long
> > > lengths due to using virtual nouns (see
> > > https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns
> > > )
> > >
> > > Of course, what makes sense to you probably depends on your
> > > situation, e.g. whether the data is going to change, how many times
> > > you intend to use the same data, ...
> > >
> > > Best regards,
> > > Jan-Pieter
> > >
> > > On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]> wrote:
> > >
> > > > Hi Mariusz,
> > > > A while back I wrote some adverbs to apply a verb across a file in
> > > > pieces:
> > > > https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs
> > > >
> > > > The simplest, most general one is "doSomethingSimple". It applies
> > > > the supplied verb to successive chunks of the file and allows work
> > > > already done to be passed on to the next iteration.
> > > >
> > > > NB.* doSomethingSimple: apply verb to file making minimal assumptions
> > > > about file structure.
> > > > doSomethingSimple=: 1 : 0
> > > >    'curptr chsz max flnm passedOn'=. 5{.y
> > > >    if. curptr>:max do. ch=. curptr;chsz;max;flnm
> > > >    else. ch=. readChunk curptr;chsz;max;flnm
> > > >        passedOn=. u (_1{ch),<passedOn  NB. Allow u's work to be passed on to next invocation
> > > >    end.
> > > >    (4{.ch),<passedOn
> > > > NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<'' NB. Return unique characters in file.
> > > > )
> > > >
> > > > The sub-function "readChunk" looks like this:
> > > >
> > > > readChunk=: 3 : 0
> > > >    'curptr chsz max flnm'=. 4{.y
> > > >    if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
> > > >    else. chunk=. '' end.
> > > >    (curptr+chsz2);chsz2;max;flnm;chunk
> > > > NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
> > > > )
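A rough Python analogue of readChunk/doSomethingSimple, shown only to illustrate the chunk-plus-accumulator pattern (the file name and chunk size are arbitrary stand-ins). It reproduces the "unique characters in file" example from the NB.EG comment:

```python
import os
import tempfile

def chunks(path, size):
    """Yield successive byte blocks of the file, like readChunk."""
    with open(path, "rb") as f:
        while True:
            block = f.read(size)
            if not block:
                return
            yield block

# throwaway demo file standing in for 'bigFile.txt'
path = os.path.join(tempfile.mkdtemp(), "bigFile.txt")
with open(path, "wb") as f:
    f.write(b"abracadabra" * 1000)

# thread an accumulator through the chunks, like passedOn
passed_on = set()
for block in chunks(path, 4096):
    passed_on |= set(block)      # the "verb" applied to each chunk

print(sorted(map(chr, passed_on)))
```

Only one chunk is in memory at a time; everything that must survive between chunks travels in the accumulator.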
> > > >
> > > > Another adverb, "doSomething", is similar but assumes you have
> > > > something like line delimiters and you only want to process
> > > > complete lines each time through.
> > > >
> > > > If you get a chance to take a look at these, please let me know
> > > > what you think.
> > > >
> > > > Good luck,
> > > >
> > > > Devon
> > > >
> > > >
> > > >
> > > > On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:
> > > >
> > > > > Mariusz,
> > > > >
> > > > > I've used the following adverb (see below) to process 4-gig
> > > > > CSVs. Basically it works through the file in byte chunks. As the
> > > > > J forum email tends to wreck embedded code, you can see how this
> > > > > adverb is used in the database ETL system that uses it here:
> > > > >
> > > > > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
> > > > >
> > > > > You might also find this amusing:
> > > > >
> > > > > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
> > > > >
> > > > > ireadapply=:1 : 0
> > > > > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
> > > > > NB.
> > > > > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
> > > > > NB.
> > > > > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
> > > > > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
> > > > > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
> > > > >
> > > > > NB. file in, file out, line delimiter, block size, (u) verb data
> > > > > 'fi fo d k ud'=. y
> > > > >
> > > > > p=. 0 NB. file pointer
> > > > > c=. 0 NB. block count
> > > > > s=. fsize fi NB. file bytes
> > > > > k=. k<.s NB. first block size
> > > > > NB.debug. b=. i.0 NB. block sizes (chk)
> > > > >
> > > > > while. p < s do.
> > > > >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
> > > > >   c=. >:c NB. block count
> > > > >   NB. complete lines
> > > > >   if. 0 = #l=. d beforelaststr r do.
> > > > >     NB. final shard
> > > > >     NB.debug. b=. b,#r
> > > > >     u c;1;d;fo;r;<ud break.
> > > > >   end.
> > > > >   p=. p + #l NB. inc file pointer
> > > > >   k=. k <. s - p NB. next block size
> > > > >   NB.debug. b=. b,#l NB. block sizes list
> > > > >   NB. block number, shard, delimiter, file out, line bytes, (u) data
> > > > >   u c;0;d;fo;l;<ud
> > > > > end.
> > > > >
> > > > > NB.debug. 'byte mismatch' assert s = +/b
> > > > > c NB. blocks processed
> > > > > )
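The complete-line blocking that ireadapply does with beforelaststr can be sketched in Python (names, block size, and sample data are illustrative): each block is cut at its last delimiter and the leftover shard is carried into the next read, so downstream code only ever sees whole lines:

```python
import io

def line_blocks(f, block_size, delim=b"\n"):
    """Yield byte blocks that end on a delimiter, carrying the shard forward."""
    carry = b""
    while True:
        block = f.read(block_size)
        if not block:
            if carry:
                yield carry          # final shard, no trailing delimiter
            return
        buf = carry + block
        cut = buf.rfind(delim)
        if cut == -1:
            carry = buf              # no delimiter yet: keep accumulating
        else:
            yield buf[:cut + 1]      # everything up to the last delimiter
            carry = buf[cut + 1:]    # leftover for the next round

data = io.BytesIO(b"one\ntwo\nthree\nfour")
blocks = list(line_blocks(data, 5))
print(blocks)
```

Every yielded block except possibly the last ends exactly on a delimiter, and joining the blocks reproduces the input byte for byte.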
> > > > >
> > > > > > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:
> > > > >
> > > > > > 1. As you have noticed, certainly. There are details, of
> > > > > > course (what block size to use? Are files guaranteed to be
> > > > > > well formed? If not, what are the error conditions? Are certain
> > > > > > characters illegal? Are lines longer than the block size
> > > > > > allowed? Do you want a callback interface for each block? If
> > > > > > so, do you need an "end of file" indication? If so, is that a
> > > > > > separate callback or a distinct argument to the block callback?
> > > > > > etc.)
> > > > > >
> > > > > > 2. Again, as you have noticed: yes. And there are analogous
> > > > > > details here...
> > > > > >
> > > > > > 3. The expat API should only require J knowledge. There are a
> > > > > > couple of examples in the addons/api/expat/test/ directory,
> > > > > > named test0.ijs and test1.ijs.
> > > > > >
> > > > > > I hope this helps,
> > > > > >
> > > > > > --
> > > > > > Raul
> > > > > >
> > > > > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > Thank you for some ideas on using an external parser.
> > > > > > > Okay, now I have 3 questions:
> > > > > > > 1. Is it possible to read a CSV file streaming-style (for
> > > > > > > example record by record) without loading everything into
> > > > > > > memory? Even if I use some external parsing solution like
> > > > > > > XSLT, or just write something myself in some language other
> > > > > > > than J, I will end up with a large CSV instead of a large
> > > > > > > XML. It makes no difference. The reason I need to parse it
> > > > > > > like this is that there are some rows that I won't need;
> > > > > > > those would be discarded depending on their field values.
> > > > > > > If it is not possible, I would do more work outside of J in
> > > > > > > this first XML -> CSV parser.
> > > > > > > 2. Is there a way to call an external program from a J
> > > > > > > script? If so, is it possible to wait for it to finish?
> > > > > > > If it is not possible, there are definitely ways to run J
> > > > > > > from other programs.
> > > > > > > 3. Can someone give a little bit of a pointer on how to use
> > > > > > > the api/expat library? Do I need to familiarize myself with
> > > > > > > expat (the C library), or is a good understanding of J and
> > > > > > > reading the small tests in the package directory enough?
> > > > > > > I could send some example file like Devon McCormick
> > > > > > > suggested.
> > > > > > > I could send some example file like Devon McCormick suggested.
> > > > > > >
> > > > > > > Right now I am working through the book "J: The Natural
> > > > > > > Language for Analytic Computing" and playing around with
> > > > > > > problems like Project Euler, but I could really see myself
> > > > > > > using J in serious work.
> > > > > > >
> > > > > > > Best regards,
> > > > > > > MG
> > > > > > >
> > > > > > >
> > > > > > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
> > > > > > >
> > > > > > > > In similar situations - but my files are not huge - I
> > > > > > > > extract what I want into flattened CSV using one or more
> > > > > > > > XQuery scripts, and then load the CSV files with J. The
> > > > > > > > code is clean, compact and easy to maintain. For recurrent
> > > > > > > > XQuery patterns, m4 occasionally comes to the rescue.
> > > > > > > > Expect minor portability issues when using different
> > > > > > > > XQuery processors (extensions, language level...).
> > > > > > > >
> > > > > > > > Never got round to SAX parsing beyond tutorials, so I
> > > > > > > > cannot compare.
> > > > > > > >
> > > > > > > >
> > > > > > > > From: Mariusz Grasko <[email protected]>
> > > > > > > > To: [email protected]
> > > > > > > > Subject: [Jprogramming] Is is good idea to use J for reading large XML files ?
> > > > > > > > Date: 10/08/2021 18:05:45 Europe/Paris
> > > > > > > >
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > We are an ecommerce company and have a lot of integrations
> > > > > > > > with suppliers; product info is nearly always in XML
> > > > > > > > files. I am thinking about using J as an analysis tool. Do
> > > > > > > > you think that working with large files that need to be
> > > > > > > > parsed SAX-style, without reading everything at once, is a
> > > > > > > > good idea in J? Also, is this even advantageous (as in,
> > > > > > > > would the code be terse)? Right now XML parsing is done in
> > > > > > > > Golang, so if parsing in J is not very good we could try
> > > > > > > > to rely more on CSV exports. CSV is definitely very good
> > > > > > > > in J.
> > > > > > > > I am hoping that maybe XML parsing is very good in J and
> > > > > > > > the code would become much smaller; if this is the case,
> > > > > > > > then I would think about using J for XMLs with new
> > > > > > > > suppliers.
> > > > > > > >
> > > > > > > > Best Regards
> > > > > > > > M.G.
> > > > > > > >
> > > > > > > > ----------------------------------------------------------------------
> > > > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > >
> > > > >
> > > > > --
> > > > > John D. Baker
> > > > > [email protected]
> > > >
> > > >
> > > > --
> > > >
> > > > Devon McCormick, CFA
> > > >
> > > > Quantitative Consultant
> > >
>
