Thank you for taking the time to look at this. Is something wrong with how the FFI layer that calls expat is written, or could someone just tweak the J part to make it work that way?
I will try to understand how those FFI libraries work. Do I just need to know the C language, or are there other prerequisites?

Best regards,
M.G.

On Wed, 8 Sep 2021 at 02:04, Raul Miller <[email protected]> wrote:
> Hmm....
>
> I think I see at least a part of the issues you are talking about.
>
> Currently, J's api/expat expects you to provide the complete contents of the xml document when you call expat_parse_xml.
>
> For large files, it would make sense to have a similarly structured expat_parse_file which takes a file name argument and parses it a block at a time. (And perhaps this should have an optional block size argument, though maybe a default of one megabyte should work for most contexts.)
>
> The steadily increasing memory use while it's parsing suggests that expat itself might only be freeing its memory after it parses the block of text which it was given.
>
> Thanks,
>
> --
> Raul
>
> On Tue, Sep 7, 2021 at 5:22 PM Mariusz Grasko <[email protected]> wrote:
> >
> > Raul,
> >
> > This large RAM usage happened for all files that I tested; there was nothing particularly special about them. It is mostly product details from warehouses: ID SKU EAN NAME DESC QNT PRICE TAX IMAGE, these types of elements. Some of them are attributes (price, tax, suggested price as attributes in the price element), but this is mostly text data inside elements. Product descriptions usually take up most of the space.
> >
> > I will do some more experiments on this tomorrow and try to measure when exactly RAM usage increases. I am pretty sure it is in 'elm att val'=. x
> >
> > Spike was probably the wrong word; there is a constant increase in RAM usage, as if no memory were freed before parsing finishes. Once it finishes, the memory is freed, or my system crashes if the file is large enough (or I kill the process before the crash).
> > The only assignment I have is this line that I mentioned; it seems to be required to run expat_parse_xml.
> >
> > Best regards,
> > M.G.
> >
> > On Tuesday, 7 September 2021, Raul Miller <[email protected]> wrote:
> > > Looking at the implementation, it's literally setting up the expat parser to parse your file and then responding to your callbacks.
> > >
> > > If 'elm att val'=. x is creating a memory spike for you, that suggests that your test file contains extremely long xml elements. (Not text data, not nested xml elements, but probably extremely long lists of attributes.)
> > >
> > > That memory is freeable when your verb execution instance exits.
> > >
> > > That said, I haven't studied expat closely, and it looks like the implementation expects that expat will free memory provided as callback arguments. If this is not the case, that would cause some issues.
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Tue, Sep 7, 2021 at 7:57 AM Mariusz Grasko <[email protected]> wrote:
> > > >
> > > > So after some break and experimenting with other stuff I have revisited api/expat. It seems that expat, the C library for stream-oriented XML parsing, should not load elements into RAM, only process them as you go, yielding tokens like start elements, chardata, attributes etc. In this J library that seems not to be the case, unless I am misunderstanding how to use it (which is the most likely explanation).
> > > >
> > > > This is my version of a dumb parser that should do nothing, just pass through a file without capturing any data in variables:
> > > >
> > > > NB. PROGRAM START
> > > > require 'api/expat'
> > > > coinsert 'jexpat'
> > > >
> > > > expat_initx=: 3 : 0
> > > >   id_offset=: y
> > > >   elements=: 0 0$''
> > > >   idnames=: 0$<''
> > > >   parents=: 0$0
> > > > )
> > > >
> > > > expat_start_elementx=: 4 : 0
> > > >   'elm att val'=. x
> > > >   smoutput 7!:0''
> > > >   EMPTY
> > > > )
> > > >
> > > > expat_end_elementx=: 3 : 0
> > > >   EMPTY
> > > > )
> > > >
> > > > expat_parse_xmlx=: 3 : 0
> > > >   EMPTY
> > > > )
> > > >
> > > > 1 expat_parse_xml 1!:1 <'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'
> > > >
> > > > smoutput 'FINISH'
> > > > NB. PROGRAM END
> > > >
> > > > I have added smoutput 7!:0'' to confirm the memory usage increase. I already knew about it because my large file would eventually crash Windows. Small files take a small amount of memory and larger files take more (not sure if it is a linear relationship, I haven't tested it). Enough to say that a 296MB file would eventually get over 6GB of RAM usage.
> > > >
> > > > My guesses as someone who is just starting J:
> > > >
> > > > 1) 'elm att val'=. x is the line responsible for this memory spike; maybe there is a way to free those variables before parsing the next start element?
> > > >
> > > > 2) There is some accumulation happening behind the curtains, maybe for buffering and speed; is there a way to limit it or reclaim the memory?
> > > >
> > > > 3) My dumb parser is actually not just running through the file, and does something which I don't understand that causes it to capture data.
> > > >
> > > > 4) This implementation of J expat is not a stream parser; it just utilizes expat but loads the whole document into memory beforehand. (But then why does memory usage increase as each next element is read, which is confirmed by smoutput 7!:0''?)
> > > >
> > > > Best regards,
> > > >
> > > > M.G.
> > > >
> > > > On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs <[email protected]> wrote:
> > > > > Since it has not been mentioned before: memory-mapped files also work very well for big data files.
> > > > >
> > > > > See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files) and "doc_jmf_ [ load'jmf'" in a J session for a short summary.
> > > > >
> > > > > You can map any file as character data with
> > > > >
> > > > > JCHAR map_jmf_ 'var';'filename.csv'
> > > > >
> > > > > Now the variable 'var' will look as if it contains all your data.
> > > > > Two warnings though:
> > > > > 1) Mind the mode: the default mode to map a file is RW, so changes to the variable are immediately written to disk. You can use MTRO_jmf_ as the mode to avoid this kind of mistake.
> > > > > 2) Watch out for operations that might copy a big chunk of the variable (e.g. var2=:var ; var2=: }. var ; ...), which could make your program, or others, crash due to memory exhaustion.
> > > > > Things like this do work well though:
> > > > > LF +/@:= var  NB. number of LF's, i.e. line count
> > > > >
> > > > > I once used them to grok through some 4GB CSVs to split them into records, which went quite satisfactorily. To make matters easier for myself afterwards, I wrote them out sequentially, recording in a normal array a list of the starting point and length pairs of each record, for easy indexing into the files when mapped again using map_jmf_. E.g. get record n using: recordsJMF {~ (+i.)/ n { start_len ; or even: recordsJMF ({:@] {. {.@] }. [) n { start_len.
> > > > > The last is, in recent versions of J, far more efficient for long lengths due to using virtual nouns (see https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns).
> > > > >
> > > > > Of course, what makes sense to you probably depends on your situation, e.g. whether the data is going to change, how many times you intend to use the same data, ...
> > > > >
> > > > > Best regards,
> > > > > Jan-Pieter
> > > > >
> > > > > On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]> wrote:
> > > > > > Hi Mariuz,
> > > > > > A while back I wrote some adverbs to apply a verb across a file in pieces:
> > > > > > https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs
> > > > > >
> > > > > > The simplest, most general one is "doSomethingSimple". It applies the supplied verb to successive chunks of the file and allows work already done to be passed on to the next iteration.
> > > > > >
> > > > > > NB.* doSomethingSimple: apply verb to file making minimal assumptions about file structure.
> > > > > > doSomethingSimple=: 1 : 0
> > > > > >   'curptr chsz max flnm passedOn'=. 5{.y
> > > > > >   if. curptr>:max do. ch=. curptr;chsz;max;flnm
> > > > > >   else. ch=. readChunk curptr;chsz;max;flnm
> > > > > >     passedOn=. u (_1{ch),<passedOn  NB. Allow u's work to be passed on to next invocation
> > > > > >   end.
> > > > > >   (4{.ch),<passedOn
> > > > > > NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<''  NB. Return unique characters in file.
> > > > > > )
> > > > > >
> > > > > > The sub-function "readChunk" looks like this:
> > > > > >
> > > > > > readChunk=: 3 : 0
> > > > > >   'curptr chsz max flnm'=. 4{.y
> > > > > >   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
> > > > > >   else. chunk=. '' end.
> > > > > >   (curptr+chsz2);chsz2;max;flnm;chunk
> > > > > > NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
> > > > > > )
> > > > > >
> > > > > > Another adverb, "doSomething", is similar but assumes you have something like line delimiters and you only want to process complete lines each time through.
> > > > > >
> > > > > > If you get a chance to take a look at these, please let me know what you think.
> > > > > >
> > > > > > Good luck,
> > > > > >
> > > > > > Devon
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:
> > > > > > > Mariuz,
> > > > > > >
> > > > > > > I've used the following adverb (see below) to process 4gig CSVs. Basically it works through the file in byte chunks. As the J forum email tends to wreck embedded code, you can see how this adverb is used in the database ETL system that uses it here:
> > > > > > >
> > > > > > > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
> > > > > > >
> > > > > > > You might also find this amusing:
> > > > > > >
> > > > > > > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
> > > > > > >
> > > > > > > ireadapply=: 1 : 0
> > > > > > > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
> > > > > > > NB.
> > > > > > > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
> > > > > > > NB.
> > > > > > > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
> > > > > > > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
> > > > > > > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
> > > > > > >
> > > > > > > NB. file in, file out, line delimiter, block size, (u) verb data
> > > > > > > 'fi fo d k ud'=. y
> > > > > > >
> > > > > > > p=. 0          NB. file pointer
> > > > > > > c=. 0          NB. block count
> > > > > > > s=. fsize fi   NB. file bytes
> > > > > > > k=. k<.s       NB. first block size
> > > > > > > NB.debug. b=. i.0  NB. block sizes (chk)
> > > > > > >
> > > > > > > while. p < s do.
> > > > > > >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
> > > > > > >   c=. >:c  NB. block count
> > > > > > >   NB. complete lines
> > > > > > >   if. 0 = #l=. d beforelaststr r do.
> > > > > > >     NB. final shard
> > > > > > >     NB.debug. b=. b,#r
> > > > > > >     u c;1;d;fo;r;<ud break.
> > > > > > >   end.
> > > > > > >   p=. p + #l      NB. inc file pointer
> > > > > > >   k=. k <. s - p  NB. next block size
> > > > > > >   NB.debug. b=. b,#l  NB. block sizes list
> > > > > > >   NB. block number, shard, delimiter, file out, line bytes, (u) data
> > > > > > >   u c;0;d;fo;l;<ud
> > > > > > > end.
> > > > > > >
> > > > > > > NB.debug. 'byte mismatch' assert s = +/b
> > > > > > > c  NB. blocks processed
> > > > > > > )
> > > > > > >
> > > > > > > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:
> > > > > > > > 1. As you have noticed, certainly. There are details, of course (what block size to use? Are files guaranteed to be well formed? If not, what are the error conditions? Are certain characters illegal? Are lines longer than the block size allowed? Do you want a callback interface for each block? If so, do you need an "end of file" indication? If so, is that a separate callback or a distinct argument to the block callback? etc.)
> > > > > > > >
> > > > > > > > 2. Again, as you have noticed: yes. And there are analogous details here...
> > > > > > > >
> > > > > > > > 3. The expat API should only require J knowledge. There are a couple of examples in the addons/api/expat/test/ directory, named test0.ijs and test1.ijs.
> > > > > > > >
> > > > > > > > I hope this helps,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Raul
> > > > > > > >
> > > > > > > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Thank you for some ideas on using an external parser.
> > > > > > > > > Okay, now I have 3 questions:
> > > > > > > > > 1. Is it possible to read a CSV file streaming-style (for example, record by record) without loading everything into memory? Even if I use some external parsing solution like XSLT, or just write something myself in some language other than J, I will end up with a large CSV instead of a large XML. It makes no difference. The reason that I need to parse it like this is that there are some rows that I won't need; those would be discarded depending on their field values.
> > > > > > > > > If it is not possible, I would do more work outside of J in this first parser, XML -> CSV.
> > > > > > > > > 2. Is there a way to call an external program from a J script? If so, is it possible to wait for it to finish?
> > > > > > > > > If it is not possible, there are definitely ways to run J from other programs.
> > > > > > > > > 3. Can someone give a bit of a pointer on how to use the api/expat library? Do I need to familiarize myself with expat (the C library), or should a good understanding of J and reading the small tests in the package directory be enough?
> > > > > > > > > I could send some example file, like Devon McCormick suggested.
> > > > > > > > >
> > > > > > > > > Right now I am working through the book "J: The Natural Language for Analytic Computing" and playing around with problems like Project Euler, but I could really see myself using J in serious work.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > MG
> > > > > > > > >
> > > > > > > > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
> > > > > > > > > > In similar situations (but my files are not huge) I extract what I want into flattened CSV using one or more XQuery scripts, and then load the CSV files with J. The code is clean, compact and easy to maintain. For recurrent XQuery patterns, m4 occasionally comes to the rescue. Expect minor portability issues when using different XQuery processors (extensions, language level...).
> > > > > > > > > >
> > > > > > > > > > Never got round to SAX parsing beyond tutorials, so I cannot compare.
> > > > > > > > > >
> > > > > > > > > > From: Mariusz Grasko <[email protected]>
> > > > > > > > > > To: [email protected]
> > > > > > > > > > Subject: [Jprogramming] Is it a good idea to use J for reading large XML files?
> > > > > > > > > > Date: 10/08/2021 18:05:45 Europe/Paris
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > We are an ecommerce company and have a lot of integrations with suppliers; product info is nearly always in XML files. I am thinking about using J as an analysis tool. Do you think that working with large files that need to be parsed SAX-style, without reading everything at once, is a good idea in J? Also, is this even advantageous (as in, would the code be terse)? Right now XML parsing is done in Golang, so if parsing in J is not very good we could try to rely more on CSV exports. CSV is definitely very good in J.
> > > > > > > > > > I am hoping that maybe XML parsing is very good in J and the code would become much smaller; if this is the case, then I would think about using J for XMLs with new suppliers.
> > > > > > > > > >
> > > > > > > > > > Best Regards
> > > > > > > > > > M.G.
> > > > > > > > > > ----------------------------------------------------------------------
> > > > > > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > > > >
> > > > > > > --
> > > > > > > John D. Baker
> > > > > > > [email protected]
> > > > > >
> > > > > > --
> > > > > > Devon McCormick, CFA
> > > > > >
> > > > > > Quantitative Consultant
