Repasting it with a small fix and without my comments.

expat_parse_file=: 3 : 0
'' expat_parse_file y
:
expat_init x
parser=. XML_ParserCreate <<0
f=. [: 15!:13 (IFWIN#'+') , ' x' $~ +:@>:
XML_SetElementHandler parser, (f 3), (f 2)
XML_SetCharacterDataHandler parser, (f 4)
size=. 1!:4 y [ pos=. 0 [ inc=. 2^24
while. pos < size do.
  if. (size-pos+inc) < 0 do.
    block=. 1!:11 (>y); pos, size-pos
    end_flag=. XML_TRUE
  else.
    block=. 1!:11 (>y); pos, inc
    end_flag=. XML_FALSE
  end.
  if. XML_STATUS_ERROR = XML_Parse parser; block; (PARLEN=: #block); end_flag do.
    err=. memr 0 _1 2,~ XML_ErrorString XML_GetErrorCode parser
    lncol=. (XML_GetCurrentLineNumber parser), XML_GetCurrentColumnNumber parser
    XML_ParserFree parser
  end.
  pos=. pos+inc
end.
XML_ParserFree parser
expat_parse_xmlx''
)

On Thu, 9 Sep 2021 at 12:36, Mariusz Grasko <[email protected]> wrote:

> Raul,
>
> Thank you very much for helping me out! I have managed to tweak expat.ijs
> so it now works stream-parsing style. It took me a long time to realise
> that api/expat/expat.ijs in the addons directory was constantly being
> overwritten by J with the default version. So for now I just load my
> Expat version with 0!:1 at the start of the program; I will read up more
> on the load and require verbs later.
> Anyway, now I have this (I also have a version that takes the block size
> as an argument so the user of the verb can tweak it: 'path inc' =. y):
>
> expat_parse_file=: 3 : 0
> '' expat_parse_file y
> :
> expat_init x
> parser=. XML_ParserCreate <<0
> f=. [: 15!:13 (IFWIN#'+') , ' x' $~ +:@>:
> XML_SetElementHandler parser, (f 3), (f 2)
> XML_SetCharacterDataHandler parser, (f 4)
>
> NB. Here my changes start:
> NB. filesize, position for indexed read, index incrementor:
> size =. 1!:4 y [ pos=. 0 [ inc=. 2^24
> while. pos < size do.
>   if. (size-pos+inc) < 0 do. NB. End of file, just read the end and set flag to XML_TRUE
>     block=. 1!:11 (>y); pos, size-pos
>     end_flag=. XML_TRUE
>   else. NB. Before end, read next block
>     block=. 1!:11 (>y); pos, inc
>     end_flag=. XML_FALSE
>   end.
>   if. XML_STATUS_ERROR = XML_Parse parser; block; (PARLEN=: #block); XML_TRUE do.
>     err=. memr 0 _1 2,~ XML_ErrorString XML_GetErrorCode parser
>     lncol=. (XML_GetCurrentLineNumber parser), XML_GetCurrentColumnNumber parser
>     XML_ParserFree parser
>   end.
>   pos=. pos+inc
> end.
> XML_ParserFree parser
> expat_parse_xmlx''
> )
>
> Please let me know if you think this is acceptable; I can make a pull
> request.
>
> Best regards,
> M.G.
>
> On Wed, 8 Sep 2021 at 12:32, bill lam <[email protected]> wrote:
>
>> IIRC expat uses the push model to get elements and should be memory
>> efficient. Please try the approach suggested by Raul to see if there
>> will be any improvement.
>>
>> On Tue, Sep 7, 2021, 7:57 PM Mariusz Grasko <[email protected]>
>> wrote:
>>
>> > So after some break and experimenting with other stuff I have
>> > revisited api/expat. It seems that expat, the C library for
>> > stream-oriented XML parsing, should not load the whole document into
>> > RAM; it should process as it goes, yielding tokens such as
>> > StartElement, character data, attributes etc. In this J library that
>> > seems not to be the case, unless I am misunderstanding how to use it
>> > (which is the most likely explanation).
>> >
>> > This is my version of a dumb parser that should do nothing, just pass
>> > through a file without capturing any data in variables:
>> >
>> > NB. PROGRAM START
>> > require 'api/expat'
>> > coinsert 'jexpat'
>> >
>> > expat_initx=: 3 : 0
>> > id_offset=: y
>> > elements=: 0 0$''
>> > idnames=: 0$<''
>> > parents=: 0$0
>> > )
>> >
>> > expat_start_elementx=: 4 : 0
>> > 'elm att val'=. x
>> > smoutput 7!:0''
>> > EMPTY
>> > )
>> >
>> > expat_end_elementx=: 3 : 0
>> > EMPTY
>> > )
>> >
>> > expat_parse_xmlx=: 3 : 0
>> > EMPTY
>> > )
>> >
>> > 1 expat_parse_xml 1!:1 <
>> > 'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'
>> >
>> > smoutput 'FINISH'
>> > NB. PROGRAM END
>> >
>> > I added smoutput 7!:0'' to confirm the memory usage increase. I
>> > already knew about it because my large file would eventually crash
>> > Windows. Small files take a small amount of memory and larger files
>> > take more (not sure if it is a linear relationship; I haven't tested
>> > it). Suffice it to say that a 296MB file would eventually push RAM
>> > usage over 6GB.
>> >
>> > My guesses as someone who is just starting J:
>> >
>> > 1) The line 'elm att val'=. x is responsible for this memory spike;
>> > maybe there is a way to free those variables before parsing the next
>> > start element?
>> > 2) There is some accumulation happening behind the curtains, maybe
>> > for buffering and speeding things up. Is there a way to limit it or
>> > reclaim memory?
>> > 3) My dumb parser is not actually just running through the file; it
>> > does something I don't understand that causes it to capture data.
>> > 4) This implementation of J expat is not a stream parser; it merely
>> > utilizes Expat but loads the whole document into memory beforehand.
>> > (But then why does the memory increase as each next element is read,
>> > as confirmed by smoutput 7!:0''?)
>> >
>> > Best regards,
>> >
>> > M.G.
>> >
>> > On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs
>> > <[email protected]> wrote:
>> >
>> > > As no one has mentioned it before: memory-mapped files also work
>> > > very well for big data files.
>> > >
>> > > See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files)
>> > > and "doc_jmf_ [ load'jmf'" in a J session for a short summary.
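The push model described in the expat messages above can be sketched with Python's xml.parsers.expat, which wraps the same C library: callbacks fire as the document is fed in chunks, and an is-final flag plays the role of end_flag in the verb at the top of this thread. The sample document and chunk size here are illustrative only.

```python
import xml.parsers.expat

events = []
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: events.append(('start', name))
parser.EndElementHandler = lambda name: events.append(('end', name))
parser.CharacterDataHandler = lambda data: events.append(('text', data))

doc = b'<root><item id="1">hello</item></root>'
chunk_size = 8  # tiny on purpose, to show that chunk boundaries are handled
for i in range(0, len(doc), chunk_size):
    is_final = i + chunk_size >= len(doc)
    parser.Parse(doc[i:i + chunk_size], is_final)  # is_final ~ end_flag

print(events)
```

Note that character data may arrive in several callbacks when it straddles a chunk boundary, so handlers should accumulate text rather than assume one call per text node.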
>> > > You can map any file as character data with
>> > >
>> > > JCHAR map_jmf_ 'var';'filename.csv'
>> > >
>> > > Now the variable 'var' will look as if it contains all your data.
>> > > Two warnings, though:
>> > > 1) Mind the mode: the default mode for mapping a file is RW, so
>> > > changes to the variable are immediately written to disk. You can
>> > > use MTRO_jmf_ as the mode to avoid this kind of mistake.
>> > > 2) Watch out for operations that might copy a big chunk of the
>> > > variable (e.g. var2=: var or var2=: }. var), which could make your
>> > > program, or others, crash due to memory exhaustion.
>> > > Things like this do work well, though:
>> > > LF +/@:= var NB. number of LFs, i.e. line count
>> > >
>> > > I once used them to grok through some 4GB CSVs to split them into
>> > > records, which went quite satisfactorily. To make matters easier
>> > > for myself afterwards, I wrote them out sequentially, recording in
>> > > a normal array a list of the starting-point and length pairs of
>> > > each record, for easy indexing into the files when mapped again
>> > > using map_jmf_. E.g. get record n using:
>> > > recordsJMF {~ (+i.)/ n { start_len
>> > > or even:
>> > > recordsJMF ({:@] {. {.@] }. [) n { start_len
>> > > In recent versions of J the latter is far more efficient for long
>> > > lengths, due to the use of virtual nouns (see
>> > > https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns
>> > > ).
>> > >
>> > > Of course, what makes sense for you probably depends on your
>> > > situation, e.g. whether the data is going to change, how many times
>> > > you intend to use the same data, ...
>> > >
>> > > Best regards,
>> > > Jan-Pieter
>> > >
>> > > On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi Mariuz,
>> > > > A while back I wrote some adverbs to apply a verb across a file
>> > > > in pieces:
>> > > > https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs
>> > > >
>> > > > The simplest, most general one is "doSomethingSimple". It applies
>> > > > the supplied verb to successive chunks of the file and allows
>> > > > work already done to be passed on to the next iteration.
>> > > >
>> > > > NB.* doSomethingSimple: apply verb to file making minimal
>> > > > NB. assumptions about file structure.
>> > > > doSomethingSimple=: 1 : 0
>> > > > 'curptr chsz max flnm passedOn'=. 5{.y
>> > > > if. curptr>:max do. ch=. curptr;chsz;max;flnm
>> > > > else. ch=. readChunk curptr;chsz;max;flnm
>> > > >   passedOn=. u (_1{ch),<passedOn NB. Allow u's work to be passed on to next invocation
>> > > > end.
>> > > > (4{.ch),<passedOn
>> > > > NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<'' NB. Return unique characters in file.
>> > > > )
>> > > >
>> > > > The sub-function "readChunk" looks like this:
>> > > >
>> > > > readChunk=: 3 : 0
>> > > > 'curptr chsz max flnm'=. 4{.y
>> > > > if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
>> > > > else. chunk=. '' end.
>> > > > (curptr+chsz2);chsz2;max;flnm;chunk
>> > > > NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
>> > > > )
>> > > >
>> > > > Another adverb, "doSomething", is similar but assumes you have
>> > > > something like line delimiters and you only want to process
>> > > > complete lines each time through.
>> > > >
>> > > > If you get a chance to take a look at these, please let me know
>> > > > what you think.
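Devon's doSomethingSimple can be sketched in Python under the same assumptions: apply a function to successive fixed-size chunks of a file, threading the "passedOn" state through each call. The function, file, and names below are illustrative, not part of the J addon.

```python
import os
import tempfile

def do_something_simple(fn, path, chunk_size, state):
    """Apply fn to each chunk of a file, carrying state to the next call."""
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            state = fn(chunk, state)
    return state

# Usage matching the NB.EG comment: collect the unique characters in a file.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'abracadabra')
    path = tf.name

uniq = do_something_simple(lambda ch, st: st | set(ch), path, 4, set())
os.unlink(path)
print(bytes(sorted(uniq)))  # b'abcdr'
```

Because only one chunk is resident at a time, peak memory stays at roughly the chunk size plus whatever the supplied function accumulates.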
>> > > > Good luck,
>> > > >
>> > > > Devon
>> > > >
>> > > > On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Mariuz,
>> > > > >
>> > > > > I've used the following adverb (see below) to process 4-gig
>> > > > > CSVs. Basically it works through the file in byte chunks. As
>> > > > > the J forum email tends to wreck embedded code, you can see how
>> > > > > this adverb is used in the database ETL system that uses it
>> > > > > here:
>> > > > >
>> > > > > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
>> > > > >
>> > > > > You might also find this amusing:
>> > > > >
>> > > > > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
>> > > > >
>> > > > > ireadapply=: 1 : 0
>> > > > > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
>> > > > > NB.
>> > > > > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
>> > > > > NB.
>> > > > > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
>> > > > > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
>> > > > > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
>> > > > >
>> > > > > NB. file in, file out, line delimiter, block size, (u) verb data
>> > > > > 'fi fo d k ud'=. y
>> > > > >
>> > > > > p=. 0 NB. file pointer
>> > > > > c=. 0 NB. block count
>> > > > > s=. fsize fi NB. file bytes
>> > > > > k=. k<.s NB. first block size
>> > > > > NB.debug. b=. i.0 NB. block sizes (chk)
>> > > > >
>> > > > > while. p < s do.
>> > > > >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
>> > > > >   c=. >:c NB. block count
>> > > > >   NB. complete lines
>> > > > >   if. 0 = #l=. d beforelaststr r do.
>> > > > >     NB. final shard
>> > > > >     NB.debug. b=. b,#r
>> > > > >     u c;1;d;fo;r;<ud break.
>> > > > >   end.
>> > > > >   p=. p + #l NB. inc file pointer
>> > > > >   k=. k <. s - p NB. next block size
>> > > > >   NB.debug. b=. b,#l NB. block sizes list
>> > > > >   NB. block number, shard, delimiter, file out, line bytes, (u) data
>> > > > >   u c;0;d;fo;l;<ud
>> > > > > end.
>> > > > >
>> > > > > NB.debug. 'byte mismatch' assert s = +/b
>> > > > > c NB. blocks processed
>> > > > > )
>> > > > >
>> > > > > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > 1. As you have noticed, certainly. There are details, of
>> > > > > > course (what block size to use? Are files guaranteed to be
>> > > > > > well formed? If not, what are the error conditions? Are
>> > > > > > certain characters illegal? Are lines longer than the block
>> > > > > > size allowed? Do you want a callback interface for each
>> > > > > > block? If so, do you need an "end of file" indication? If so,
>> > > > > > is that a separate callback or a distinct argument to the
>> > > > > > block callback? etc.)
>> > > > > >
>> > > > > > 2. Again, as you have noticed: yes. And there are analogous
>> > > > > > details here...
>> > > > > >
>> > > > > > 3. The expat API should only require J knowledge. There are a
>> > > > > > couple of examples in the addons/api/expat/test/ directory,
>> > > > > > named test0.ijs and test1.ijs.
>> > > > > >
>> > > > > > I hope this helps,
>> > > > > >
>> > > > > > --
>> > > > > > Raul
>> > > > > >
>> > > > > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
>> > > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > Thank you for some ideas on using an external parser.
>> > > > > > > Okay, now I have 3 questions:
>> > > > > > > 1. Is it possible to read a CSV file streaming-style (for
>> > > > > > > example record by record) without loading everything into
>> > > > > > > memory? Even if I use some external parsing solution like
>> > > > > > > XSLT, or just write something myself in some language other
>> > > > > > > than J, I will end up with a large CSV instead of a large
>> > > > > > > XML. It makes no difference. The reason I need to parse it
>> > > > > > > like this is that there are some rows that I won't need;
>> > > > > > > those would be discarded depending on their field values.
>> > > > > > > If it is not possible, I would do more work outside of J in
>> > > > > > > this first XML -> CSV parser.
>> > > > > > > 2. Is there a way to call an external program from a J
>> > > > > > > script? If so, is it possible to wait for it to finish?
>> > > > > > > If it is not possible, there are definitely ways to run J
>> > > > > > > from other programs.
>> > > > > > > 3. Can someone give a little bit of a pointer on how to use
>> > > > > > > the api/expat library? Do I need to familiarize myself with
>> > > > > > > expat (the C library), or is a good understanding of J plus
>> > > > > > > reading the small tests in the package directory enough?
>> > > > > > > I could send some example file like Devon McCormick
>> > > > > > > suggested.
>> > > > > > >
>> > > > > > > Right now I am working through the book "J: The natural
>> > > > > > > language for analytic computing" and playing around with
>> > > > > > > problems like Project Euler, but I could really see myself
>> > > > > > > using J in serious work.
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > > MG
>> > > > > > >
>> > > > > > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > In similar situations (but my files are not huge) I
>> > > > > > > > extract what I want into flattened CSV using one or more
>> > > > > > > > XQuery scripts, and then load the CSV files with J. The
>> > > > > > > > code is clean, compact and easy to maintain. For
>> > > > > > > > recurrent XQuery patterns, m4 occasionally comes to the
>> > > > > > > > rescue. Expect minor portability issues when using
>> > > > > > > > different XQuery processors (extensions, language
>> > > > > > > > level...).
>> > > > > > > >
>> > > > > > > > Never got round to SAX parsing beyond tutorials, so I
>> > > > > > > > cannot compare.
>> > > > > > > >
>> > > > > > > > From: Mariusz Grasko <[email protected]>
>> > > > > > > > To: [email protected]
>> > > > > > > > Subject: [Jprogramming] Is it a good idea to use J for
>> > > > > > > > reading large XML files?
>> > > > > > > > Date: 10/08/2021 18:05:45 Europe/Paris
>> > > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > We are an ecommerce company and have a lot of
>> > > > > > > > integrations with suppliers; product info is nearly
>> > > > > > > > always in XML files. I am thinking about using J as an
>> > > > > > > > analysis tool. Do you think that working with large files
>> > > > > > > > that need to be parsed SAX-style, without reading
>> > > > > > > > everything at once, is a good idea in J? Also, is this
>> > > > > > > > even advantageous (as in, would the code be terse)?
>> > > > > > > > Right now XML parsing is done in Golang, so if parsing
>> > > > > > > > in J is not very good we could try to rely more on CSV
>> > > > > > > > exports. CSV handling is definitely very good in J.
>> > > > > > > > I am hoping that XML parsing is also very good in J and
>> > > > > > > > that the code would become much smaller; if that is the
>> > > > > > > > case, then I would think about using J for XMLs with new
>> > > > > > > > suppliers.
>> > > > > > > >
>> > > > > > > > Best Regards
>> > > > > > > > M.G.
>> > > > >
>> > > > > --
>> > > > > John D. Baker
>> > > > > [email protected]
>> > > >
>> > > > --
>> > > > Devon McCormick, CFA
>> > > >
>> > > > Quantitative Consultant
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
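John Baker's ireadapply above trims each block back to the last line delimiter so the verb only ever sees complete lines, carrying the trailing partial line into the next read. That complete-lines technique can be sketched in Python; the function name and sample data are illustrative.

```python
import os
import tempfile

def line_blocks(path, block_size, delim=b'\n'):
    """Yield blocks of whole lines; the final shard may lack a delimiter."""
    with open(path, 'rb') as f:
        carry = b''
        while block := carry + f.read(block_size):
            cut = block.rfind(delim)
            if cut == -1:  # final shard, or a line longer than the block
                yield block
                carry = b''
            else:
                yield block[:cut + 1]   # complete lines only
                carry = block[cut + 1:]  # partial line joins the next block

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'one\ntwo\nthree\nfour')
    path = tf.name

blocks = list(line_blocks(path, 6))
os.unlink(path)
print(blocks)  # [b'one\n', b'two\n', b'three\n', b'four']
```

Concatenating the yielded blocks reproduces the file exactly, which mirrors the byte-mismatch assertion in the debug lines of the J adverb.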

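Jan-Pieter's memory-mapped-file approach has a close analogue in Python's mmap module: the file appears as a bytes-like object, pages are faulted in on demand, and mapping with ACCESS_READ avoids the accidental-write hazard he warns about (much as MTRO_jmf_ does). A sketch with an illustrative temp file:

```python
import mmap
import os
import tempfile

# Write a small CSV to map; in practice this would be an existing big file.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'a,b\n1,2\n3,4\n')
    path = tf.name

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Count lines, like LF +/@:= var in J, without reading the whole
        # file into a Python object: scan via find on the mapping itself.
        n = 0
        pos = m.find(b'\n')
        while pos != -1:
            n += 1
            pos = m.find(b'\n', pos + 1)

os.unlink(path)
print(n)  # 3
```

Slicing a mapping (m[a:b]) copies those bytes into a real bytes object, so the same warning applies as for var2=: }. var in J: keep slices small.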