Repasting it with a small fix and without my comments.

expat_parse_file=: 3 : 0
'' expat_parse_file y
:
expat_init x
parser=. XML_ParserCreate <<0
f=. [: 15!:13 (IFWIN#'+') , ' x' $~ +:@>:
XML_SetElementHandler parser, (f 3), (f 2)
XML_SetCharacterDataHandler parser, (f 4)
size=. 1!:4 y [ pos=. 0 [ inc=. 2^24
while. pos < size do.
  if. (size-pos+inc) < 0 do.
    block=. 1!:11 (>y); pos, size-pos
    end_flag=. XML_TRUE
  else.
    block=. 1!:11 (>y); pos, inc
    end_flag=. XML_FALSE
  end.
  if. XML_STATUS_ERROR = XML_Parse parser; block; (PARLEN=: #block); end_flag do.
    err=. memr 0 _1 2,~ XML_ErrorString XML_GetErrorCode parser
    lncol=. (XML_GetCurrentLineNumber parser), XML_GetCurrentColumnNumber parser
    XML_ParserFree parser
  end.
  pos=. pos+inc
end.
XML_ParserFree parser
expat_parse_xmlx''
)

On Thu, 9 Sep 2021 at 12:36, Mariusz Grasko <[email protected]> wrote:

> Raul,
>
> Thank you very much for helping me out! I have managed to tweak expat.ijs
> so it now works stream-parsing style. It took me a long time to realise
> that api/expat/expat.ijs in the addons directory was constantly being
> overwritten by J with the default version. So for now I just load my
> Expat version with 0!:1 at the start of the program; I will read up more
> on the load and require verbs later.
> Anyway, now I have this (I also have a version that takes the block size
> as an argument so the user of the verb can tweak it: 'path inc' =. y):
>
> expat_parse_file=: 3 : 0
> '' expat_parse_file y
> :
> expat_init x
> parser=. XML_ParserCreate <<0
> f=. [: 15!:13 (IFWIN#'+') , ' x' $~ +:@>:
> XML_SetElementHandler parser, (f 3), (f 2)
> XML_SetCharacterDataHandler parser, (f 4)
>
> NB. Here my changes start:
> NB. filesize, position for indexed read, index incrementor:
> size =. 1!:4 y [ pos=. 0 [ inc=. 2^24
> while. pos < size do.
>   if. (size-pos+inc) < 0 do. NB. End of file, just read the end and set flag to XML_TRUE
>     block=. 1!:11 (>y); pos, size-pos
>     end_flag=. XML_TRUE
>   else. NB. Before end, read next block
>     block=. 1!:11 (>y); pos, inc
>     end_flag=. XML_FALSE
>   end.
>   if. XML_STATUS_ERROR = XML_Parse parser; block; (PARLEN=: #block); XML_TRUE do.
>     err=. memr 0 _1 2,~ XML_ErrorString XML_GetErrorCode parser
>     lncol=. (XML_GetCurrentLineNumber parser), XML_GetCurrentColumnNumber parser
>     XML_ParserFree parser
>   end.
>   pos=. pos+inc
> end.
> XML_ParserFree parser
> expat_parse_xmlx''
> )
>
> Please let me know if you think this is acceptable; I can make a pull
> request.
>
> Best regards,
> M.G.
>
> On Wed, 8 Sep 2021 at 12:32, bill lam <[email protected]> wrote:
>
>> IIRC expat uses the push model to get elements and should be memory
>> efficient. Please try the approach suggested by Raul to see if there
>> will be any improvement.
>>
>> On Tue, Sep 7, 2021, 7:57 PM Mariusz Grasko <[email protected]>
>> wrote:
>>
>> > So after some break and experimenting with other stuff I have
>> > revisited api/expat. It seems that expat, the C library for
>> > stream-oriented XML parsing, should not load the whole document into
>> > RAM; it should process as it goes, yielding tokens such as
>> > StartElement, character data, attributes etc. In this J library that
>> > seems not to be the case, unless I am misunderstanding how to use it
>> > (which is the most likely explanation).
>> >
>> > This is my version of a dumb parser that should do nothing, just pass
>> > through a file without capturing any data in variables:
>> >
>> > NB. PROGRAM START
>> > require 'api/expat'
>> > coinsert 'jexpat'
>> >
>> > expat_initx=: 3 : 0
>> > id_offset=: y
>> > elements=: 0 0$''
>> > idnames=: 0$<''
>> > parents=: 0$0
>> > )
>> >
>> > expat_start_elementx=: 4 : 0
>> > 'elm att val'=. x
>> > smoutput 7!:0''
>> > EMPTY
>> > )
>> >
>> > expat_end_elementx=: 3 : 0
>> > EMPTY
>> > )
>> >
>> > expat_parse_xmlx=: 3 : 0
>> > EMPTY
>> > )
>> >
>> > 1 expat_parse_xml 1!:1 <
>> > 'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'
>> >
>> > smoutput 'FINISH'
>> > NB. PROGRAM END
>> >
>> > I added smoutput 7!:0'' to confirm the memory usage increase. I
>> > already knew about it because my large file would eventually crash
>> > Windows. Small files take a small amount of memory and larger files
>> > take more (not sure if it is a linear relationship; I haven't tested
>> > it). Suffice it to say that a 296MB file would eventually push RAM
>> > usage over 6GB.
>> >
>> > My guesses as someone who is just starting J:
>> >
>> > 1) The line 'elm att val'=. x is responsible for this memory spike;
>> > maybe there is a way to free those variables before parsing the next
>> > start element?
>> > 2) There is some accumulation happening behind the curtains, maybe
>> > for buffering and speeding things up. Is there a way to limit it or
>> > reclaim memory?
>> > 3) My dumb parser is not actually just running through the file; it
>> > does something I don't understand that causes it to capture data.
>> > 4) This implementation of J expat is not a stream parser; it merely
>> > utilizes Expat but loads the whole document into memory beforehand.
>> > (But then why does the memory increase as each next element is read,
>> > as confirmed by smoutput 7!:0''?)
>> >
>> > Best regards,
>> >
>> > M.G.
>> >
>> > On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs
>> > <[email protected]> wrote:
>> >
>> > > As no one has mentioned it before: memory-mapped files also work
>> > > very well for big data files.
>> > >
>> > > See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files)
>> > > and "doc_jmf_ [ load'jmf'" in a J session for a short summary.
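The push model described in the expat messages above can be sketched with Python's xml.parsers.expat, which wraps the same C library: callbacks fire as the document is fed in chunks, and an is-final flag plays the role of end_flag in the verb at the top of this thread. The sample document and chunk size here are illustrative only.

```python
import xml.parsers.expat

events = []
parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = lambda name, attrs: events.append(('start', name))
parser.EndElementHandler = lambda name: events.append(('end', name))
parser.CharacterDataHandler = lambda data: events.append(('text', data))

doc = b'<root><item id="1">hello</item></root>'
chunk_size = 8  # tiny on purpose, to show that chunk boundaries are handled
for i in range(0, len(doc), chunk_size):
    is_final = i + chunk_size >= len(doc)
    parser.Parse(doc[i:i + chunk_size], is_final)  # is_final ~ end_flag

print(events)
```

Note that character data may arrive in several callbacks when it straddles a chunk boundary, so handlers should accumulate text rather than assume one call per text node.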
>> > > You can map any file as character data with
>> > >
>> > > JCHAR map_jmf_ 'var';'filename.csv'
>> > >
>> > > Now the variable 'var' will look as if it contains all your data.
>> > > Two warnings, though:
>> > > 1) Mind the mode: the default mode for mapping a file is RW, so
>> > > changes to the variable are immediately written to disk. You can
>> > > use MTRO_jmf_ as the mode to avoid this kind of mistake.
>> > > 2) Watch out for operations that might copy a big chunk of the
>> > > variable (e.g. var2=: var or var2=: }. var), which could make your
>> > > program, or others, crash due to memory exhaustion.
>> > > Things like this do work well, though:
>> > > LF +/@:= var NB. number of LFs, i.e. line count
>> > >
>> > > I once used them to grok through some 4GB CSVs to split them into
>> > > records, which went quite satisfactorily. To make matters easier
>> > > for myself afterwards, I wrote them out sequentially, recording in
>> > > a normal array a list of the starting-point and length pairs of
>> > > each record, for easy indexing into the files when mapped again
>> > > using map_jmf_. E.g. get record n using:
>> > > recordsJMF {~ (+i.)/ n { start_len
>> > > or even:
>> > > recordsJMF ({:@] {. {.@] }. [) n { start_len
>> > > In recent versions of J the latter is far more efficient for long
>> > > lengths, due to the use of virtual nouns (see
>> > > https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns
>> > > ).
>> > >
>> > > Of course, what makes sense for you probably depends on your
>> > > situation, e.g. whether the data is going to change, how many times
>> > > you intend to use the same data, ...
>> > >
>> > > Best regards,
>> > > Jan-Pieter
>> > >
>> > > On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]>
>> > > wrote:
>> > >
>> > > > Hi Mariuz,
>> > > > A while back I wrote some adverbs to apply a verb across a file
>> > > > in pieces:
>> > > > https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs
>> > > >
>> > > > The simplest, most general one is "doSomethingSimple". It applies
>> > > > the supplied verb to successive chunks of the file and allows
>> > > > work already done to be passed on to the next iteration.
>> > > >
>> > > > NB.* doSomethingSimple: apply verb to file making minimal
>> > > > NB. assumptions about file structure.
>> > > > doSomethingSimple=: 1 : 0
>> > > > 'curptr chsz max flnm passedOn'=. 5{.y
>> > > > if. curptr>:max do. ch=. curptr;chsz;max;flnm
>> > > > else. ch=. readChunk curptr;chsz;max;flnm
>> > > >   passedOn=. u (_1{ch),<passedOn NB. Allow u's work to be passed on to next invocation
>> > > > end.
>> > > > (4{.ch),<passedOn
>> > > > NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<'' NB. Return unique characters in file.
>> > > > )
>> > > >
>> > > > The sub-function "readChunk" looks like this:
>> > > >
>> > > > readChunk=: 3 : 0
>> > > > 'curptr chsz max flnm'=. 4{.y
>> > > > if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
>> > > > else. chunk=. '' end.
>> > > > (curptr+chsz2);chsz2;max;flnm;chunk
>> > > > NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
>> > > > )
>> > > >
>> > > > Another adverb, "doSomething", is similar but assumes you have
>> > > > something like line delimiters and you only want to process
>> > > > complete lines each time through.
>> > > >
>> > > > If you get a chance to take a look at these, please let me know
>> > > > what you think.
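Devon's doSomethingSimple can be sketched in Python under the same assumptions: apply a function to successive fixed-size chunks of a file, threading the "passedOn" state through each call. The function, file, and names below are illustrative, not part of the J addon.

```python
import os
import tempfile

def do_something_simple(fn, path, chunk_size, state):
    """Apply fn to each chunk of a file, carrying state to the next call."""
    with open(path, 'rb') as f:
        while chunk := f.read(chunk_size):
            state = fn(chunk, state)
    return state

# Usage matching the NB.EG comment: collect the unique characters in a file.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'abracadabra')
    path = tf.name

uniq = do_something_simple(lambda ch, st: st | set(ch), path, 4, set())
os.unlink(path)
print(bytes(sorted(uniq)))  # b'abcdr'
```

Because only one chunk is resident at a time, peak memory stays at roughly the chunk size plus whatever the supplied function accumulates.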
>> > > > Good luck,
>> > > >
>> > > > Devon
>> > > >
>> > > > On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]>
>> > > > wrote:
>> > > >
>> > > > > Mariuz,
>> > > > >
>> > > > > I've used the following adverb (see below) to process 4-gig
>> > > > > CSVs. Basically it works through the file in byte chunks. As
>> > > > > the J forum email tends to wreck embedded code, you can see how
>> > > > > this adverb is used in the database ETL system that uses it
>> > > > > here:
>> > > > >
>> > > > > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
>> > > > >
>> > > > > You might also find this amusing:
>> > > > >
>> > > > > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
>> > > > >
>> > > > > ireadapply=: 1 : 0
>> > > > > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
>> > > > > NB.
>> > > > > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
>> > > > > NB.
>> > > > > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
>> > > > > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
>> > > > > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
>> > > > >
>> > > > > NB. file in, file out, line delimiter, block size, (u) verb data
>> > > > > 'fi fo d k ud'=. y
>> > > > >
>> > > > > p=. 0 NB. file pointer
>> > > > > c=. 0 NB. block count
>> > > > > s=. fsize fi NB. file bytes
>> > > > > k=. k<.s NB. first block size
>> > > > > NB.debug. b=. i.0 NB. block sizes (chk)
>> > > > >
>> > > > > while. p < s do.
>> > > > >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
>> > > > >   c=. >:c NB. block count
>> > > > >   NB. complete lines
>> > > > >   if. 0 = #l=. d beforelaststr r do.
>> > > > >     NB. final shard
>> > > > >     NB.debug. b=. b,#r
>> > > > >     u c;1;d;fo;r;<ud break.
>> > > > >   end.
>> > > > >   p=. p + #l NB. inc file pointer
>> > > > >   k=. k <. s - p NB. next block size
>> > > > >   NB.debug. b=. b,#l NB. block sizes list
>> > > > >   NB. block number, shard, delimiter, file out, line bytes, (u) data
>> > > > >   u c;0;d;fo;l;<ud
>> > > > > end.
>> > > > >
>> > > > > NB.debug. 'byte mismatch' assert s = +/b
>> > > > > c NB. blocks processed
>> > > > > )
>> > > > >
>> > > > > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller
>> > > > > <[email protected]> wrote:
>> > > > >
>> > > > > > 1. As you have noticed, certainly. There are details, of
>> > > > > > course (what block size to use? Are files guaranteed to be
>> > > > > > well formed? If not, what are the error conditions? Are
>> > > > > > certain characters illegal? Are lines longer than the block
>> > > > > > size allowed? Do you want a callback interface for each
>> > > > > > block? If so, do you need an "end of file" indication? If so,
>> > > > > > is that a separate callback or a distinct argument to the
>> > > > > > block callback? etc.)
>> > > > > >
>> > > > > > 2. Again, as you have noticed: yes. And there are analogous
>> > > > > > details here...
>> > > > > >
>> > > > > > 3. The expat API should only require J knowledge. There are a
>> > > > > > couple of examples in the addons/api/expat/test/ directory,
>> > > > > > named test0.ijs and test1.ijs.
>> > > > > >
>> > > > > > I hope this helps,
>> > > > > >
>> > > > > > --
>> > > > > > Raul
>> > > > > >
>> > > > > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko
>> > > > > > <[email protected]> wrote:
>> > > > > > >
>> > > > > > > Thank you for some ideas on using an external parser.
>> > > > > > > Okay, now I have 3 questions:
>> > > > > > > 1. Is it possible to read a CSV file streaming-style (for
>> > > > > > > example record by record) without loading everything into
>> > > > > > > memory? Even if I use some external parsing solution like
>> > > > > > > XSLT, or just write something myself in some language other
>> > > > > > > than J, I will end up with a large CSV instead of a large
>> > > > > > > XML. It makes no difference. The reason I need to parse it
>> > > > > > > like this is that there are some rows that I won't need;
>> > > > > > > those would be discarded depending on their field values.
>> > > > > > > If it is not possible, I would do more work outside of J in
>> > > > > > > this first XML -> CSV parser.
>> > > > > > > 2. Is there a way to call an external program from a J
>> > > > > > > script? If so, is it possible to wait for it to finish?
>> > > > > > > If it is not possible, there are definitely ways to run J
>> > > > > > > from other programs.
>> > > > > > > 3. Can someone give a little bit of a pointer on how to use
>> > > > > > > the api/expat library? Do I need to familiarize myself with
>> > > > > > > expat (the C library), or is a good understanding of J plus
>> > > > > > > reading the small tests in the package directory enough?
>> > > > > > > I could send some example file like Devon McCormick
>> > > > > > > suggested.
>> > > > > > >
>> > > > > > > Right now I am working through the book "J: The natural
>> > > > > > > language for analytic computing" and playing around with
>> > > > > > > problems like Project Euler, but I could really see myself
>> > > > > > > using J in serious work.
>> > > > > > >
>> > > > > > > Best regards,
>> > > > > > > MG
>> > > > > > >
>> > > > > > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
>> > > > > > >
>> > > > > > > > In similar situations (but my files are not huge) I
>> > > > > > > > extract what I want into flattened CSV using one or more
>> > > > > > > > XQuery scripts, and then load the CSV files with J. The
>> > > > > > > > code is clean, compact and easy to maintain. For
>> > > > > > > > recurrent XQuery patterns, m4 occasionally comes to the
>> > > > > > > > rescue. Expect minor portability issues when using
>> > > > > > > > different XQuery processors (extensions, language
>> > > > > > > > level...).
>> > > > > > > >
>> > > > > > > > Never got round to SAX parsing beyond tutorials, so I
>> > > > > > > > cannot compare.
>> > > > > > > >
>> > > > > > > > From: Mariusz Grasko <[email protected]>
>> > > > > > > > To: [email protected]
>> > > > > > > > Subject: [Jprogramming] Is it a good idea to use J for
>> > > > > > > > reading large XML files?
>> > > > > > > > Date: 10/08/2021 18:05:45 Europe/Paris
>> > > > > > > >
>> > > > > > > > Hi,
>> > > > > > > >
>> > > > > > > > We are an ecommerce company and have a lot of
>> > > > > > > > integrations with suppliers; product info is nearly
>> > > > > > > > always in XML files. I am thinking about using J as an
>> > > > > > > > analysis tool. Do you think that working with large files
>> > > > > > > > that need to be parsed SAX-style, without reading
>> > > > > > > > everything at once, is a good idea in J? Also, is this
>> > > > > > > > even advantageous (as in, would the code be terse)?
>> > > > > > > > Right now XML parsing is done in Golang, so if parsing
>> > > > > > > > in J is not very good we could try to rely more on CSV
>> > > > > > > > exports. CSV handling is definitely very good in J.
>> > > > > > > > I am hoping that XML parsing is also very good in J and
>> > > > > > > > that the code would become much smaller; if that is the
>> > > > > > > > case, then I would think about using J for XMLs with new
>> > > > > > > > suppliers.
>> > > > > > > >
>> > > > > > > > Best Regards
>> > > > > > > > M.G.
>> > > > >
>> > > > > --
>> > > > > John D. Baker
>> > > > > [email protected]
>> > > >
>> > > > --
>> > > > Devon McCormick, CFA
>> > > >
>> > > > Quantitative Consultant
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
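John Baker's ireadapply above trims each block back to the last line delimiter so the verb only ever sees complete lines, carrying the trailing partial line into the next read. That complete-lines technique can be sketched in Python; the function name and sample data are illustrative.

```python
import os
import tempfile

def line_blocks(path, block_size, delim=b'\n'):
    """Yield blocks of whole lines; the final shard may lack a delimiter."""
    with open(path, 'rb') as f:
        carry = b''
        while block := carry + f.read(block_size):
            cut = block.rfind(delim)
            if cut == -1:  # final shard, or a line longer than the block
                yield block
                carry = b''
            else:
                yield block[:cut + 1]   # complete lines only
                carry = block[cut + 1:]  # partial line joins the next block

with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'one\ntwo\nthree\nfour')
    path = tf.name

blocks = list(line_blocks(path, 6))
os.unlink(path)
print(blocks)  # [b'one\n', b'two\n', b'three\n', b'four']
```

Concatenating the yielded blocks reproduces the file exactly, which mirrors the byte-mismatch assertion in the debug lines of the J adverb.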

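Jan-Pieter's memory-mapped-file approach has a close analogue in Python's mmap module: the file appears as a bytes-like object, pages are faulted in on demand, and mapping with ACCESS_READ avoids the accidental-write hazard he warns about (much as MTRO_jmf_ does). A sketch with an illustrative temp file:

```python
import mmap
import os
import tempfile

# Write a small CSV to map; in practice this would be an existing big file.
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b'a,b\n1,2\n3,4\n')
    path = tf.name

with open(path, 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as m:
        # Count lines, like LF +/@:= var in J, without reading the whole
        # file into a Python object: scan via find on the mapping itself.
        n = 0
        pos = m.find(b'\n')
        while pos != -1:
            n += 1
            pos = m.find(b'\n', pos + 1)

os.unlink(path)
print(n)  # 3
```

Slicing a mapping (m[a:b]) copies those bytes into a real bytes object, so the same warning applies as for var2=: }. var in J: keep slices small.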