Thank you for taking the time to look at this. Is something wrong with how the FFI layer that calls expat is written, or could someone just tweak the J part to make it work that way?
I will try to understand how those FFI libraries work. Do I just need to know the C language, or are there other prerequisites?

Best regards,
M.G.

On Wed, 8 Sep 2021 at 02:04, Raul Miller <[email protected]> wrote:
> Hmm....
>
> I think I see at least a part of the issues you are talking about.
>
> Currently, J's api/expat expects you to provide the complete contents of the xml document when you call expat_parse_xml.
>
> For large files, it would make sense to have a similarly structured expat_parse_file which takes a file name argument and parses it a block at a time. (And perhaps this should have an optional block size argument, though maybe a default of one megabyte should work for most contexts.)
>
> The steadily increasing memory use while it's parsing suggests that expat itself might only be freeing its memory after it parses the block of text which it was given.
>
> Thanks,
>
> --
> Raul
>
> On Tue, Sep 7, 2021 at 5:22 PM Mariusz Grasko <[email protected]> wrote:
> >
> > Raul,
> >
> > This large RAM usage happened for all files that I tested; there was nothing particularly special about them. It is mostly product details from warehouses: ID SKU EAN NAME DESC QNT PRICE TAX IMAGE, these types of elements. Some of them are attributes (price, tax, suggested price as attributes in the price element), but this is mostly text data inside elements. Product descriptions usually take up most of the space.
> >
> > I will do some more experiments on this tomorrow and try to measure when exactly RAM usage increases. I am pretty sure it is in 'elm att val'=. x
> >
> > Spike was probably the wrong word; there is a constant increase in RAM usage, as if no memory were freed before parsing finishes. Once it finishes, the memory is freed, or my system crashes if the file is large enough (or I kill the process before the crash).
> > The only assignment I have is this line that I mentioned; it seems to be required to run expat_parse_xml.
> >
> > Best regards,
> > M.G.
> >
> > On Tuesday, 7 September 2021, Raul Miller <[email protected]> wrote:
> > > Looking at the implementation, it's literally setting up the expat parser to parse your file and then responding to your callbacks.
> > >
> > > If 'elm att val'=. x is creating a memory spike for you, that suggests that your test file contains extremely long xml elements. (Not text data, not nested xml elements, but probably extremely long lists of attributes.)
> > >
> > > That memory is freeable when your verb execution instance exits.
> > >
> > > That said, I haven't studied expat closely, and it looks like the implementation expects that expat will free memory provided as callback arguments. If this is not the case, that would cause some issues.
> > >
> > > Thanks,
> > >
> > > --
> > > Raul
> > >
> > > On Tue, Sep 7, 2021 at 7:57 AM Mariusz Grasko <[email protected]> wrote:
> > > >
> > > > So after some break and experimenting with other stuff I have revisited api/expat. It seems that expat, the C library for stream-oriented XML parsing, should not load elements into RAM, only process them as you go, yielding tokens like start elements, chardata, attributes etc. In this J library that seems not to be the case, unless I am misunderstanding how to use it (which is the most likely explanation).
> > > >
> > > > This is my version of a dumb parser that should do nothing, just pass through a file without capturing any data in variables:
> > > >
> > > > NB. PROGRAM START
> > > > require 'api/expat'
> > > > coinsert 'jexpat'
> > > >
> > > > expat_initx=: 3 : 0
> > > >   id_offset=: y
> > > >   elements=: 0 0$''
> > > >   idnames=: 0$<''
> > > >   parents=: 0$0
> > > > )
> > > >
> > > > expat_start_elementx=: 4 : 0
> > > >   'elm att val'=. x
> > > >   smoutput 7!:0''
> > > >   EMPTY
> > > > )
> > > >
> > > > expat_end_elementx=: 3 : 0
> > > >   EMPTY
> > > > )
> > > >
> > > > expat_parse_xmlx=: 3 : 0
> > > >   EMPTY
> > > > )
> > > >
> > > > 1 expat_parse_xml 1!:1 <'C:\Users\mariusz\Documents\Programing\J\parseXML\test.xml'
> > > >
> > > > smoutput 'FINISH'
> > > > NB. PROGRAM END
> > > >
> > > > I have added smoutput 7!:0'' to confirm the memory usage increase. I already knew about it because my large file would eventually crash Windows. Small files take a small amount of memory and larger files take more (not sure if it is a linear relationship, I haven't tested it). Enough to say that a 296MB file would eventually get over 6GB of RAM usage.
> > > >
> > > > My guesses as someone who is just starting J:
> > > >
> > > > 1) 'elm att val'=. x is the line responsible for this memory spike; maybe there is a way to free those variables before parsing the next start element?
> > > >
> > > > 2) There is some accumulation happening behind the curtains, maybe for buffering and speed; is there a way to limit it or reclaim the memory?
> > > >
> > > > 3) My dumb parser is actually not just running through the file, and does something which I don't understand that causes it to capture data.
> > > >
> > > > 4) This implementation of J expat is not a stream parser; it just utilizes expat but loads the whole document into memory beforehand. (But then why does memory usage increase as each next element is read, which is confirmed by smoutput 7!:0''?)
> > > >
> > > > Best regards,
> > > >
> > > > M.G.
> > > >
> > > > On Tue, 17 Aug 2021 at 20:45, Jan-Pieter Jacobs <[email protected]> wrote:
> > > > > Since it has not been mentioned before: memory-mapped files also work very well for big data files.
> > > > >
> > > > > See the "Mapped Files" lab (Help > Studio > Labs > Mapped Files) and "doc_jmf_ [ load'jmf'" in a J session for a short summary.
> > > > >
> > > > > You can map any file as character data with
> > > > >
> > > > > JCHAR map_jmf_ 'var';'filename.csv'
> > > > >
> > > > > Now the variable 'var' will look as if it contains all your data.
> > > > > Two warnings though:
> > > > > 1) Mind the mode: the default mode to map a file is RW, so changes to the variable are immediately written to disk. You can use MTRO_jmf_ as the mode to avoid this kind of mistake.
> > > > > 2) Watch out for operations that might copy a big chunk of the variable (e.g. var2=:var ; var2=: }. var ; ...), which could make your program, or others, crash due to memory exhaustion.
> > > > > Things like this do work well though:
> > > > > LF +/@:= var  NB. number of LF's, i.e. line count
> > > > >
> > > > > I once used them to grok through some 4GB CSVs to split them into records, which went quite satisfactorily. To make matters easier for myself afterwards, I wrote them out sequentially, recording in a normal array a list of the starting point and length pairs of each record, for easy indexing into the files when mapped again using map_jmf_. E.g. get record n using: recordsJMF {~ (+i.)/ n { start_len ; or even: recordsJMF ({:@] {. {.@] }. [) n { start_len.
> > > > > The last is, in recent versions of J, far more efficient for long lengths due to using virtual nouns (see https://code.jsoftware.com/wiki/Vocabulary/SpecialCombinations#Virtual_Nouns).
> > > > >
> > > > > Of course, what makes sense to you probably depends on your situation, e.g. whether the data is going to change, how many times you intend to use the same data, ...
> > > > >
> > > > > Best regards,
> > > > > Jan-Pieter
> > > > >
> > > > > On Tue, 17 Aug 2021 at 18:56, Devon McCormick <[email protected]> wrote:
> > > > > > Hi Mariuz,
> > > > > > A while back I wrote some adverbs to apply a verb across a file in pieces:
> > > > > > https://github.com/DevonMcC/JUtilities/blob/master/workOnLargeFile.ijs
> > > > > >
> > > > > > The simplest, most general one is "doSomethingSimple". It applies the supplied verb to successive chunks of the file and allows work already done to be passed on to the next iteration.
> > > > > >
> > > > > > NB.* doSomethingSimple: apply verb to file making minimal assumptions about file structure.
> > > > > > doSomethingSimple=: 1 : 0
> > > > > >   'curptr chsz max flnm passedOn'=. 5{.y
> > > > > >   if. curptr>:max do. ch=. curptr;chsz;max;flnm
> > > > > >   else. ch=. readChunk curptr;chsz;max;flnm
> > > > > >     passedOn=. u (_1{ch),<passedOn  NB. Allow u's work to be passed on to next invocation
> > > > > >   end.
> > > > > >   (4{.ch),<passedOn
> > > > > > NB.EG ([:~.;) doSomethingSimple ^:_ ] 0x;1e6;(fsize 'bigFile.txt');'bigFile.txt';<''  NB. Return unique characters in file.
> > > > > > )
> > > > > >
> > > > > > The sub-function "readChunk" looks like this:
> > > > > >
> > > > > > readChunk=: 3 : 0
> > > > > >   'curptr chsz max flnm'=. 4{.y
> > > > > >   if. 0<chsz2=. chsz<.0>.max-curptr do. chunk=. fread flnm;curptr,chsz2
> > > > > >   else. chunk=. '' end.
> > > > > >   (curptr+chsz2);chsz2;max;flnm;chunk
> > > > > > NB.EG chunk=. >_1{ch0=. readChunk 0;1e6;(fsize 'bigFile.txt');'bigFile.txt'
> > > > > > )
> > > > > >
> > > > > > Another adverb, "doSomething", is similar but assumes you have something like line delimiters and you only want to process complete lines each time through.
> > > > > >
> > > > > > If you get a chance to take a look at these, please let me know what you think.
> > > > > >
> > > > > > Good luck,
> > > > > >
> > > > > > Devon
> > > > > >
> > > > > > On Tue, Aug 17, 2021 at 12:34 PM John Baker <[email protected]> wrote:
> > > > > > > Mariuz,
> > > > > > >
> > > > > > > I've used the following adverb (see below) to process 4gig CSVs. Basically it works through the file in byte chunks. As the J forum email tends to wreck embedded code, you can see how this adverb is used in the database ETL system that uses it here:
> > > > > > >
> > > > > > > https://bakerjd99.files.wordpress.com/2021/08/swiftprep.pdf
> > > > > > >
> > > > > > > You might also find this amusing:
> > > > > > >
> > > > > > > https://analyzethedatanotthedrivel.org/2021/08/11/jetl-j-extract-transform-and-load/
> > > > > > >
> > > > > > > ireadapply=: 1 : 0
> > > > > > > NB.*ireadapply v-- apply verb (u) to n byte line blocks of static file.
> > > > > > > NB.
> > > > > > > NB. adv: u ireadapply (clFileIn ; clFileOut ; clDel ; iaBlockSize ;< uuData)
> > > > > > > NB.
> > > > > > > NB. fi=. winpathsep ;1 dir SwiftZipCsvDir,'ItemSales-*.csv'
> > > > > > > NB. fo=. SwiftTsvDir,'land_ItemSales.txt'
> > > > > > > NB. smoutput@:(>@{. ; ($&.>)@:}.) ireadapply fi;fo;CRLF;20000000;<''
> > > > > > >
> > > > > > > NB. file in, file out, line delimiter, block size, (u) verb data
> > > > > > > 'fi fo d k ud'=. y
> > > > > > >
> > > > > > > p=. 0          NB. file pointer
> > > > > > > c=. 0          NB. block count
> > > > > > > s=. fsize fi   NB. file bytes
> > > > > > > k=. k<.s       NB. first block size
> > > > > > > NB.debug. b=. i.0  NB. block sizes (chk)
> > > > > > >
> > > > > > > while. p < s do.
> > > > > > >   'iread error' assert -. _1 -: r=. (1!:11 :: _1:) fi;p,k
> > > > > > >   c=. >:c  NB. block count
> > > > > > >   NB. complete lines
> > > > > > >   if. 0 = #l=. d beforelaststr r do.
> > > > > > >     NB. final shard
> > > > > > >     NB.debug. b=. b,#r
> > > > > > >     u c;1;d;fo;r;<ud break.
> > > > > > >   end.
> > > > > > >   p=. p + #l      NB. inc file pointer
> > > > > > >   k=. k <. s - p  NB. next block size
> > > > > > >   NB.debug. b=. b,#l  NB. block sizes list
> > > > > > >   NB. block number, shard, delimiter, file out, line bytes, (u) data
> > > > > > >   u c;0;d;fo;l;<ud
> > > > > > > end.
> > > > > > >
> > > > > > > NB.debug. 'byte mismatch' assert s = +/b
> > > > > > > c  NB. blocks processed
> > > > > > > )
> > > > > > >
> > > > > > > On Mon, Aug 16, 2021 at 7:17 PM Raul Miller <[email protected]> wrote:
> > > > > > > > 1. As you have noticed, certainly. There are details, of course (what block size to use? Are files guaranteed to be well formed? If not, what are the error conditions? Are certain characters illegal? Are lines longer than the block size allowed? Do you want a callback interface for each block? If so, do you need an "end of file" indication? If so, is that a separate callback or a distinct argument to the block callback? etc.)
> > > > > > > >
> > > > > > > > 2. Again, as you have noticed: yes. And there are analogous details here...
> > > > > > > >
> > > > > > > > 3. The expat API should only require J knowledge. There are a couple of examples in the addons/api/expat/test/ directory, named test0.ijs and test1.ijs.
> > > > > > > >
> > > > > > > > I hope this helps,
> > > > > > > >
> > > > > > > > --
> > > > > > > > Raul
> > > > > > > >
> > > > > > > > On Mon, Aug 16, 2021 at 4:23 PM Mariusz Grasko <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Thank you for some ideas on using an external parser.
> > > > > > > > > Okay, now I have 3 questions:
> > > > > > > > > 1. Is it possible to read a CSV file streaming-style (for example, record by record) without loading everything into memory? Even if I use some external parsing solution like XSLT, or just write something myself in some language other than J, I will end up with a large CSV instead of a large XML. It makes no difference. The reason that I need to parse it like this is that there are some rows that I won't need; those would be discarded depending on their field values.
> > > > > > > > > If it is not possible, I would do more work outside of J in this first parser, XML -> CSV.
> > > > > > > > > 2. Is there a way to call an external program from a J script? If so, is it possible to wait for it to finish?
> > > > > > > > > If it is not possible, there are definitely ways to run J from other programs.
> > > > > > > > > 3. Can someone give a bit of a pointer on how to use the api/expat library? Do I need to familiarize myself with expat (the C library), or should a good understanding of J and reading the small tests in the package directory be enough?
> > > > > > > > > I could send some example file, like Devon McCormick suggested.
> > > > > > > > >
> > > > > > > > > Right now I am working through the book "J: The Natural Language for Analytic Computing" and playing around with problems like Project Euler, but I could really see myself using J in serious work.
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > MG
> > > > > > > > >
> > > > > > > > > On Wed, 11 Aug 2021 at 09:51, <[email protected]> wrote:
> > > > > > > > > > In similar situations (but my files are not huge) I extract what I want into flattened CSV using one or more XQuery scripts, and then load the CSV files with J. The code is clean, compact and easy to maintain. For recurrent XQuery patterns, m4 occasionally comes to the rescue. Expect minor portability issues when using different XQuery processors (extensions, language level...).
> > > > > > > > > >
> > > > > > > > > > Never got round to SAX parsing beyond tutorials, so I cannot compare.
> > > > > > > > > >
> > > > > > > > > > From: Mariusz Grasko <[email protected]>
> > > > > > > > > > To: [email protected]
> > > > > > > > > > Subject: [Jprogramming] Is it a good idea to use J for reading large XML files?
> > > > > > > > > > Date: 10/08/2021 18:05:45 Europe/Paris
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > We are an ecommerce company and have a lot of integrations with suppliers; product info is nearly always in XML files. I am thinking about using J as an analysis tool. Do you think that working with large files that need to be parsed SAX-style, without reading everything at once, is a good idea in J? Also, is this even advantageous (as in, would the code be terse)? Right now XML parsing is done in Golang, so if parsing in J is not very good we could try to rely more on CSV exports. CSV is definitely very good in J.
> > > > > > > > > > I am hoping that maybe XML parsing is very good in J and the code would become much smaller; if this is the case, then I would think about using J for XMLs with new suppliers.
> > > > > > > > > >
> > > > > > > > > > Best Regards
> > > > > > > > > > M.G.
> > > > > > > > > > ----------------------------------------------------------------------
> > > > > > > > > > For information about J forums see http://www.jsoftware.com/forums.htm
> > > > > > >
> > > > > > > --
> > > > > > > John D. Baker
> > > > > > > [email protected]
> > > > > >
> > > > > > --
> > > > > > Devon McCormick, CFA
> > > > > >
> > > > > > Quantitative Consultant
