Which xml utilities or library? I thought J had stopped supporting xml implementations?
--
Raul

On Sun, Jun 11, 2017 at 10:24 PM, bill lam <[email protected]> wrote:
> your regex is not an exact duplicate of your J. In your J, opn and cls are
> searched separately and allows malformed tags, whereas your regex searches
> opn and cls pair together. That said, I would try xml utilities or library
> if the size of xml data is large.
>
> On 11 Jun, 2017 11:17 pm, "Danil Osipchuk" <[email protected]> wrote:
>
> Frankly I even didn't consider to use regexp for parsing of production data
> (gigabytes/millions of records). Regexp must have a plenty of overhead to
> pass back and forth between the library and j engine, not to mention the
> whole complex logic supporting its complex semantics. This is all compared
> to a simple I.@E. which I believe to be highly polished. I'm not sure
> regexp will work for GB files at all.
>
> I've just tried regexp on a toy chunk of data I have at hand, indeed it is
> a lot slower.
>
>    dir''
> ...
> jqt.sh 126 06-Jan-17 07:25:22
> mi-1b.xml 222153 05-Mar-17 22:34:10
> ...
>    6!:2 'a=: (<''transmittedOctets'') #xmlTagDo fread''mi-1b.xml'' '
> 0.000931
>    rx=: rxcomp '\<transmittedOctets\>(.*)\</transmittedOctets\>'
>    6!:2 'b =: #@> (rx;,1) rxall fread ''mi-1b.xml'' '
> 0.064589
>    a-:b
> 1
>    6!:2 'b =: (rx;,1) #rxapply fread ''mi-1b.xml'' '
> |domain error
> | ;,({."1 p),.(#p){.(#m)$x
>
> 2017-06-11 17:17 GMT+03:00 'Pascal Jasmin' via Programming <[email protected]>:
>
>> emit empty word is strongly needed IMO.
>>
>> I will probably write a mini-regex implementation for ;: (with *?+
>> support) suitable for tag extraction that should be faster than regex.
>>
>> But the basics of that are to start word after opening string, use ev (no
>> start) at start of closing string, but then ew with no start when end of
>> closing string.
>>
>> But have you confirmed that regex is too slow for tag extraction?
>>
>> ________________________________
>> From: Danil Osipchuk <[email protected]>
>> To: Programming forum <[email protected]>
>> Sent: Sunday, June 11, 2017 9:59 AM
>> Subject: Re: [Jprogramming] Apply at start/lengths pairs
>>
>> I could not find a cutP definition with a quick look, but from your example
>> it seems like you mean a character separator by token. It is not general
>> enough.
>>
>> Also, imagine a fluffy xml file, with millions of records, where only a
>> minority of fields of different type in records are interesting, some nodes
>> have missing fields of the type you are interested in.
>> Parsing the whole file is plainly unfeasible because of performance and
>> complexity of the resulting code.
>>
>> Applying at selected positions obtained and reshaped by whatever means
>> however works rather well and is easy to reason about.
>>
>> As about fsm, I do remember that I found ;: inconvenient when I was trying
>> to apply it - and one issue was that there is no way to emit an empty word.
>> The other was that you have to have every possible value of input domain
>> represented as a row. When the domain are characters the mapping is
>> manageable, for everything else - not so much. As a vague idea, if there
>> was a way to condense the input domain through a verb, possibly a dyad to
>> pass an additional state, it would considerably expand the use of ;: with
>> some performance hit of course.
>> Also the code utilizing ;: is pretty much unreadable even by J standards.
>> That was my initial impressions about it.
>>
>> 2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <[email protected]>:
>>
>> > A more general procedure than your request is to cut your data such that
>> > your start/end segments are in odd positions
>> >
>> > in jpp, https://github.com/Pascal-J/jpp
>> >
>> > cutP is a process for cutting on start and end tokens, though there are
>> > faster methods in included fsm.ijs file. And that process could get
>> > significant boost if ;: were enhanced to support emitting empty boxes, but:
>> >
>> >    cutP '(asdf)g()'
>> > ++----+-+++
>> > ||asdf|g|||
>> > ++----+-+++
>> >
>> > cutP is dyadic for start and end tokens other than '()'.
>> >
>> > also from jpp, the AltM adverb takes a gerund to apply cyclically to such
>> > an above cut structure.
>> >
>> > a:"_`u AltM would produce empties for non-odd positions.
>> >
>> > But if you only care about the selections, then either regex, or a ;:
>> > definition can extract them.
>> >
>> > ________________________________
>> > From: Danil Osipchuk <[email protected]>
>> > To: Programming forum <[email protected]>
>> > Sent: Sunday, June 11, 2017 7:19 AM
>> > Subject: [Jprogramming] Apply at start/lengths pairs
>> >
>> > Hi all,
>> >
>> > I wonder if there is an idiomatic way to apply a verb using an array of
>> > start and length pairs. This is a recurring pattern when extracting data
>> > from files.
>> > I've tried 3 adverbs (the example at the end), and the first one is
>> > slightly better on big files, but I'm still looking for possible
>> > improvements (the need is to extract selected fields from multi-gigabyte
>> > memory mapped csv/xml files)
>> >
>> >    'ab' xmlTagContentSL XML
>> > 11 1
>> > 22 2
>> > 34 3
>> > 47 4
>> >
>> >    'ab' <xmlTagDo XML
>> > +-+--+---+----+
>> > |1|20|300|4000|
>> > +-+--+---+----+
>> >
>> >    (2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
>> > +---+----+
>> > |1  |20  |
>> > +---+----+
>> > |300|4000|
>> > +---+----+
>> >
>> > regards,
>> > Danil
>> >
>> > doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
>> > NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
>> > NB. doSL =: 1 : '(u;.0~ ,.)~"1'
>> >
>> > xmlTagOpn =: '<' ,'>',~]
>> > xmlTagCls =: '</','>',~]
>> >
>> > xmlTagContentSL =: 4 : 0
>> > CS =. (xmlTagOpn >x) (#@[ + I.@E.) y
>> > CE =. (xmlTagCls >x) I.@E. y
>> > CS ,. CE-CS
>> > )
>> >
>> > xmlTagDo =: 1 : '(xmlTagContentSL (u doSL) ])f.'
>> >
>> > XML =: 0 : 0
>> > <data>
>> > <ab>1</ab>
>> > <ab>20</ab>
>> > <ab>300</ab>
>> > <ab>4000</ab>
>> > </data>
>> > )
>> >
>> > ----------------------------------------------------------------------
>> > For information about J forums see http://www.jsoftware.com/forums.htm
