Your regex is not an exact equivalent of your J code. In your J, opn and cls are searched separately, which allows malformed tags, whereas your regex matches the opn and cls pair together. That said, I would try XML utilities or a library if the XML data is large.
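To illustrate the difference, here is a minimal sketch. The two tag verbs and the index expressions are taken from your post; the input BAD is a made-up example with a nested (malformed) <ab> element, so treat it as an assumption, not real data.

```j
NB. Tag builders as defined in the quoted post.
xmlTagOpn =: '<' , '>' ,~ ]
xmlTagCls =: '</' , '>' ,~ ]

NB. Hypothetical malformed input: an <ab> nested inside another <ab>.
BAD =: '<ab>1<ab>2</ab>3</ab>'

NB. Opening and closing tags are located independently of each other.
CS =: (xmlTagOpn 'ab') (#@[ + I.@E.) BAD   NB. content starts: 4 9
CE =: (xmlTagCls 'ab') I.@E. BAD           NB. closing tag positions: 10 16

NB. Pairing them by position gives two overlapping start/length rows.
CS ,. CE - CS                               NB. rows are 4 6 and 9 7
```

The separate searches pair the n-th opening tag with the n-th closing tag without checking nesting, so the two extracted segments overlap ('1<ab>2' and '2</ab>3'). A regex that matches the opening and closing tag as one pattern finds a single match in this input; how far it extends depends on whether the quantifier is greedy.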
On 11 Jun, 2017 11:17 pm, "Danil Osipchuk" <[email protected]> wrote:

Frankly I even didn't consider to use regexp for parsing of production data
(gigabytes/millions of records). Regexp must have a plenty of overhead to
pass back and forth between the library and j engine, not to mention the
whole complex logic supporting its complex semantics. This is all compared
to a simple I.@E. which I believe to be highly polished. I'm not sure regexp
will work for GB files at all.

I've just tried regexp on a toy chunk of data I have at hand, indeed it is a
lot slower.

   dir''
...
jqt.sh          126 06-Jan-17 07:25:22
mi-1b.xml    222153 05-Mar-17 22:34:10
...
   6!:2 'a=: (<''transmittedOctets'') #xmlTagDo fread''mi-1b.xml'' '
0.000931
   rx=: rxcomp '\<transmittedOctets\>(.*)\</transmittedOctets\>'
   6!:2 'b =: #@> (rx;,1) rxall fread ''mi-1b.xml'' '
0.064589
   a-:b
1
   6!:2 'b =: (rx;,1) #rxapply fread ''mi-1b.xml'' '
|domain error
|   ;,({."1 p),.(#p){.(#m)$x

2017-06-11 17:17 GMT+03:00 'Pascal Jasmin' via Programming <[email protected]>:

> emit empty word is strongly needed IMO.
>
> I will probably write a mini-regex implementation for ;: (with *?+
> support) suitable for tag extraction that should be faster than regex.
>
> But the basics of that are to start word after opening string, use ev (no
> start) at start of closing string, but then ew with no start when end of
> closing string.
>
> But have you confirmed that regex is too slow for tag extraction?
>
> ________________________________
> From: Danil Osipchuk <[email protected]>
> To: Programming forum <[email protected]>
> Sent: Sunday, June 11, 2017 9:59 AM
> Subject: Re: [Jprogramming] Apply at start/lengths pairs
>
> I could not find a cutP definition with a quick look, but from your example
> it seems like you mean a character separator by token. It is not general
> enough.
>
> Also, imagine a fluffy xml file, with millions of records, where only a
> minority of fields of different type in records are interesting, some nodes
> have missing fields of the type you are interested in.
> Parsing the whole file is plainly unfeasible because of performance and
> complexity of the resulting code.
>
> Applying at selected positions obtained and reshaped by whatever means
> however works rather well and is easy to reason about.
>
> As about fsm, I do remember that I found ;: inconvenient when I was trying
> to apply it - and one issue was that there is no way to emit an empty word.
> The other was that you have to have every possible value of input domain
> represented as a row. When the domain are characters the mapping is
> manageable, for everything else - not so much. As a vague idea, if there
> was a way to condense the input domain through a verb, possibly a dyad to
> pass an additional state, it would considerably expand the use of ;: with
> some performance hit of course.
> Also the code utilizing ;: is pretty much unreadable even by J standards.
> That was my initial impressions about it.
>
> 2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <[email protected]>:
>
> > A more general procedure than your request is to cut your data such that
> > your start/end segments are in odd positions
> >
> > in jpp, https://github.com/Pascal-J/jpp
> >
> > cutP is a process for cutting on start and end tokens, though there are
> > faster methods in included fsm.ijs file. And that process could get
> > significant boost if ;: were enhanced to support emitting empty boxes, but:
> >
> >    cutP '(asdf)g()'
> > ++----+-+++
> > ||asdf|g|||
> > ++----+-+++
> >
> > cutP is dyadic for start and end tokens other than '()'.
> >
> > also from jpp, the AltM adverb takes a gerund to apply cyclically to such
> > an above cut structure.
> >
> > a:"_`u AltM would produce empties for non-odd positions.
> >
> > But if you only care about the selections, then either regex, or a ;:
> > definition can extract them.
> >
> > ________________________________
> > From: Danil Osipchuk <[email protected]>
> > To: Programming forum <[email protected]>
> > Sent: Sunday, June 11, 2017 7:19 AM
> > Subject: [Jprogramming] Apply at start/lengths pairs
> >
> > Hi all,
> >
> > I wonder if there is an idiomatic way to apply a verb using an array of
> > start and length pairs. This is a recurring pattern when extracting data
> > from files.
> > I've tried 3 adverbs (the example at the end), and the first one is
> > slightly better on big files, but I'm still looking for possible
> > improvements (the need is to extract selected fields from multi-gigabyte
> > memory mapped csv/xml files)
> >
> >    'ab' xmlTagContentSL XML
> > 11 1
> > 22 2
> > 34 3
> > 47 4
> >
> >    'ab' <xmlTagDo XML
> > +-+--+---+----+
> > |1|20|300|4000|
> > +-+--+---+----+
> >
> >    (2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
> > +---+----+
> > |1  |20  |
> > +---+----+
> > |300|4000|
> > +---+----+
> >
> > regards,
> > Danil
> >
> > doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
> > NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
> > NB. doSL =: 1 : '(u;.0~ ,.)~"1'
> >
> > xmlTagOpn =: '<' ,'>',~]
> > xmlTagCls =: '</','>',~]
> >
> > xmlTagContentSL =: 4 : 0
> > CS =. (xmlTagOpn >x) (#@[ + I.@E.) y
> > CE =. (xmlTagCls >x) I.@E. y
> > CS ,. CE-CS
> > )
> >
> > xmlTagDo =: 1 : '(xmlTagContentSL (u doSL) ])f.'
> >
> > XML =: 0 : 0
> > <data>
> > <ab>1</ab>
> > <ab>20</ab>
> > <ab>300</ab>
> > <ab>4000</ab>
> > </data>
> > )

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
