Frankly, I didn't even consider using regexp for parsing production data
(gigabytes, millions of records). Regexp must carry plenty of overhead from
passing data back and forth between the library and the J engine, not to
mention all the logic needed to support its complex semantics. Compare that
with a simple I.@E., which I believe to be highly polished. I'm not sure
regexp will work on GB-sized files at all.
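
A minimal sketch of the I.@E. idea on made-up data. The names txt, opn, cls,
st, en and sl are invented for this illustration, and the sketch assumes that
every opening tag has a matching closing tag, in order and without nesting.

```j
NB. Made-up sample text; the tag name ab is only an example.
txt =: '<r><ab>1</ab><ab>20</ab></r>'
opn =: '<ab>'
cls =: '</ab>'

st =: (#opn) + I. opn E. txt   NB. index just past each opening tag: 7 17
en =: I. cls E. txt            NB. index of each closing tag: 8 19
sl =: st ,. en - st            NB. one (start, length) row per tag: 7 1 and 17 2

NB. x u;.0 y wants a 2-row table (starts over lengths), so each
NB. (start, length) row is turned into a 2x1 table before the cut.
(,."1 sl) <;.0 txt             NB. boxes '1' and '20'
```

Everything here is a primitive working on the whole text at once, with no
per-match call into an external library.
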
I've just tried regexp on a toy chunk of data I have at hand, and it is indeed
a lot slower (about 0.065 s against 0.0009 s on a 222 KB file):
   dir''
...
jqt.sh 126 06-Jan-17 07:25:22
mi-1b.xml 222153 05-Mar-17 22:34:10
...
   6!:2 'a=: (<''transmittedOctets'') #xmlTagDo fread''mi-1b.xml'' '
0.000931
   rx=: rxcomp '\<transmittedOctets\>(.*)\</transmittedOctets\>'
   6!:2 'b =: #@> (rx;,1) rxall fread ''mi-1b.xml'' '
0.064589
   a-:b
1
   6!:2 'b =: (rx;,1) #rxapply fread ''mi-1b.xml'' '
|domain error
| ;,({."1 p),.(#p){.(#m)$x
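
The domain error above looks like it comes from the step where rxapply
splices the verb's results back into the text. If so, a numeric result such
as the one from # cannot be joined with the character pieces. That reading is
a guess from the error line, not something checked against the regex script.
A possible workaround, sketched under the assumption that rxmatches returns
(start, length) rows and honours the (pattern ; index list) left argument the
same way rxall does, is to take the positions from the regexp and apply the
verb with ;.0. The sample text and the names txt, pat and sl are made up for
this sketch.

```j
require 'regex'

NB. Made-up sample text, one tag per line.
txt =: '<r>',LF,'<ab>1</ab>',LF,'<ab>20</ab>',LF,'</r>',LF
NB. [^<]* instead of .* so a match cannot run past the closing tag.
pat =: '<ab>([^<]*)</ab>'

NB. Assumption: each match yields a (start, length) pair for subexpression 1.
NB. Ravel and re-cut so sl is an n-by-2 table whatever the exact result shape.
sl =: _2 ]\ , (pat ; ,1) rxmatches txt

NB. Apply # at each (start, length) pair; this should give the content lengths.
(,."1 sl) #;.0 txt
```

This uses the regexp only to locate the fields, so it does not remove the
matching cost measured above.
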
2017-06-11 17:17 GMT+03:00 'Pascal Jasmin' via Programming <
[email protected]>:
> An option to emit an empty word is strongly needed, IMO.
>
> I will probably write a mini-regex implementation for ;: (with *?+
> support) suitable for tag extraction that should be faster than regex.
>
> But the basics of that are to start a word after the opening string, use ev
> (no start) at the start of the closing string, and then ew with no start at
> the end of the closing string.
>
> But have you confirmed that regex is too slow for tag extraction?
>
>
>
>
> ________________________________
> From: Danil Osipchuk <[email protected]>
> To: Programming forum <[email protected]>
> Sent: Sunday, June 11, 2017 9:59 AM
> Subject: Re: [Jprogramming] Apply at start/lengths pairs
>
>
>
> I could not find a cutP definition with a quick look, but from your example
> it seems that by "token" you mean a single-character separator. That is not
> general enough.
>
> Also, imagine a fluffy xml file with millions of records, where only a
> minority of the fields in each record are interesting, they are of different
> types, and some nodes are missing the fields you are interested in.
> Parsing the whole file is plainly infeasible because of the performance cost
> and the complexity of the resulting code.
>
> Applying a verb at selected positions, obtained and reshaped by whatever
> means, works rather well, however, and is easy to reason about.
>
> As for fsm, I do remember that I found ;: inconvenient when I was trying
> to apply it. One issue was that there is no way to emit an empty word.
> The other was that every possible value of the input domain has to be
> represented as a row. When the domain is characters the mapping is
> manageable; for everything else, not so much. As a vague idea, if there
> were a way to condense the input domain through a verb, possibly a dyad to
> pass an additional state, it would considerably expand the use of ;:, with
> some performance hit of course.
> Also, code utilizing ;: is pretty much unreadable even by J standards.
> Those were my initial impressions of it.
>
>
> 2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <
> [email protected]>:
>
> > A more general procedure than your request is to cut your data such that
> > your start/end segments are in odd positions
> >
> >
> > in jpp, https://github.com/Pascal-J/jpp
> >
> > cutP is a process for cutting on start and end tokens, though there are
> > faster methods in the included fsm.ijs file. And that process could get a
> > significant boost if ;: were enhanced to support emitting empty boxes,
> > but:
> >
> > cutP '(asdf)g()'
> > ++----+-+++
> > ||asdf|g|||
> > ++----+-+++
> >
> > cutP is dyadic for start and end tokens other than '()'.
> >
> >
> > also from jpp, the AltM adverb takes a gerund to apply cyclically to such
> > an above cut structure.
> >
> >
> > a:"_`u AltM would produce empties for non-odd positions.
> >
> > But if you only care about the selections, then either regex, or a ;:
> > definition can extract them.
> >
> >
> > ________________________________
> > From: Danil Osipchuk <[email protected]>
> > To: Programming forum <[email protected]>
> > Sent: Sunday, June 11, 2017 7:19 AM
> > Subject: [Jprogramming] Apply at start/lengths pairs
> >
> >
> >
> > Hi all,
> >
> > I wonder if there is an idiomatic way to apply a verb using an array of
> > start and length pairs. This is a recurring pattern when extracting data
> > from files.
> >
> > I've tried 3 adverbs (the example at the end), and the first one is
> > slightly better on big files, but I'm still looking for possible
> > improvements (the need is to extract selected fields from multi-gigabyte
> > memory mapped csv/xml files)
> >
> >    'ab' xmlTagContentSL XML
> > 11 1
> > 22 2
> > 34 3
> > 47 4
> >
> >    'ab' <xmlTagDo XML
> > +-+--+---+----+
> > |1|20|300|4000|
> > +-+--+---+----+
> >
> >    (2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
> > +---+----+
> > |1  |20  |
> > +---+----+
> > |300|4000|
> > +---+----+
> >
> > regards,
> > Danil
> >
> > doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
> > NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
> > NB. doSL =: 1 : '(u;.0~ ,.)~"1'
> >
> > xmlTagOpn =: '<' ,'>',~]
> > xmlTagCls =: '</','>',~]
> >
> > xmlTagContentSL =: 4 : 0
> > CS =. (xmlTagOpn >x) (#@[ + I.@E.) y
> > CE =. (xmlTagCls >x) I.@E. y
> > CS ,. CE-CS
> > )
> >
> > XML =: 0 : 0
> > <data>
> > <ab>1</ab>
> > <ab>20</ab>
> > <ab>300</ab>
> > <ab>4000</ab>
> > </data>
> > )
> >
> > ----------------------------------------------------------------------
> >
> > For information about J forums see http://www.jsoftware.com/forums.htm