I didn't try to make the two examples match closely, and I've simplified the
code, leaving only the general idea as an illustration. The real code asserts
that the positions of opening and closing tags alternate, and a later
normalization step checks that each tag is indeed inside a container record,
selects only the relevant ones and does fills. It is still rather obvious
that regexp is much slower (I do use it when extracting data from logs).
As for going the library route: my first attempt was to apply an xslt
template using libxslt/libxml to convert selected fields into csv and then to
parse that. This resulted in bulky, somewhat clumsy and fragile code
(especially the 'declarative' part: a robust xslt proved difficult to get
right). Before digging into low-level libxml parsing, I tried the naive
approach outlined here. It worked well, and I'll try to stick with it.


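openTag =: '<' ,'>',~]          NB. 'ab' -> '<ab>'
closeTag =: '</','>',~]         NB. 'ab' -> '</ab>'
isSortedUp =: [:*./ 2 <:/\ ]    NB. 1 if the list is non-decreasing

tagStartEnd =: 4 : 0
TS =. (openTag >x) (#@[ + I.@E.) y    NB. content starts: just past each opening tag
TE =. (closeTag >x) I.@E. y           NB. content ends: where each closing tag begins
assert. TS =&# TE                     NB. as many opening as closing tags
assert. isSortedUp , [TSE =.|: TS ,: TE    NB. starts and ends alternate
TSE
)

startLenFromSE =:({.,:{:-{.)&.|:      NB. (start,end) rows -> (start,length) rows
tagStartLen =: startLenFromSE@:tagStartEnd

doSL =: 1 : '(,."1@[)u;.0]'           NB. apply u to each (start,length) slice of y
tagDo =: 1 : '(tagStartLen (u doSL) ])f.'

NB. not a big difference:
   6!:2 'a=: (<''transmittedOctets'') #tagDo fread''mi-1b.xml'' '
0.001385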
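
The normalization step mentioned above is not shown in this message. As a rough illustration only, here is a minimal sketch under assumptions of my own: a field is kept only if its start falls between the start and end of some container record, and a record without the field gets a 0 0 fill. The names tagSE, TXT, R, F, k, inRec, recIdx, FS, first and SE are hypothetical and are not taken from the real code.

```j
NB. Sketch only: every name defined here is hypothetical, chosen for illustration.
openTag  =: '<' , '>' ,~ ]
closeTag =: '</' , '>' ,~ ]
NB. n-by-2 table of (content start, content end) positions for tag x in text y
tagSE =: 4 : '((openTag x) (#@[ + I.@E.) y) ,. (closeTag x) I.@E. y'

TXT =: '<r><ab>1</ab></r><ab>9</ab><r></r><r><ab>300</ab></r>'
R =: 'r'  tagSE TXT    NB. container records
F =: 'ab' tagSE TXT    NB. candidate fields (the second one is outside any record)

NB. Interval index against the raveled record bounds s0 e0 s1 e1 ...
NB. An odd result means the field start lies between some s and its e.
k      =: (, R) I. {."1 F
inRec  =: 2 | k
recIdx =: <. -: inRec # k    NB. owning record of each kept field
FS     =: inRec # F          NB. fields that really sit inside a record

NB. One row per record: its first field, or a 0 0 fill when the field is missing.
first =: recIdx i. i. # R
SE    =: first { FS , 0 0
NB.    SE
NB.  7  8
NB.  0  0
NB. 41 44
```

The stray field between the first two records is dropped, and the empty second record gets the fill row, so the result stays aligned with the list of records.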


2017-06-12 5:24 GMT+03:00 bill lam <[email protected]>:

> your regex is not an exact duplicate of your J. In your J, opn and cls are
> searched separately, which allows malformed tags, whereas your regex
> searches for an opn and cls pair together. That said, I would try xml
> utilities or a library if the size of the xml data is large.
>
> On 11 Jun, 2017 11:17 pm, "Danil Osipchuk" <[email protected]>
> wrote:
>
> Frankly, I didn't even consider using regexp for parsing production data
> (gigabytes/millions of records). Regexp must have plenty of overhead from
> passing data back and forth between the library and the J engine, not to
> mention the whole machinery supporting its complex semantics. This is all
> compared to a simple I.@E., which I believe to be highly polished. I'm not
> sure regexp will work for GB files at all.
>
> I've just tried regexp on a toy chunk of data I have at hand; indeed, it is
> a lot slower.
>
>    dir''
> ...
> jqt.sh 126 06-Jan-17 07:25:22
> mi-1b.xml 222153 05-Mar-17 22:34:10
> ...
>    6!:2 'a=: (<''transmittedOctets'') #xmlTagDo fread''mi-1b.xml'' '
> 0.000931
>    rx=: rxcomp '\<transmittedOctets\>(.*)\</transmittedOctets\>'
>    6!:2 'b =: #@> (rx;,1) rxall fread ''mi-1b.xml'' '
> 0.064589
>    a-:b
> 1
>    6!:2 'b =: (rx;,1) #rxapply fread ''mi-1b.xml'' '
> |domain error
> | ;,({."1 p),.(#p){.(#m)$x
>
> 2017-06-11 17:17 GMT+03:00 'Pascal Jasmin' via Programming <
> [email protected]>:
>
> > emit empty word is strongly needed IMO.
> >
> > I will probably write a mini-regex implementation for ;: (with *?+
> > support) suitable for tag extraction that should be faster than regex.
> >
> > But the basics of that are to start word after opening string, use ev (no
> > start) at start of closing string, but then ew with no start when end of
> > closing string.
> >
> > But have you confirmed that regex is too slow for tag extraction?
> >
> >
> >
> >
> > ________________________________
> > From: Danil Osipchuk <[email protected]>
> > To: Programming forum <[email protected]>
> > Sent: Sunday, June 11, 2017 9:59 AM
> > Subject: Re: [Jprogramming] Apply at start/lengths pairs
> >
> >
> >
> > I could not find a cutP definition with a quick look, but from your
> > example it seems like you mean a character separator by token. It is not
> > general enough.
> >
> > Also, imagine a fluffy xml file with millions of records, where only a
> > minority of fields of different types in the records are interesting, and
> > some nodes are missing fields of the type you are interested in.
> > Parsing the whole file is plainly unfeasible because of the performance
> > and complexity of the resulting code.
> >
> > Applying at selected positions, obtained and reshaped by whatever means,
> > however works rather well and is easy to reason about.
> >
> > As for fsm, I do remember that I found ;: inconvenient when I was trying
> > to apply it. One issue was that there is no way to emit an empty word.
> > The other was that you have to have every possible value of the input
> > domain represented as a row. When the domain is characters the mapping is
> > manageable; for everything else, not so much. As a vague idea, if there
> > were a way to condense the input domain through a verb, possibly a dyad to
> > pass an additional state, it would considerably expand the use of ;:, with
> > some performance hit of course.
> > Also, code utilizing ;: is pretty much unreadable even by J standards.
> > Those were my initial impressions of it.
> >
> >
> > 2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <
> > [email protected]>:
> >
> > > A more general procedure than your request is to cut your data such
> > > that your start/end segments are in odd positions
> > >
> > > in jpp, https://github.com/Pascal-J/jpp
> > >
> > > cutP is a process for cutting on start and end tokens, though there are
> > > faster methods in included fsm.ijs file.  And that process could get
> > > significant boost if ;: were enhanced to support emitting empty boxes,
> > > but:
> > >
> > >    cutP '(asdf)g()'
> > > ++----+-+++
> > > ||asdf|g|||
> > > ++----+-+++
> > >
> > > cutP is dyadic for start and end tokens other than '()'.
> > >
> > > also from jpp, the AltM adverb takes a gerund to apply cyclically to
> > > such an above cut structure.
> > >
> > > a:"_`u AltM would produce empties for non-odd positions.
> > >
> > > But if you only care about the selections, then either regex, or a ;:
> > > definition can extract them.
> > >
> > >
> > > ________________________________
> > > From: Danil Osipchuk <[email protected]>
> > > To: Programming forum <[email protected]>
> > > Sent: Sunday, June 11, 2017 7:19 AM
> > > Subject: [Jprogramming] Apply at start/lengths pairs
> > >
> > >
> > >
> > > Hi all,
> > >
> > > I wonder if there is an idiomatic way to apply a verb using an array of
> > > start and length pairs. This is a recurring pattern when extracting data
> > > from files.
> > > I've tried 3 adverbs (the example at the end), and the first one is
> > > slightly better on big files, but I'm still looking for possible
> > > improvements (the need is to extract selected fields from multi-gigabyte
> > > memory mapped csv/xml files)
> > >
> > >    'ab' xmlTagContentSL XML
> > > 11 1
> > > 22 2
> > > 34 3
> > > 47 4
> > >
> > >    'ab' <xmlTagDo XML
> > > +-+--+---+----+
> > > |1|20|300|4000|
> > > +-+--+---+----+
> > >
> > >    (2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
> > > +---+----+
> > > |1  |20  |
> > > +---+----+
> > > |300|4000|
> > > +---+----+
> > >
> > > regards,
> > > Danil
> > >
> > > doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
> > > NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
> > > NB. doSL =: 1 : '(u;.0~ ,.)~"1'
> > >
> > > xmlTagOpn =: '<' ,'>',~]
> > > xmlTagCls =: '</','>',~]
> > >
> > > xmlTagContentSL =: 4 : 0
> > > CS =. (xmlTagOpn >x) (#@[ + I.@E.) y
> > > CE =. (xmlTagCls >x) I.@E. y
> > > CS ,. CE-CS
> > > )
> > >
> > > xmlTagDo =: 1 : '(xmlTagContentSL (u doSL) ])f.'
> > >
> > > XML =: 0 : 0
> > > <data>
> > > <ab>1</ab>
> > > <ab>20</ab>
> > > <ab>300</ab>
> > > <ab>4000</ab>
> > > </data>
> > > )
> > >
> > > ----------------------------------------------------------------------
> > >
> > > For information about J forums see http://www.jsoftware.com/forums.htm
