Re: [Jprogramming] Apply at start/lengths pairs

Raul Miller Mon, 12 Jun 2017 05:59:34 -0700

Which xml utilities or library? I thought J had stopped supporting xml
implementations?


-- 
Raul

On Sun, Jun 11, 2017 at 10:24 PM, bill lam <[email protected]> wrote:
> Your regex is not an exact duplicate of your J. In your J, opn and cls are
> searched separately, which allows malformed tags, whereas your regex
> searches for opn and cls together as a pair. That said, I would try xml
> utilities or a library if the size of the xml data is large.
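
To make the difference concrete, here is a minimal sketch on a made-up malformed input (the names opn, cls and bad are illustrative assumptions, not taken from the thread). Searching for the opening and closing tags separately finds every occurrence of each, whether or not they pair up:

```j
opn =: '<ab>'
cls =: '</ab>'
bad =: '<ab>1<ab>2</ab>'   NB. hypothetical input: the first <ab> is never closed
opn I.@E. bad              NB. start index of every opening tag: 0 5
cls I.@E. bad              NB. start index of every closing tag: 10
NB. The counts differ, so item-by-item pairing is not meaningful here.
(# opn I.@E. bad) = # cls I.@E. bad   NB. 0
```

With unequal counts like these, a subtraction such as CE-CS would signal a length error, so a cheap count comparison of this kind may be worth running before pairing the positions.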
>
> On 11 Jun, 2017 11:17 pm, "Danil Osipchuk" <[email protected]> wrote:
>
> Frankly, I didn't even consider using regexp for parsing production data
> (gigabytes/millions of records). Regexp must have plenty of overhead from
> passing back and forth between the library and the J engine, not to mention
> the logic needed to support its complex semantics. This is all compared
> to a simple I.@E., which I believe to be highly polished. I'm not sure
> regexp will work for GB files at all.
>
> I've just tried regexp on a toy chunk of data I have at hand; indeed it is
> a lot slower.
>
>    dir''
> ...
> jqt.sh 126 06-Jan-17 07:25:22
> mi-1b.xml 222153 05-Mar-17 22:34:10
> ...
>
>    6!:2 'a=: (<''transmittedOctets'') #xmlTagDo fread''mi-1b.xml'' '
> 0.000931
>    rx=: rxcomp '\<transmittedOctets\>(.*)\</transmittedOctets\>'
>    6!:2 'b =: #@> (rx;,1) rxall fread ''mi-1b.xml'' '
> 0.064589
>    a-:b
> 1
>    6!:2 'b =: (rx;,1) #rxapply fread ''mi-1b.xml'' '
> |domain error
> | ;,({."1 p),.(#p){.(#m)$x
>
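
Single-run timings like the ones quoted above can vary from run to run. As a hedged sketch, the dyadic form of 6!:2 reports the mean over several executions; the input below is synthetic (txt and opn are illustrative assumptions standing in for the real file and tag):

```j
NB. Synthetic input: 20000 repetitions of an 11-character record.
txt =: 220000 $ '<ab>1</ab> '
opn =: '<ab>'
NB. Mean time in seconds over 10 executions of the search.
10 (6!:2) 'opn I.@E. txt'
NB. Number of matches found, as a sanity check alongside the timing.
# opn I.@E. txt
```

The same pattern can be applied to the regex sentence, so that both approaches are averaged over the same number of runs.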
> 2017-06-11 17:17 GMT+03:00 'Pascal Jasmin' via Programming <
> [email protected]>:
>
>> emit empty word is strongly needed IMO.
>>
>> I will probably write a mini-regex implementation for ;: (with *?+
>> support) suitable for tag extraction that should be faster than regex.
>>
>> But the basics of that are to start word after opening string, use ev (no
>> start) at start of closing string, but then ew with no start when end of
>> closing string.
>>
>> But have you confirmed that regex is too slow for tag extraction?
>>
>>
>>
>>
>> ________________________________
>> From: Danil Osipchuk <[email protected]>
>> To: Programming forum <[email protected]>
>> Sent: Sunday, June 11, 2017 9:59 AM
>> Subject: Re: [Jprogramming] Apply at start/lengths pairs
>>
>>
>>
>> I could not find a cutP definition with a quick look, but from your
>> example it seems that by token you mean a character separator. That is
>> not general enough.
>>
>> Also, imagine a fluffy xml file with millions of records, where only a
>> minority of fields of different types in the records are interesting, and
>> some nodes are missing fields of the type you are interested in.
>> Parsing the whole file is plainly unfeasible because of performance and
>> the complexity of the resulting code.
>>
>> Applying at selected positions, obtained and reshaped by whatever means,
>> however works rather well and is easy to reason about.
>>
>> As for fsm, I do remember that I found ;: inconvenient when I was trying
>> to apply it. One issue was that there is no way to emit an empty word.
>> The other was that you have to have every possible value of the input
>> domain represented as a row. When the domain is characters the mapping is
>> manageable; for everything else, not so much. As a vague idea, if there
>> were a way to condense the input domain through a verb, possibly a dyad to
>> pass an additional state, it would considerably expand the use of ;:, with
>> some performance hit of course.
>> Also, code utilizing ;: is pretty much unreadable even by J standards.
>> Those were my initial impressions of it.
>>
>>
>> 2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <
>> [email protected]>:
>>
>> > A more general procedure than your request is to cut your data such that
>> > your start/end segments are in odd positions
>> >
>> >
>> > in jpp, https://github.com/Pascal-J/jpp
>> >
>> > cutP is a process for cutting on start and end tokens, though there are
>> > faster methods in the included fsm.ijs file. That process could get a
>> > significant boost if ;: were enhanced to support emitting empty boxes,
>> > but:
>> >
>> > cutP '(asdf)g()'
>> > ++----+-+++
>> > ||asdf|g|||
>> > ++----+-+++
>> >
>> > cutP is dyadic for start and end tokens other than '()'.
>> >
>> >
>> > Also from jpp, the AltM adverb takes a gerund to apply cyclically to
>> > such a cut structure.
>> >
>> >
>> > a:"_`u AltM would produce empties for non-odd positions.
>> >
>> > But if you only care about the selections, then either regex, or a ;:
>> > definition can extract them.
>> >
>> >
>> > ________________________________
>> > From: Danil Osipchuk <[email protected]>
>> > To: Programming forum <[email protected]>
>> > Sent: Sunday, June 11, 2017 7:19 AM
>> > Subject: [Jprogramming] Apply at start/lengths pairs
>> >
>> >
>> >
>> > Hi all,
>> >
>> > I wonder if there is an idiomatic way to apply a verb using an array of
>> > start and length pairs. This is a recurring pattern when extracting data
>> > from files.
>> >
>> > I've tried 3 adverbs (the example at the end), and the first one is
>> > slightly better on big files, but I'm still looking for possible
>> > improvements (the need is to extract selected fields from multi-gigabyte
>> > memory mapped csv/xml files)
>> >
>> >    'ab' xmlTagContentSL XML
>> > 11 1
>> > 22 2
>> > 34 3
>> > 47 4
>> >    'ab' <xmlTagDo XML
>> > +-+--+---+----+
>> > |1|20|300|4000|
>> > +-+--+---+----+
>> >    (2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
>> > +---+----+
>> > |1  |20  |
>> > +---+----+
>> > |300|4000|
>> > +---+----+
>> >
>> > regards,
>> > Danil
>> >
>> > doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
>> > NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
>> > NB. doSL =: 1 : '(u;.0~ ,.)~"1'
>> >
>> > xmlTagOpn =: '<' ,'>',~]
>> > xmlTagCls =: '</','>',~]
>> >
>> > xmlTagContentSL =: 4 : 0
>> > CS =. (xmlTagOpn >x) (#@[ + I.@E.) y
>> > CE =. (xmlTagCls >x) I.@E. y
>> > CS ,. CE-CS
>> > )
>> >
>> > xmlTagDo =: 1 : '(xmlTagContentSL (u doSL) ])f.'
>> >
>> > XML =: 0 : 0
>> > <data>
>> > <ab>1</ab>
>> > <ab>20</ab>
>> > <ab>300</ab>
>> > <ab>4000</ab>
>> > </data>
>> > )
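
As a worked illustration of the start/length idea in the message above, here is a small sketch on a toy string. The names txt and sl and their data are made up for the example; doSL is repeated from the message (with spaces added) so the snippet runs on its own.

```j
doSL =: 1 : '(,."1@[) u;.0 ]'   NB. as defined in the message above
txt =: 'abcdefgh'               NB. toy data (assumption)
sl =: 2 2 $ 0 3 4 2             NB. each row is start,length: (0 3) and (4 2)
sl <doSL txt                    NB. boxes 'abc' and 'ef'
sl #doSL txt                    NB. lengths of the selections: 3 2
NB. A single pair written directly with ;.0: a one-column table,
NB. start in the first row and length in the second.
(,. 4 2) ];.0 txt               NB. 'ef'
```

Each row of sl is turned into a 2-by-1 table, which is the left argument shape that ;.0 expects for a list, and the verb is then applied once per selection.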
>> >
>> > ----------------------------------------------------------------------
>> >
>> > For information about J forums see http://www.jsoftware.com/forums.htm
