Re: [Jprogramming] Apply at start/lengths pairs

Henry Rich Sun, 11 Jun 2017 17:17:13 -0700

There has been a good bit written about this, and some of it has made it into System/Interpreter/Requests. If you are asking for more than is asked for there, please add to that page.


Henry Rich


On 6/11/2017 10:17 AM, 'Pascal Jasmin' via Programming wrote:

The ability to emit an empty word is strongly needed, IMO.

I will probably write a mini-regex implementation for ;: (with *?+ support),
suitable for tag extraction, that should be faster than regex.

The basics of that are to start a word after the opening string, use ev (no
start) at the start of the closing string, and then ew with no start at the
end of the closing string.

But have you confirmed that regex is too slow for tag extraction?




________________________________
From: Danil Osipchuk <[email protected]>
To: Programming forum <[email protected]>
Sent: Sunday, June 11, 2017 9:59 AM
Subject: Re: [Jprogramming] Apply at start/lengths pairs



I could not find a cutP definition with a quick look, but from your example
it seems like you mean a character separator by token. It is not general
enough.

Also, imagine a fluffy xml file with millions of records, where only a
minority of fields of different types in the records are interesting, and
some nodes are missing fields of the type you are interested in.
Parsing the whole file is plainly infeasible because of performance and
the complexity of the resulting code.

Applying a verb at selected positions, obtained and reshaped by whatever
means, works rather well, however, and is easy to reason about.

As for fsm, I do remember that I found ;: inconvenient when I was trying
to apply it, and one issue was that there is no way to emit an empty word.
The other was that you have to have every possible value of the input domain
represented as a row. When the domain is characters the mapping is
manageable; for everything else, not so much. As a vague idea, if there
were a way to condense the input domain through a verb, possibly a dyad to
pass an additional state, it would considerably expand the use of ;:, with
some performance hit of course.
Also, code utilizing ;: is pretty much unreadable even by J standards.
Those were my initial impressions of it.


2017-06-11 15:43 GMT+03:00 'Pascal Jasmin' via Programming <
[email protected]>:

A more general procedure than your request is to cut your data such that
your start/end segments are in odd positions


in jpp, https://github.com/Pascal-J/jpp

cutP is a process for cutting on start and end tokens, though there are
faster methods in the included fsm.ijs file. That process could get a
significant boost if ;: were enhanced to support emitting empty boxes, but:

cutP '(asdf)g()'
++----+-+++
||asdf|g|||
++----+-+++

cutP is dyadic for start and end tokens other than '()'.


also from jpp, the AltM adverb takes a gerund to apply cyclically to such
an above cut structure.


a:"_`u AltM would produce empties for non-odd positions.

But if you only care about the selections, then either regex, or a ;:
definition can extract them.
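
A hedged sketch of the regex route, assuming the standard J regex library (loaded with require 'regex') and its rxmatches verb, which reports each match as (start, length) rows for the whole match and for each parenthesized group. It reuses the XML noun and the doSL adverb from the original post quoted further down; the names m and sl are hypothetical.

```j
require 'regex'

NB. m holds one 2-by-2 table per match: row 0 is the whole match,
NB. row 1 is the parenthesized group; columns are start and length.
m =: '<ab>([^<]*)</ab>' rxmatches XML

NB. Keep only the group rows, giving a start/length table.
sl =: 1 {"2 m

NB. Apply a verb at those pairs with doSL from the original post.
sl <doSL XML
```

If the library behaves as assumed, sl should equal the result of 'ab' xmlTagContentSL XML. Whether this is fast enough on multi-gigabyte files is the open question here and would need timing.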


________________________________
From: Danil Osipchuk <[email protected]>
To: Programming forum <[email protected]>
Sent: Sunday, June 11, 2017 7:19 AM
Subject: [Jprogramming] Apply at start/lengths pairs



Hi all,

I wonder if there is an idiomatic way to apply a verb using an array of
start and length pairs. This is a recurring pattern when extracting data
from files.
I've tried 3 adverbs (the example at the end), and the first one is
slightly better on big files, but I'm still looking for possible
improvements (the need is to extract selected fields from multi-gigabyte
memory mapped csv/xml files).

'ab' xmlTagContentSL XML
11 1
22 2
34 3
47 4
'ab' <xmlTagDo XML
+-+--+---+----+
|1|20|300|4000|
+-+--+---+----+
(2 2 $ 'ab'xmlTagContentSL XML) <doSL XML
+---+----+
|1  |20  |
+---+----+
|300|4000|
+---+----+

regards,
Danil


doSL =: 1 : '(,."1@[)u;.0]' NB. SL stands for start len pair
NB. doSL =: 1 : '(0|:[:,:[)u;.0]'
NB. doSL =: 1 : '(u;.0~ ,.)~"1'



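xmlTagOpn =: '<' ,'>',~]    NB. 'ab' -> '<ab>'
xmlTagCls =: '</','>',~]    NB. 'ab' -> '</ab>'

NB. x is a tag name (boxed or not), y is the text.
NB. Result: table of (start, length) rows for the content of each tag.
xmlTagContentSL =: 4 : 0
CS =. (xmlTagOpn >x) (#@[ + I.@E.) y    NB. content starts just after each opening tag
CE =. (xmlTagCls >x) I.@E. y            NB. content ends where each closing tag begins
CS ,. CE-CS
)

NB. Adverb: apply u to the content of every tag x found in y.
xmlTagDo =: 1 : '(xmlTagContentSL (u doSL) ])f.'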



XML =: 0 : 0
<data>
<ab>1</ab>
<ab>20</ab>
<ab>300</ab>
<ab>4000</ab>
</data>
)
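
As a minimal sketch of what doSL relies on: the dyad x u;.0 y takes a two-row table x whose first row holds the start index and whose second row holds the length, one column per axis of y. The names s and pairs below are hypothetical and not from the original post.

```j
s =: 'abcdefghij'               NB. hypothetical sample string

NB. One start/length pair: start at index 2, take 3 characters.
NB. ,. turns the pair into the 2-by-1 table that ;.0 expects.
(,. 2 3) ];.0 s                 NB. 'cde'

NB. Several pairs: each row of pairs is one (start, length) pair.
NB. ,."1 makes a 2-by-1 table from every row, as doSL does.
pairs =: 3 2 $ 0 2  2 3  5 4
(,."1 pairs) <;.0 s             NB. boxes: 'ab' , 'cde' , 'fghi'
```

The last line is the first doSL definition written out with u replaced by <, so pairs <doSL s should give the same three boxes.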

----------------------------------------------------------------------

For information about J forums see http://www.jsoftware.com/forums.htm