Re: [Jchat] feedback on parsing file approach

Joe Bogner Sun, 12 Jan 2014 05:36:19 -0800

I like the approach of classifying the text. It seems more generic and
allows different metrics to be captured outside of the scanning
routine.


I went down this path and got stuck because I think my state needs to
also have knowledge of the bracket depth. Here is my failed attempt at
thinking through it:

Starting simpler, with the @{ } blocks and ignoring the script tags for a moment

I think the sequential machine needs to do something like this for
each character

s is the current state
c is the current character

initial state = S_NONE = 0
S_AT = 1
S_CODE = 2
S_LEFT = 3
S_RIGHT = 4

A series of functions that return the state transition; if none
applies, keep the current state

(s=S_NONE & c='@') -> S_AT                     NB. Trigger start @{
(s=S_AT & c='{') -> S_CODE                     NB. Identify @{
(s=S_CODE & c='{') -> S_LEFT                 NB. Trigger starting brace if () {
(s=S_LEFT & c='}') -> S_RIGHT
(s=S_RIGHT) -> S_CODE                          NB. Needed for ending brace
(s=S_CODE & c='}') -> S_NONE
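
A minimal sketch of the rules above, written in Python rather than J
just to make the transitions easy to test. The names step and trace
are hypothetical helpers, not part of any existing implementation, and
the rules are copied as listed, including their limitations.

```python
# State numbers as listed above.
S_NONE, S_AT, S_CODE, S_LEFT, S_RIGHT = range(5)

def step(s, c):
    """Return the next state for current state s and character c."""
    if s == S_NONE and c == "@":
        return S_AT        # trigger start of @{
    if s == S_AT and c == "{":
        return S_CODE      # identified @{
    if s == S_CODE and c == "{":
        return S_LEFT      # starting brace, e.g. if () {
    if s == S_LEFT and c == "}":
        return S_RIGHT
    if s == S_RIGHT:
        return S_CODE      # back to code after an ending brace
    if s == S_CODE and c == "}":
        return S_NONE
    return s               # no rule applies: keep the current state

def trace(text):
    """Return the list of states, one per character of text."""
    states = []
    s = S_NONE
    for c in text:
        s = step(s, c)
        states.append(s)
    return states

print("".join(str(s) for s in trace("@{ foo; if (abc) { q; } }")))
```

With a second level of nesting, as in @{ foo; if (abc) { if (q) { m; } } },
these rules return to S_NONE at the second closing brace rather than
the last one, because S_LEFT has no rule for another '{'.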

That would handle this case

@{ foo; if (abc) { q; } }

It would be something like

1222222...3333333420

Assuming it was properly classified, I could count the depth by
counting up the consecutive 3s and 2s before a 4. I could count the
block size by counting the states greater than 0.

I could count the blocks by pairs of 12 (or something like that)
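
A rough sketch of those two counts, in Python rather than J. It
assumes the classification is already available as a list of state
numbers, one per character; block_sizes and count_blocks are
hypothetical names. Note that under this scheme the final '}' of a
block is classified 0, so it is not included in the size.

```python
from itertools import groupby

def block_sizes(states):
    """Length of each maximal run of states greater than 0."""
    return [
        sum(1 for _ in run)
        for nonzero, run in groupby(states, key=lambda s: s > 0)
        if nonzero
    ]

def count_blocks(states):
    """Number of places where state 1 is immediately followed by state 2."""
    return sum(1 for a, b in zip(states, states[1:]) if (a, b) == (1, 2))

# Two short hand-made blocks separated by unclassified text.
example = [0, 1, 2, 2, 0, 0, 1, 2, 2, 2, 0]
print(block_sizes(example), count_blocks(example))
```

The largest block is then max(block_sizes(states)) when the list is
not empty.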

Depth of brackets has me stuck conceptually, as in

@{ foo; if (abc) { if (q) { m; } } }

1222222...33333333333333420000

I manually assigned those numbers so they are probably wrong. It feels
like I need another state for code within a brace

I might be able to come up with a table if there's a state for each
brace depth or use ranges and greater than/less than

S_BRACE_L_1, S_BRACE_L_2 and S_BRACE_R_1 and S_BRACE_R_2 but that
doesn't seem practical

Or let the state transition logic use globals and let the state
transition function assign globals

This is where I got stuck.
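
For comparison, here is a sketch of the "extra variable" idea in
Python rather than J: the scanner keeps a depth counter next to the
state instead of encoding each depth as its own state. scan_blocks is
a hypothetical name. It assumes well-formed braces, treats the brace
of @{ as depth 1, and counts the opening @{ and closing } in the block
size; those conventions are assumptions and may not match my J or C#
versions.

```python
def scan_blocks(text):
    """Return (block count, max depth, max block size) for @{ } blocks."""
    blocks = 0
    max_depth = 0
    max_size = 0
    depth = 0      # 0 means we are outside any @{ } block
    size = 0       # characters seen in the current block
    prev = ""
    for c in text:
        if depth == 0:
            if prev == "@" and c == "{":
                depth = 1
                size = 2           # count the "@{" opener itself
                blocks += 1
                max_depth = max(max_depth, depth)
        else:
            size += 1
            if c == "{":
                depth += 1
                max_depth = max(max_depth, depth)
            elif c == "}":
                depth -= 1
                if depth == 0:     # back to indent 0: block is done
                    max_size = max(max_size, size)
        prev = c
    return blocks, max_depth, max_size

print(scan_blocks("@{ foo; if (abc) { if (q) { m; } } }"))
```

A finite set of states cannot track arbitrary nesting, which is why
the counter (or one state per depth, up to some limit) is needed.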

On Sun, Jan 12, 2014 at 6:50 AM, Joe Bogner <[email protected]> wrote:
> Thanks for the sequential machine implementation. I tested with
> different versions of the text block and it doesn't work as I
> expected, which means I either relayed the requirements wrong or there
> may be a bug
>
> For example, if I take out the first block of @{ }, it reports
>
>    calc2 text
> blocks 0
> max depth 1
> max block 25
> scripts 2
> max script 49
>
> text =: 0 : 0
> @{
> Response.Write('start');
> }
> <html>
> <script>
> alert('start');
> </script>
> <div id='Foo'>@Page.Foo</div>
> <script>
> alert($('#Foo').val());
> </script>
>
> </html>
> @{
> Response.Write('bye');
> }
> )
>
> My implementation posts the correct answer of two blocks - each pair
> of @{ and the } that gets back to indent = 0.
>
> It looks like yours possibly requires a brace inside the block to
> trigger it as a code block.  It also seems to be summing up the total
> amount of code and script characters instead of finding the largest one.
>
> The Trace looks helpful to debug.
>
> I've read through the dictionary and nuvoc a few times for sequential
> machine and I don't understand it well enough to help troubleshoot
> your implementation. I'll spend more time with it. I didn't want to go
> down that rabbit hole until I was sure it could provide a correct
> result.
>
> I thought about posting to programming but wasn't sure how
> philosophical it would get. Probably better to have started there and
> then migrate here if it was philosophical. Feel free to move it to
> programming since we're now on the details of the sequential machine
> implementation.
>
> Thanks again. I appreciate the opportunity to learn.
>
> On Sat, Jan 11, 2014 at 10:16 PM, Raul Miller <[email protected]> wrote:
>> Here's a draft that uses ;:
>>
>> https://gist.github.com/rdm/8380234
>>
>> (As an aside, perhaps this thread should be on programming? Or at
>> least, something to think about for next time...)
>>
>> Note that I get different character counts than you. Maybe I
>> misunderstood what you intended to count?
>>
>> Let me know if you want me to clarify or rewrite any of that.
>>
>> But, briefly, I am using the final states from a ;: trace to mark the
>> end of each "token" and then classifying the text based on that
>> analysis. Since this sequential machine is a bit bulky, I decided to
>> write a small application to build it rather than constructing it by
>> hand. Since I only care about the state trace, I use no-op for all
>> operations. Since I want the end state, I use 0 _1 0 0 for ijrd
>> instead of the default 0 _1 0 _1. This leaves me with my final state
>> being the "character position" after the last character in text (and
>> it's reported in the trace rather than being an error condition).
>>
>> Thanks,
>>
>> --
>> Raul
>>
>> On Sat, Jan 11, 2014 at 4:47 PM, Joe Bogner <[email protected]> wrote:
>>> Thank you for the thoughts. You summarized it well.
>>>
>>> I don't need to worry about attributes on the script tag for this use case.
>>> I am interested in quantifying how much embedded javascript is in each of
>>> the pages. I don't need to quantify external scripts. I know the code base
>>> doesn't use the type="javascript" attribute
>>>
>>> The braces should be well formed otherwise the c# razor file wouldn't
>>> compile. It is possible there may be an edgecase which can be found when I
>>> run it against all the files.
>>>
>>> I plan to use it to identify areas to refactor in the javascript/c# razor
>>> code base and then watch it improve over time. I also thought it would be
>>> interesting to use a concise and expressive language, J, to measure the
>>> more verbose code base. It doesn’t need to be precise in terms of
>>> characters. For example, it is ok if the script tag characters are counted
>>> as long as it's consistent. I will be using it to find large problem areas and
>>> then measure the improvement.
>>>
>>> I would be interested in seeing the sequential machine approach or any
>>> other more idiomatic method than mine. I am fairly satisfied with mine. It
>>> is fairly clear to me and can likely be extended if needed. I am trying to
>>> use J more in my day to day and that would help me learn and hopefully
>>> would be an interesting example for others.
>>>
>>> Thanks again
>>> On Jan 11, 2014 4:11 PM, "Raul Miller" <[email protected]> wrote:
>>>
>>>> I think I see how I would do that with a sequential machine. Let me
>>>> know if you want a working example.
>>>>
>>>> Briefly, though, you seem to have three kinds of token pairs:
>>>>
>>>> @{   }
>>>> {  }
>>>> <script> </script>
>>>>
>>>> The ambiguity between the first two is problematic, in the context of
>>>> errors, but does not matter in well formed cases. A bigger problem in
>>>> the wild might be that you do not allow for attributes on the script
>>>> tag.
>>>>
>>>> Also, you care about the number of characters between <script>
>>>> </script> so those characters should be saved as "tokens" even if they
>>>> are not curly braces. You care about {} between both @{ } and <script>
>>>> </script> and outside them, and your implementation allows things like
>>>> @{ <script> } </script>.
>>>>
>>>> A full wart-for-wart compatible version would be painful to write. A
>>>> version which assumed well-formed cases would be much easier to write.
>>>> But before thinking about coding up an implementation it's probably
>>>> worth thinking about why you want to do this. The answer to that kind
>>>> of question can be really interesting and can help identify which
>>>> warts are unnecessary or possibly even detrimental.
>>>>
>>>> So, before I think any more about code, what are your thoughts on what
>>>> you want to accomplish?
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>>
>>>> On Sat, Jan 11, 2014 at 3:40 PM, Joe Bogner <[email protected]> wrote:
>>>> > I have about 300 code files (javascript and embedded code) that I want
>>>> > to collect some metrics on.  I've written the algorithm using an
>>>> > imperative style. I actually wrote it first in C# and translated to J
>>>> >
>>>> > Here is the code (posted a link for brevity):
>>>> >
>>>> > J version:
>>>> > https://gist.github.com/joebo/936ca5e2017c0a3b5c56
>>>> >
>>>> > C# version:
>>>> > https://gist.github.com/joebo/e7f8e3ca7bd21117e58d
>>>> >
>>>> > This is what it outputs
>>>> >
>>>> > calc''
>>>> > blocks 3
>>>> > max depth 2
>>>> > max block 113
>>>> > scripts 2
>>>> > max script 26
>>>> >
>>>> > Any suggestions on how to do it differently in J? I looked into the
>>>> > sequential machine some but couldn't figure out how to make it work
>>>> > (if it could) since my approach required knowledge of the brace depth.
>>>> >
>>>> > In terms of requirements:
>>>> > 1. Take a block of text
>>>> > 2. Identify the code blocks in the file (start with @{ and end with } )
>>>> > 3. Count the code blocks
>>>> > 4. Determine the max depth of the code block
>>>> > 5. Determine the max size of all the code blocks
>>>> > 6. Count the javascript blocks
>>>> > 7. Determine the max size of the javascript block
>>>> >
>>>> > Thanks for any feedback or input!
>>>> >
>>>> > Joe
>>>> > ----------------------------------------------------------------------
>>>> > For information about J forums see http://www.jsoftware.com/forums.htm