Re: [Jchat] feedback on parsing file approach

Raul Miller Sun, 12 Jan 2014 11:59:23 -0800

Yes, nicely done.

And, really, 1234 could appear in every row of the States table. Right
now it's only there for where tokens end, and that works because
you're not going to be using malformed or partially formed script
tags. 1234 in every row isn't necessary though, for your current examples
and requirements.
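
A minimal sketch of what "1234 in every row" would look like, using a
hypothetical 3-row stand-in (demo) rather than the real States table, and
assuming the first four columns are the classes for @ { } <:

```j
NB. demo is a made-up stand-in for States: columns are @ { } < other
demo =: 3 5 $ 1 2 3 4 0   0 0 0 0 0   1 2 3 4 0

NB. write 1 2 3 4 into columns 0..3 of every row (a: selects all rows)
every =: 1 2 3 4 (<a:;0 1 2 3)} demo

NB. check that every row of the result now begins 1 2 3 4
*./ 1 2 3 4 -:"1 (4 {."1 every)
```

The intent, as I read it, is that @ { } < would then be recognized from
any state, including part-way through a partially matched script tag.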


Thanks,

-- 
Raul

On Sun, Jan 12, 2014 at 2:48 PM, Joe Bogner <[email protected]> wrote:
> I think I have the state table figured out. I created a little image
> to help explain it.
>
> http://imgur.com/JT7MAhj
>
> I may post this to the wiki as an example. Thanks again
>
>
>
> On Sun, Jan 12, 2014 at 1:49 PM, Joe Bogner <[email protected]> wrote:
>> Great! Yes, that change counts the blocks the way I need it to.
>>
>> As you pointed out, the requirements weren't very well spec'd and led
>> to ambiguities. I had thought the implementation was relatively clear
>> and it wasn't a good assumption to think that it would be read. I
>> think tests would have worked well to more clearly illustrate the
>> expectations. I couldn't imagine writing a parser without them.
>>
>> Raul wrote:
>>> A related issue is that your measure of "depth" did not include @{ } -
>>> both immediately inside and immediately outside these tokens is depth
>>> 0, if I understand properly what you meant by depth.
>>
>> Yes, that's fine. My goal was to measure how deep the nesting was. It
>> doesn't matter to me if it's zero- or one-based.
>>
>> Thanks for the tips on how to incorporate logic for detecting { } in
>> strings.
>>
>> We actually weren't that far off on understanding considering how
>> little you needed to change to count blocks the way I had intended.
>>
>> The max blocks still isn't right, but that's OK. I will see if I can
>> fix it or start writing some tests to demonstrate it better.
>>
>> text =: 0 : 0
>> @{ if (foo) { } }
>> )
>>
>> shows max block 2
>>
>> text =: 0 : 0
>> @{ Response.Write("hi"); }
>> )
>>
>> shows max block 0, which leads me to believe it needs a brace inside
>> the code block to start counting. I would have assumed it would be
>> some number of characters close to # ' Response.Write("hi"); '
>>
>> 23
>>
>> I should be able to figure out most of how you did it but I'm stumped
>> on the State table. I think I understand classify except for one part
>> (summarized below)
>>
>>
>> classify=: 1}.Ends i. 2 {"1(5;(States,"+0);((<"+Chars),<a.-.Chars);0
>> _1 0 0) ;: ]
>>
>> Working right to left
>>
>> NB. I can figure this out later. Your explanation is good and the
>> dictionary covers it
>> ijrd =. 0 _1 0 0
>>
>> NB. The first part makes sense, the second part looks to be a bunch of
>> junk characters?
>> m=.((<"+Chars),<a.-.Chars)
>>
>> In a trivial example, it looks like it classifies the rows the same way:
>>
>>  m=.((<"+Chars))
>>  (y i.~;m) { (#m),~(#&>m)#i.#m NB. From the dictionary entry
>>
>> 0 1 12 12 12 5 9 12 12 5 12 12 12 7 8 10 12 12 12 12 8 12 12 12 12 2 12
>>
>> m=.((<"+Chars),<a.-.Chars)
>> (y i.~;m) { (#m),~(#&>m)#i.#m NB. From the dictionary entry
>>
>> 0 1 12 12 12 5 9 12 12 5 12 12 12 7 8 10 12 12 12 12 8 12 12 12 12 2 12
>>
>> What is the purpose of <a.-.Chars? Is that "every other character"
>> than what was specified?
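>>
>> One way to check this, as a rough sketch: it assumes Chars is
>> '@{}</script>' (the 12 column labels in the table below) and wraps the
>> dictionary expression in a hypothetical helper named classes.
>>
>> ```j
>> Chars =: '@{}</script>'   NB. assumption, not copied from the gist
>> NB. classes: x is the boxed mapping m, y is the text to classify
>> classes =: 4 : '(y i.~ ;x) { (#x) ,~ (#&>x) # i.#x'
>> m12 =: <"0 Chars          NB. one box per listed character
>> m13 =: m12 , <a.-.Chars   NB. plus one box holding all other characters
>> # a.-.Chars               NB. 244
>> (m12 classes '@{ s }') -: m13 classes '@{ s }'   NB. 1
>> ```
>>
>> Under that expression a character missing from every box gets class #m,
>> which is 12 for m12, and the extra box in m13 is also box 12, so the two
>> mappings would agree.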
>>
>> s=:(States,"+0)
>>
>> This adds the 0 operation to each of the States per your earlier note
>> of using 0 to no-op
>>
>> f=: 5
>> (f;s;m;ijrd) ;: text
>>
>>
>> NB. extract the 3rd column from the trace
>> cols=:2 {"1(f;s;m;ijrd) ;: text
>> 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0
>>
>> NB. turn the col back into the token number from appendToken
>> Ends i. cols
>>
>>
>> I think I understand all of that.
>>
>> The State table has a shape of 21 13, which is the length of the text
>> token we are looking for on the rows and length of the characters that
>> make up those tokens on the columns.
>>
>> # '@{}<script></script>'
>> 20
>>
>> # Chars
>> 12
>>
>> I took a stab at adding the character and token on the x & y axis. I
>> don't think I have it lined up quite right and I'm sure it doesn't
>> look great on e-mail. If you can help decrypt the table that would be
>> helpful as I am not following completely what appendToken is doing to
>> build it.
>>
>>
>>
>>   @ { } <  /  s  c  r  i  p  t  >
>> @ 1 2 3 4  0  0  0  0  0  0  0  0 0
>> { 1 2 3 4  0  0  0  0  0  0  0  0 0
>> } 1 2 3 4  0  0  0  0  0  0  0  0 0
>> < 1 2 3 4  0  0  0  0  0  0  0  0 0
>> s 0 0 0 0 13  5  0  0  0  0  0  0 0
>> c 0 0 0 0  0  0  6  0  0  0  0  0 0
>> r 0 0 0 0  0  0  0  7  0  0  0  0 0
>> i 0 0 0 0  0  0  0  0  8  0  0  0 0
>> p 0 0 0 0  0  0  0  0  0  9  0  0 0
>> t 0 0 0 0  0  0  0  0  0  0 10  0 0
>> > 0 0 0 0  0  0  0  0  0  0  0 11 0
>> < 1 2 3 4  0  0  0  0  0  0  0  0 0
>> / 0 0 0 0  0  0  0  0  0  0  0  0 0
>> s 0 0 0 0  0 14  0  0  0  0  0  0 0
>> c 0 0 0 0  0  0 15  0  0  0  0  0 0
>> r 0 0 0 0  0  0  0 16  0  0  0  0 0
>> i 0 0 0 0  0  0  0  0 17  0  0  0 0
>> p 0 0 0 0  0  0  0  0  0 18  0  0 0
>> t 0 0 0 0  0  0  0  0  0  0 19  0 0
>> > 0 0 0 0  0  0  0  0  0  0  0 20 0
>>  1 2 3 4  0  0  0  0  0  0  0  0 0
>>
>> States 5,6,7,8,9,10,11 must be used to track <script> and
>> 14,15,16,17,18,19,20 to track </script>.
>>
>> Not sure why there isn't a state 12
>>
>> Any guidance on the table would be appreciated. This is really cool.
>> Thanks again
>>
>>
>> On Sun, Jan 12, 2014 at 12:16 PM, Raul Miller <[email protected]> wrote:
>>> Ok...
>>>
>>> Translating what I think you are saying into implementation, I think
>>> you want to change
>>>
>>> smoutput 'blocks ',":Left -&(+/) Codes
>>>
>>> to
>>>
>>> smoutput 'blocks ',":+/ Codes
>>>
>>> "Left" was bits marking left curly brackets (there were 6 in your
>>> sample text) while Codes was bits marking instances of @{  (there were
>>> 3 in your sample text).
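>>>
>>> As a small illustration with made-up bit vectors (not taken from your
>>> text), and assuming Left also marks the brace that is part of each @{:
>>>
>>> ```j
>>> NB. hypothetical marks for a text like '@{ a } @{ b }'
>>> Left  =: 0 1 0 0 0 0 0 0 1 0 0 0 0   NB. every {
>>> Codes =: 1 0 0 0 0 0 0 1 0 0 0 0 0   NB. start of every @{
>>> Left -&(+/) Codes   NB. same as (+/ Left) - +/ Codes, here 0
>>> +/ Codes            NB. 2
>>> ```
>>>
>>> Under that assumption the first form only counts braces beyond the ones
>>> in @{, while the second counts the @{ themselves.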
>>>
>>> A related issue is that your measure of "depth" did not include @{ } -
>>> both immediately inside and immediately outside these tokens is depth
>>> 0, if I understand properly what you meant by depth.
>>>
>>> You will note, here, that I did not actually read your code very
>>> closely - that is because I was more interested in paraphrasing it
>>> than in copying it, and that means understanding what you were
>>> thinking more than understanding what you implemented. We sometimes
>>> approximate this process using requirements, sometimes using tests and
>>> perhaps in a variety of other ways.
>>>
>>> Also, it might help you to understand the code better if you replaced
>>> every =. in calc2 with =: (=. is great for isolating internal
>>> definitions in explicit verbs, but =: is much better for making things
>>> visible or ... explicit?).
>>>
>>> That said, we can exclude { and } which appear in irrelevant contexts
>>> by first declaring what those contexts are (double quoted strings?
>>> multi-line comments? single line comments?) and then adjusting the
>>> definition of State to distinguish them from the recognized instances
>>> of { and }.
>>>
>>> Let's say that I wanted to exclude { in double quoted strings. Here's
>>> an outline:
>>>
>>> (1) Include " in the definition of Chars
>>> (2) Introduce a new routine appendTokenPair which works like
>>> appendToken but leaves the sequential machine in an alternate state
>>> until receiving a second token.
>>> (3) use this new routine to include " ... " in our definition of State.
>>>
>>> Once this was working, using it for /* ... */ should be trivial,
>>> though the use of multi-character tokens might be an issue, depending
>>> on how appendTokenPair was implemented.
>>>
>>> The thing you need to watch out for, when working with parsers, is
>>> ambiguities. In this example, we had an ambiguity between @{ and {
>>> where hypothetically speaking they might be confused. This was one of
>>> my motivations for focusing on requirements instead of simply diving
>>> into the implementation.
>>>
>>> Being able to move from implementation to specification is not easy -
>>> I love focusing on the computer and I sometimes find human
>>> interactions painful (I do not like bothering people and while I might
>>> occasionally enjoy getting yelled at I find I need to do something to
>>> please people yelling at me after - or at least something I am
>>> comfortable interpreting as pleasing - going away seems to count,
>>> somehow. Mostly, though, I have a lot of respect for heads-down focus,
>>> even when it's taken too far.)
>>>
>>> Does this make sense?
>>>
>>> Thanks,
>>>
>>> --
>>> Raul
>>>
>>>
>>>
>>> On Sun, Jan 12, 2014 at 11:13 AM, Joe Bogner <[email protected]> wrote:
>>>> Sorry about that. My requirements were based on more contextual
>>>> knowledge than they probably should have been.
>>>>
>>>> To take a step back:
>>>>
>>>> In the C#/Razor template language, each code block is delimited by:
>>>>
>>>> @{
>>>>
>>>> }
>>>>
>>>> Within a block, you can add c# code to perform whatever functions your
>>>> page needs.
>>>>
>>>> @{
>>>>        if (Post) {
>>>>             Save();
>>>>        } else {
>>>>            DoSomethingElse();
>>>>       }
>>>> }
>>>>
>>>> A page can have multiple code blocks. And a code block can have an
>>>> infinite depth of branching, denoted by { }
>>>>
>>>> Poor code would have many blocks, or very large blocks, or very deep
>>>> nesting.
>>>>
>>>> @{
>>>>        if (Post) {
>>>>             if (Monday) {
>>>>                  if (After5PM) {
>>>>                          if (Before8PM) {
>>>>                             Save();
>>>>                        }
>>>>                  }
>>>>             }
>>>>
>>>>        } else {
>>>>            DoSomethingElse();
>>>>       }
>>>> }
>>>>
>>>>
>>>> A code block is a pair of @{ and }, where the closing } is the one
>>>> that brings the brace depth back to zero. Let me know if that's not
>>>> clear enough. Other templating languages like php make it easier.
>>>>
>>>> <?php
>>>>
>>>> if (Foo) {
>>>>
>>>> }
>>>>
>>>> ?>
>>>> <html>foo</html>
>>>>
>>>>
>>>> In PHP, you wouldn't need to worry about the curly brace depth for
>>>> determining code block start and end. It could be split on <?php ?>
>>>>
>>>> In razor, the @{ is the same as <? and the } when brace depth is zero
>>>> terminates the block
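>>>>
>>>> As a rough sketch of that rule in J (it ignores braces inside strings
>>>> and assumes every } that returns to depth zero closes a @{ block; the
>>>> names delta and depth are only for illustration):
>>>>
>>>> ```j
>>>> text  =: '@{ if (a) { if (b) { } } } <p> @{ x(); }'
>>>> delta =: (text = '{') - text = '}'   NB. +1 at each {, _1 at each }
>>>> depth =: +/\ delta                   NB. running brace depth
>>>> >./ depth                            NB. 3, counting the @{ itself
>>>> +/ (0 = depth) *. text = '}'         NB. 2 blocks closed at depth zero
>>>> ```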
>>>>
>>>> So I don't have an exact specification that I'm working towards. I'm
>>>> just trying to find out how many @{ } code blocks there are, how
>>>> deeply nested the code within is, and how large the largest block is.
>>>> For example, if it's more than 20 lines or X characters, it probably
>>>> belongs in a separate class or file
>>>>
>>>> Of course an edge case that would blow up would be if the code block
>>>> has a brace in a string
>>>>
>>>> @{
>>>>          if (Post) {
>>>>               Response.Write("will break a simple parser } } }} ");
>>>>          }
>>>> }
>>>> I don't think that would be extensive in this code. It's not going to
>>>> be used for anything of a critical nature other than to help improve
>>>> my personal code base - so if there are false positives or errors it's
>>>> OK. I'm looking for a "good enough" solution.
>>>>
>>>> Hope that helps. Feel free to cancel if I'm not getting progressively
>>>> more clear or if the problem is uninteresting to help solve.
>>>>
>>>> Thanks again
>>>>
>>>> Joe
>>>>
>>>>
>>>>
>>>>> On Sun, Jan 12, 2014 at 10:56 AM, Raul Miller <[email protected]> wrote:
>>>>> I quite possibly misunderstood your specifications.
>>>>>
>>>>> If I simply remove lines 2 and 11 from my gist, calc2 still reports
>>>>> three blocks. If I also remove the three blocks which appear between
>>>>> lines 2 and 11, calc2 will then report 0 blocks. Is that not what you
>>>>> wanted me to count?
>>>>>
>>>>> Meanwhile, I do not concern myself very much with whether the
>>>>> boundaries of a region of text are "inside" or "outside" that region.
>>>>> Instead, I go with what seems simple to implement and then use the
>>>>> requirements to tweak the code so that the result is correct. Of
>>>>> course, the limitation here is that I need to understand your
>>>>> requirements. Another limitation is that new requirements will require
>>>>> new code (or manual work) - but that seems to me to be unavoidable.
>>>>>
>>>>> I expect that once we share an understanding of your requirements that
>>>>> an explanation of how the code is structured will make more sense.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Raul
>>>>>
>>>>>
>>>>> On Sun, Jan 12, 2014 at 6:50 AM, Joe Bogner <[email protected]> wrote:
>>>>>> Thanks for the sequential machine implementation. I tested with
>>>>>> different versions of the text block and it doesn't work as I
>>>>>> expected, which means I either relayed the requirements wrong or there
>>>>>> may be a bug
>>>>>>
>>>>>> For example, if I take out the first block of @{ }, it reports
>>>>>>
>>>>>>    calc2 text
>>>>>> blocks 0
>>>>>> max depth 1
>>>>>> max block 25
>>>>>> scripts 2
>>>>>> max script 49
>>>>>>
>>>>>> text =: 0 : 0
>>>>>> @{
>>>>>> Response.Write('start');
>>>>>> }
>>>>>> <html>
>>>>>> <script>
>>>>>> alert('start');
>>>>>> </script>
>>>>>> <div id='Foo'>@Page.Foo</div>
>>>>>> <script>
>>>>>> alert($('#Foo').val());
>>>>>> </script>
>>>>>>
>>>>>> </html>
>>>>>> @{
>>>>>> Response.Write('bye');
>>>>>> }
>>>>>> )
>>>>>>
>>>>>> My implementation posts the correct answer of two blocks - each pair
>>>>>> of @{ and the } that gets back to indent = 0.
>>>>>>
>>>>>> It looks like yours requires possibly a brace in the block to trigger
>>>>>> it as a code block.  It also seems to be summing up the total amount
>>>>>> of code and script characters instead of finding the largest one.
>>>>>>
>>>>>> The Trace looks helpful to debug.
>>>>>>
>>>>>> I've read through the dictionary and nuvoc a few times for sequential
>>>>>> machine and I don't understand it well enough to help troubleshoot
>>>>>> your implementation. I'll spend more time with it. I didn't want to go
>>>>>> down that rabbit hole until I was sure it could provide a correct
>>>>>> result.
>>>>>>
>>>>>> I thought about posting to programming but wasn't sure how
>>>>>> philosophical it would get. Probably better to have started there and
>>>>>> then migrate here if it was philosophical. Feel free to move it to
>>>>>> programming since we're now on the details of the sequential machine
>>>>>> implementation.
>>>>>>
>>>>>> Thanks again. I appreciate the opportunity to learn.
>>>>>>
>>>>>> On Sat, Jan 11, 2014 at 10:16 PM, Raul Miller <[email protected]> wrote:
>>>>>>> Here's a draft that uses ;:
>>>>>>>
>>>>>>> https://gist.github.com/rdm/8380234
>>>>>>>
>>>>>>> (As an aside, perhaps this thread should be on programming? Or at
>>>>>>> least, something to think about for next time...)
>>>>>>>
>>>>>>> Note that I get different character counts than you. Maybe I
>>>>>>> misunderstood what you intended to count?
>>>>>>>
>>>>>>> Let me know if you want me to clarify or rewrite any of that.
>>>>>>>
>>>>>>> But, briefly, I am using the final states from a ;: trace to mark the
>>>>>>> end of each "token" and then classifying the text based on that
>>>>>>> analysis. Since this sequential machine is a bit bulky, I decided to
>>>>>>> write a small application to build it rather than constructing it by
>>>>>>> hand. Since I only care about the state trace, I use no-op for all
>>>>>>> operations. Since I want the end state, I use 0 _1 0 0 for ijrd
>>>>>>> instead of the default 0 _1 0 _1. This leaves me with my final state
>>>>>>> being the "character position" after the last character in text (and
>>>>>>> it's reported in the trace rather than being an error condition).
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> --
>>>>>>> Raul
>>>>>>>
>>>>>>> On Sat, Jan 11, 2014 at 4:47 PM, Joe Bogner <[email protected]> wrote:
>>>>>>>> Thank you for the thoughts. You summarized it well.
>>>>>>>>
>>>>>>>> I don't need to worry about attributes on the script tag for this use
>>>>>>>> case. I am interested in quantifying how much embedded javascript is in
>>>>>>>> each of the pages. I don't need to quantify external scripts. I know the
>>>>>>>> code base doesn't use the type="javascript" attribute.
>>>>>>>>
>>>>>>>> The braces should be well formed; otherwise the c# razor file wouldn't
>>>>>>>> compile. It is possible there may be an edge case which can be found
>>>>>>>> when I run it against all the files.
>>>>>>>>
>>>>>>>> I plan to use it to identify areas to refactor in the javascript/c#
>>>>>>>> razor code base and then watch it improve over time. I also thought it
>>>>>>>> would be interesting to use a concise and expressive language, J, to
>>>>>>>> measure the more verbose code base. It doesn’t need to be precise in
>>>>>>>> terms of characters. For example, it is ok if the script tag characters
>>>>>>>> are counted as long as it's consistent. I will be using it to find large
>>>>>>>> problem areas and then measure the improvement.
>>>>>>>>
>>>>>>>> I would be interested in seeing the sequential machine approach or any
>>>>>>>> other more idiomatic method than mine. I am fairly satisfied with mine.
>>>>>>>> It is fairly clear to me and can likely be extended if needed. I am
>>>>>>>> trying to use J more in my day to day and that would help me learn and
>>>>>>>> hopefully would be an interesting example for others.
>>>>>>>>
>>>>>>>> Thanks again
>>>>>>>> On Jan 11, 2014 4:11 PM, "Raul Miller" <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> I think I see how I would do that with a sequential machine. Let me
>>>>>>>>> know if you want a working example.
>>>>>>>>>
>>>>>>>>> Briefly, though, you seem to have three kinds of token pairs:
>>>>>>>>>
>>>>>>>>> @{   }
>>>>>>>>> {  }
>>>>>>>>> <script> </script>
>>>>>>>>>
>>>>>>>>> The ambiguity between the first two is problematic, in the context of
>>>>>>>>> errors, but does not matter in well formed cases. A bigger problem in
>>>>>>>>> the wild might be that you do not allow for attributes on the script
>>>>>>>>> tag.
>>>>>>>>>
>>>>>>>>> Also, you care about the number of characters between <script>
>>>>>>>>> </script> so those characters should be saved as "tokens" even if they
>>>>>>>>> are not curly braces. You care about {} between both @{ } and <script>
>>>>>>>>> </script> and outside them, and your implementation allows things like
>>>>>>>>> @{ <script> } </script>.
>>>>>>>>>
>>>>>>>>> A full wart-for-wart compatible version would be painful to write. A
>>>>>>>>> version which assumed well-formed cases would be much easier to write.
>>>>>>>>> But before thinking about coding up an implementation it's probably
>>>>>>>>> worth thinking about why you want to do this. The answer to that kind
>>>>>>>>> of question can be really interesting and can help identify which
>>>>>>>>> warts are unnecessary or possibly even detrimental.
>>>>>>>>>
>>>>>>>>> So, before I think any more about code, what are your thoughts on what
>>>>>>>>> you want to accomplish?
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Raul
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Jan 11, 2014 at 3:40 PM, Joe Bogner <[email protected]> wrote:
>>>>>>>>> > I have about 300 code files (javascript and embedded code) that I
>>>>>>>>> > want to collect some metrics on.  I've written the algorithm using an
>>>>>>>>> > imperative style. I actually wrote it first in C# and translated to J
>>>>>>>>> >
>>>>>>>>> > Here is the code (posted a link for brevity):
>>>>>>>>> >
>>>>>>>>> > J version:
>>>>>>>>> > https://gist.github.com/joebo/936ca5e2017c0a3b5c56
>>>>>>>>> >
>>>>>>>>> > C# version:
>>>>>>>>> > https://gist.github.com/joebo/e7f8e3ca7bd21117e58d
>>>>>>>>> >
>>>>>>>>> > This is what it outputs
>>>>>>>>> >
>>>>>>>>> > calc''
>>>>>>>>> > blocks 3
>>>>>>>>> > max depth 2
>>>>>>>>> > max block 113
>>>>>>>>> > scripts 2
>>>>>>>>> > max script 26
>>>>>>>>> >
>>>>>>>>> > Any suggestions on how to do it differently in J? I looked into the
>>>>>>>>> > sequential machine some but couldn't figure out how to make it work
>>>>>>>>> > (if it could) since my approach required knowledge of the brace
>>>>>>>>> > depth.
>>>>>>>>> >
>>>>>>>>> > In terms of requirements:
>>>>>>>>> > 1. Take a block of text
>>>>>>>>> > 2. Identify the code blocks in the file (start with @{ and end with } )
>>>>>>>>> > 3. Count the code blocks
>>>>>>>>> > 4. Determine the max depth of the code block
>>>>>>>>> > 5. Determine the max size of all the code blocks
>>>>>>>>> > 6. Count the javascript blocks
>>>>>>>>> > 7. Determine the max size of the javascript block
>>>>>>>>> >
>>>>>>>>> > Thanks for any feedback or input!
>>>>>>>>> >
>>>>>>>>> > Joe
>>>>>>>>> > ----------------------------------------------------------------------
>>>>>>>>> > For information about J forums see http://www.jsoftware.com/forums.htm