Re: [Jchat] feedback on parsing file approach

Joe Bogner Sun, 12 Jan 2014 11:49:16 -0800

I think I have the state table figured out. I created a little image
to help explain it.


http://imgur.com/JT7MAhj

I may post this to the wiki as an example. Thanks again



On Sun, Jan 12, 2014 at 1:49 PM, Joe Bogner <[email protected]> wrote:
> Great! Yes, that change counts the blocks the way I need it to.
>
> As you pointed out, the requirements weren't very well spec'd and led
> to ambiguities. I had thought the implementation was relatively clear
> and it wasn't a good assumption to think that it would be read. I
> think tests would have worked well to more clearly illustrate the
> expectations. I couldn't imagine writing a parser without them.
>
> Raul wrote:
>> A related issue is that your measure of "depth" did not include @{ } -
>> both immediately inside and immediately outside these tokens is depth
>> 0, if I understand properly what you meant by depth.
>
> Yes, that's fine. My goal was to measure how deep the nesting was. It
> doesn't matter to me if it's zero- or one-based.
>
> Thanks for the tips on how to incorporate logic for detecting { } in
> strings.
>
> We actually weren't that far off on understanding considering how
> little you needed to change to count blocks the way I had intended.
>
> The max blocks still isn't right, but that's OK. I will see if I can
> fix it or start writing some tests to demonstrate it better.
>
> text =: 0 : 0
> @{ if (foo) { } }
> )
>
> shows max block 2
>
> text =: 0 : 0
> @{ Response.Write("hi"); }
> )
>
> shows max block 0, which leads me to believe it needs a brace inside
> the code block to start counting. I would have assumed it would be
> some number of characters close to # ' Response.Write("hi"); '
>
> 23
>
> I should be able to figure out most of how you did it but I'm stumped
> on the State table. I think I understand classify except for one part
> (summarized below)
>
>
> classify=: 1}.Ends i. 2 {"1(5;(States,"+0);((<"+Chars),<a.-.Chars);0 _1 0 0) ;: ]
>
> Working right to left
>
> NB. I can figure this out later. Your explanation is good and the
> NB. dictionary covers it
> ijrd =. 0 _1 0 0
>
> NB. The first part makes sense, the second part looks to be a bunch of
> NB. junk characters?
> m=.((<"+Chars),<a.-.Chars)
>
> In a trivial example, it looks like it classifies the rows the same way:
>
>  m=.((<"+Chars))
>  (y i.~;m) { (#m),~(#&>m)#i.#m NB. From the dictionary entry
>
> 0 1 12 12 12 5 9 12 12 5 12 12 12 7 8 10 12 12 12 12 8 12 12 12 12 2 12
>
> m=.((<"+Chars),<a.-.Chars)
> (y i.~;m) { (#m),~(#&>m)#i.#m NB. From the dictionary entry
>
> 0 1 12 12 12 5 9 12 12 5 12 12 12 7 8 10 12 12 12 12 8 12 12 12 12 2 12
>
> What is the purpose of <a.-.Chars? Is that "every other character"
> than what was specified?
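
A small check of the question above, as a sketch rather than anything taken from the gist. It assumes Chars is '@{}</script>' (the value is not shown in this thread, but that ordering reproduces the class numbers listed above) and wraps the dictionary expression in a hypothetical helper named classes.

```j
NB. Assumed value: Chars itself is not shown in the thread.
Chars =: '@{}</script>'

NB. Hypothetical helper: the dictionary expression, with x as m and y as the text.
classes =: 4 : '(y i.~ ;x) { (#x) ,~ (#&>x) # i.#x'

m1 =: <"+ Chars                     NB. 12 boxes, one per character
m2 =: (<"+ Chars) , < a. -. Chars   NB. plus one box holding every other character

sample =: '@{ Response.Write("hi"); }',LF

(m1 classes sample) -: m2 classes sample   NB. do both mappings classify alike?
(# m1) , # m2                              NB. number of boxes in each mapping
```

With m1, a character that appears in no box falls through to the extra index #m, which is 12. With m2 the same character sits in box 12, so it also gets class 12. That suggests a. -. Chars really is "every other character", and that it makes the catch-all class explicit rather than implied.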
>
> s=:(States,"+0)
>
> This adds the 0 operation to each of the States per your earlier note
> of using 0 to no-op
>
> f=: 5
> (f;s;m;ijrd) ;: text
>
>
> NB. extract the 3rd column from the trace
> cols=:2 {"1(f;s;m;ijrd) ;: text
> 0 1 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 0 0
>
> NB. turn the col back into the token number from appendToken
> Ends i. cols
>
>
> I think I understand all of that.
>
> The State table has a shape of 21 13: the rows are one more than the
> combined length of the tokens we are looking for, and the columns are
> one more than the number of distinct characters that make up those
> tokens.
>
> # '@{}<script></script>'
> 20
>
> # Chars
> 12
>
> I took a stab at adding the character and token on the x & y axis. I
> don't think I have it lined up quite right and I'm sure it doesn't
> look great on e-mail. If you can help decrypt the table that would be
> helpful as I am not following completely what appendToken is doing to
> build it.
>
>
>
>   @ { } <  /  s  c  r  i  p  t  >
> @ 1 2 3 4  0  0  0  0  0  0  0  0 0
> { 1 2 3 4  0  0  0  0  0  0  0  0 0
> } 1 2 3 4  0  0  0  0  0  0  0  0 0
> < 1 2 3 4  0  0  0  0  0  0  0  0 0
> s 0 0 0 0 13  5  0  0  0  0  0  0 0
> c 0 0 0 0  0  0  6  0  0  0  0  0 0
> r 0 0 0 0  0  0  0  7  0  0  0  0 0
> i 0 0 0 0  0  0  0  0  8  0  0  0 0
> p 0 0 0 0  0  0  0  0  0  9  0  0 0
> t 0 0 0 0  0  0  0  0  0  0 10  0 0
> > 0 0 0 0  0  0  0  0  0  0  0 11 0
> < 1 2 3 4  0  0  0  0  0  0  0  0 0
> / 0 0 0 0  0  0  0  0  0  0  0  0 0
> s 0 0 0 0  0 14  0  0  0  0  0  0 0
> c 0 0 0 0  0  0 15  0  0  0  0  0 0
> r 0 0 0 0  0  0  0 16  0  0  0  0 0
> i 0 0 0 0  0  0  0  0 17  0  0  0 0
> p 0 0 0 0  0  0  0  0  0 18  0  0 0
> t 0 0 0 0  0  0  0  0  0  0 19  0 0
> > 0 0 0 0  0  0  0  0  0  0  0 20 0
>  1 2 3 4  0  0  0  0  0  0  0  0 0
>
> States 5,6,7,8,9,10,11 must be used to track <script>, and
> 14,15,16,17,18,19,20 do the same for </script>.
>
> Not sure why there isn't a state 12
>
> Any guidance on the table would be appreciated. This is really cool.
> Thanks again
>
>
> On Sun, Jan 12, 2014 at 12:16 PM, Raul Miller <[email protected]> wrote:
>> Ok...
>>
>> Translating what I think you are saying into implementation, I think
>> you want to change
>>
>> smoutput 'blocks ',":Left -&(+/) Codes
>>
>> to
>>
>> smoutput 'blocks ',":+/ Codes
>>
>> "Left" was bits marking left curly brackets (there were 6 in your
>> sample text) while Codes was bits marking instances of @{  (there were
>> 3 in your sample text).
>>
>> A related issue is that your measure of "depth" did not include @{ } -
>> both immediately inside and immediately outside these tokens is depth
>> 0, if I understand properly what you meant by depth.
>>
>> You will note, here, that I did not actually read your code very
>> closely - that is because I was more interested in paraphrasing it
>> than in copying it, and that means understanding what you were
>> thinking more than understanding what you implemented. We sometimes
>> approximate this process using requirements, sometimes using tests and
>> perhaps in a variety of other ways.
>>
>> Also, it might help you to understand the code better if you replaced
>> every =. in calc2 with =: (=. is great for isolating internal
>> definitions in explicit verbs, but =: is much better for making things
>> visible or ... explicit?).
>>
>> That said, we can exclude { and } which appear in irrelevant contexts
>> by first declaring what those contexts are (double quoted strings?
>> multi-line comments? single line comments?) and then adjusting the
>> definition of State to distinguish them from the recognized instances
>> of { and }.
>>
>> Let's say that I wanted to exclude { in double quoted strings. Here's
>> an outline:
>>
>> (1) Include " in the definition of Chars
>> (2) Introduce a new routine appendTokenPair which works like
>> appendToken but leaves the sequential machine in an alternate state
>> until receiving a second token.
>> (3) use this new routine to include " ... " in our definition of State.
>>
>> Once this was working, using it for /* ... */ should be trivial,
>> though the use of multi-character tokens might be an issue, depending
>> on how appendTokenPair was implemented.
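
One way to picture the goal of step (3), as a sketch only. This is not the appendTokenPair routine outlined above but a plain array-based alternative that masks characters inside double-quoted strings before computing brace depth. The names inquote and depthq are hypothetical, and escaped quotes such as \" are not handled.

```j
NB. 1 for characters inside a double-quoted string
NB. (the opening quote counts as inside, the closing quote as outside).
inquote =: 2 | [: +/\ '"'&=

NB. Running brace depth that ignores { and } inside double-quoted strings.
depthq =: 3 : 0
live =. -. inquote y
+/\ live * (y = '{') - y = '}'
)

sample =: '@{ if (Post) { Response.Write("will break a simple parser } } }} "); } }'

>./ depthq sample    NB. deepest nesting seen
{: depthq sample     NB. depth after the last character
```

For this sample the masked depth should peak at 2 and return to 0, whereas a running sum over all braces would go negative inside the string.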
>>
>> The thing you need to watch out for, when working with parsers, is
>> ambiguities. In this example, we had an ambiguity between @{ and {
>> where hypothetically speaking they might be confused. This was one of
>> my motivations for focusing on requirements instead of simply diving
>> into the implementation.
>>
>> Being able to move from implementation to specification is not easy -
>> I love focusing on the computer and I sometimes find human
>> interactions painful (I do not like bothering people and while I might
>> occasionally enjoy getting yelled at I find I need to do something to
>> please people yelling at me after - or at least something I am
>> comfortable interpreting as pleasing - going away seems to count,
>> somehow. Mostly, though, I have a lot of respect for heads-down focus,
>> even when it's taken too far.)
>>
>> Does this make sense?
>>
>> Thanks,
>>
>> --
>> Raul
>>
>>
>>
>> On Sun, Jan 12, 2014 at 11:13 AM, Joe Bogner <[email protected]> wrote:
>>> Sorry about that. My requirements assumed more contextual
>>> knowledge than they probably should have.
>>>
>>> To take a step back:
>>>
>>> In the C#/Razor template language, each code block is delimited by:
>>>
>>> @{
>>>
>>> }
>>>
>>> Within a block, you can add C# code to perform whatever functions your
>>> page needs
>>>
>>> @{
>>>        if (Post) {
>>>             Save();
>>>        } else {
>>>            DoSomethingElse();
>>>       }
>>> }
>>>
>>> A page can have multiple code blocks. And a code block can have an
>>> infinite depth of branching, denoted by { }
>>>
>>> Poor code would have many blocks, or very large blocks or very deep nesting.
>>>
>>> @{
>>>        if (Post) {
>>>             if (Monday) {
>>>                  if (After5PM) {
>>>                          if (Before8PM) {
>>>                             Save();
>>>                        }
>>>                  }
>>>             }
>>>
>>>        } else {
>>>            DoSomethingElse();
>>>       }
>>> }
>>>
>>>
>>> A code block is pairs of @{ } where } terminates after the branch
>>> level is zero. Let me know if that's not clear enough. Other
>>> templating languages like php make it easier.
>>>
>>> <?php
>>>
>>> if (Foo) {
>>>
>>> }
>>>
>>> ?>
>>> <html>foo</html>
>>>
>>>
>>> In PHP, you wouldn't need to worry about the curly brace depth for
>>> determining code block start and end. It could be split on <?php ?>
>>>
>>> In razor, the @{ is the same as <? and the } when brace depth is zero
>>> terminates the block
>>>
>>> So I don't have an exact specification that I'm working towards. I'm
>>> just trying to find out how many @{ } code blocks there are, how
>>> deeply nested the code within is, and how large the largest block is.
>>> For example, if it's more than 20 lines or X characters, it probably
>>> belongs in a separate class or file
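
The three measurements described above can be sketched directly with array operations. This is a minimal illustration, not calc2 and not the code in either gist. The names depth, starts, ends and sizes are hypothetical, and the sketch assumes well-formed input in which every { and } belongs to an @{ } block (no braces in strings or in script sections).

```j
NB. Running nesting depth: +1 at each {, _1 at each }.
depth =: [: +/\ '{'&= - '}'&=

NB. Index of each @{ and of each } that brings the depth back to zero.
starts =: [: I. '@{'&E.
ends   =: [: I. (0 = depth) *. '}'&=

NB. Characters per block, both delimiters included.
sizes =: [: >: ends - starts

s1 =: '@{ if (foo) { } }'
s2 =: '@{ Response.Write("hi"); }'

# starts s1,s2      NB. number of blocks
>./ depth s1,s2     NB. deepest nesting, counting @{ as level 1
>./ sizes s1,s2     NB. size of the largest block
```

For these two one-line samples, both quoted elsewhere in this thread, this should report 2 blocks, a maximum depth of 2 and a largest block of 26 characters. That is 3 more than the 23 interior characters of the second sample, because the @{ and } delimiters are included in the count.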
>>>
>>> Of course an edge case that would blow up would be if the code block
>>> has a brace in a string
>>>
>>> @{
>>>          if (Post) {
>>>               Response.Write("will break a simple parser } } }} ");
>>>          }
>>> }
>>> I don't think that would be extensive in this code. It's not going to
>>> be used for anything of a critical nature other than to help improve
>>> my personal code base - so if there are false positives or errors it's
>>> OK. I'm looking for a "good enough" solution.
>>>
>>> Hope that helps. Feel free to cancel if I'm not getting progressively
>>> more clear or if the problem is uninteresting to help solve.
>>>
>>> Thanks again
>>>
>>> Joe
>>>
>>>
>>>
>>> On Sun, Jan 12, 2014 at 10:56 AM, Raul Miller <[email protected]> wrote:
>>>> I quite possibly misunderstood your specifications.
>>>>
>>>> If I simply remove lines 2 and 11 from my gist, calc2 still reports
>>>> three blocks. If I also remove the three blocks which appear between
>>>> lines 2 and 11, calc2 will then report 0 blocks. Is that not what you
>>>> wanted me to count?
>>>>
>>>> Meanwhile, I do not concern myself very much with whether the
>>>> boundaries of a region of text are "inside" or "outside" that region.
>>>> Instead, I go with what seems simple to implement and then use the
>>>> requirements to tweak the code so that the result is correct. Of
>>>> course, the limitation here is that I need to understand your
>>>> requirements. Another limitation is that new requirements will require
>>>> new code (or manual work) - but that seems to me to be unavoidable.
>>>>
>>>> I expect that once we share an understanding of your requirements that
>>>> an explanation of how the code is structured will make more sense.
>>>>
>>>> Thanks,
>>>>
>>>> --
>>>> Raul
>>>>
>>>>
>>>> On Sun, Jan 12, 2014 at 6:50 AM, Joe Bogner <[email protected]> wrote:
>>>>> Thanks for the sequential machine implementation. I tested with
>>>>> different versions of the text block and it doesn't work as I
>>>>> expected, which means I either relayed the requirements wrong or there
>>>>> may be a bug
>>>>>
>>>>> For example, if I take out the first block of @{ }, it reports
>>>>>
>>>>>    calc2 text
>>>>> blocks 0
>>>>> max depth 1
>>>>> max block 25
>>>>> scripts 2
>>>>> max script 49
>>>>>
>>>>> text =: 0 : 0
>>>>> @{
>>>>> Response.Write('start');
>>>>> }
>>>>> <html>
>>>>> <script>
>>>>> alert('start');
>>>>> </script>
>>>>> <div id='Foo'>@Page.Foo</div>
>>>>> <script>
>>>>> alert($('#Foo').val());
>>>>> </script>
>>>>>
>>>>> </html>
>>>>> @{
>>>>> Response.Write('bye');
>>>>> }
>>>>> )
>>>>>
>>>>> My implementation posts the correct answer of two blocks - each pair
>>>>> of @{ and the } that gets back to indent = 0.
>>>>>
>>>>> It looks like yours requires possibly a brace in the block to trigger
>>>>> it as a code block.  It also seems to be summing up the total amount
>>>>> of code and script characters instead of finding the largest one.
>>>>>
>>>>> The Trace looks helpful to debug.
>>>>>
>>>>> I've read through the dictionary and nuvoc a few times for sequential
>>>>> machine and I don't understand it well enough to help troubleshoot
>>>>> your implementation. I'll spend more time with it. I didn't want to go
>>>>> down that rabbit hole until I was sure it could provide a correct
>>>>> result.
>>>>>
>>>>> I thought about posting to programming but wasn't sure how
>>>>> philosophical it would get. Probably better to have started there and
>>>>> then migrate here if it was philosophical. Feel free to move it to
>>>>> programming since we're now on the details of the sequential machine
>>>>> implementation.
>>>>>
>>>>> Thanks again. I appreciate the opportunity to learn.
>>>>>
>>>>> On Sat, Jan 11, 2014 at 10:16 PM, Raul Miller <[email protected]> wrote:
>>>>>> Here's a draft that uses ;:
>>>>>>
>>>>>> https://gist.github.com/rdm/8380234
>>>>>>
>>>>>> (As an aside, perhaps this thread should be on programming? Or at
>>>>>> least, something to think about for next time...)
>>>>>>
>>>>>> Note that I get different character counts than you. Maybe I
>>>>>> misunderstood what you intended to count?
>>>>>>
>>>>>> Let me know if you want me to clarify or rewrite any of that.
>>>>>>
>>>>>> But, briefly, I am using the final states from a ;: trace to mark the
>>>>>> end of each "token" and then classifying the text based on that
>>>>>> analysis. Since this sequential machine is a bit bulky, I decided to
>>>>>> write a small application to build it rather than constructing it by
>>>>>> hand. Since I only care about the state trace, I use no-op for all
>>>>>> operations. Since I want the end state, I use 0 _1 0 0 for ijrd
>>>>>> instead of the default 0 _1 0 _1. This leaves me with my final state
>>>>>> being the "character position" after the last character in text (and
>>>>>> it's reported in the trace rather than being an error condition).
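
A toy version of the idea described above may help, as a sketch under assumptions rather than the code in the gist. It recognizes only the token @{ and uses hypothetical names (m, T, s). Column 2 of the trace is assumed to hold the state r at the start of each step, as the dictionary entry for ;: describes.

```j
NB. Character classes: 0 is '@', 1 is '{', 2 is everything else.
m =: (<"+ '@{') , < a. -. '@{'

NB. Rows are states, columns are classes, entries are next states.
NB. State 1 means "just saw @", state 2 means "just finished @{".
T =: 3 3 $ 1 0 0   1 2 0   1 0 0

NB. Pair every next state with operation 0 (no-op).
s =: T ,"+ 0

NB. f=5 asks for a trace; d=0 adds a final step after the last character.
trace =: (5 ; s ; m ; 0 _1 0 0) ;: 'a@{b'

NB. Dropping the first row of column 2 leaves the state reached
NB. after each character of the text.
1 }. 2 {"1 trace
```

If the machine behaves as described, the state after the { is 2, which marks the end of the @{ token. A list of such token-ending states is what makes it possible to classify each character position by the token that ends there.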
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> --
>>>>>> Raul
>>>>>>
>>>>>> On Sat, Jan 11, 2014 at 4:47 PM, Joe Bogner <[email protected]> wrote:
>>>>>>> Thank you for the thoughts. You summarized it well.
>>>>>>>
>>>>>>> I don't need to worry about attributes on the script tag for this use
>>>>>>> case. I am interested in quantifying how much embedded javascript is
>>>>>>> in each of the pages. I don't need to quantify external scripts. I
>>>>>>> know the code base doesn't use the type="javascript" attribute.
>>>>>>>
>>>>>>> The braces should be well formed, otherwise the c# razor file wouldn't
>>>>>>> compile. It is possible there may be an edge case which can be found
>>>>>>> when I run it against all the files.
>>>>>>>
>>>>>>> I plan to use it to identify areas to refactor in the javascript/c#
>>>>>>> razor code base and then watch it improve over time. I also thought it
>>>>>>> would be interesting to use a concise and expressive language, J, to
>>>>>>> measure the more verbose code base. It doesn't need to be precise in
>>>>>>> terms of characters. For example, it is ok if the script tag
>>>>>>> characters are counted as long as it's consistent. I will be using it
>>>>>>> to find large problem areas and then measure the improvement.
>>>>>>>
>>>>>>> I would be interested in seeing the sequential machine approach or any
>>>>>>> other more idiomatic method than mine. I am fairly satisfied with
>>>>>>> mine. It is fairly clear to me and can likely be extended if needed. I
>>>>>>> am trying to use J more in my day to day and that would help me learn
>>>>>>> and hopefully would be an interesting example for others.
>>>>>>>
>>>>>>> Thanks again
>>>>>>> On Jan 11, 2014 4:11 PM, "Raul Miller" <[email protected]> wrote:
>>>>>>>
>>>>>>>> I think I see how I would do that with a sequential machine. Let me
>>>>>>>> know if you want a working example.
>>>>>>>>
>>>>>>>> Briefly, though, you seem to have three kinds of token pairs:
>>>>>>>>
>>>>>>>> @{   }
>>>>>>>> {  }
>>>>>>>> <script> </script>
>>>>>>>>
>>>>>>>> The ambiguity between the first two is problematic, in the context of
>>>>>>>> errors, but does not matter in well formed cases. A bigger problem in
>>>>>>>> the wild might be that you do not allow for attributes on the script
>>>>>>>> tag.
>>>>>>>>
>>>>>>>> Also, you care about the number of characters between <script>
>>>>>>>> </script> so those characters should be saved as "tokens" even if they
>>>>>>>> are not curly braces. You care about {} between both @{ } and <script>
>>>>>>>> </script> and outside them, and your implementation allows things like
>>>>>>>> @{ <script> } </script>.
>>>>>>>>
>>>>>>>> A full wart-for-wart compatible version would be painful to write. A
>>>>>>>> version which assumed well-formed cases would be much easier to write.
>>>>>>>> But before thinking about coding up an implementation it's probably
>>>>>>>> worth thinking about why you want to do this. The answer to that kind
>>>>>>>> of question can be really interesting and can help identify which
>>>>>>>> warts are unnecessary or possibly even detrimental.
>>>>>>>>
>>>>>>>> So, before I think any more about code, what are your thoughts on what
>>>>>>>> you want to accomplish?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> --
>>>>>>>> Raul
>>>>>>>>
>>>>>>>>
>>>>>>>> On Sat, Jan 11, 2014 at 3:40 PM, Joe Bogner <[email protected]> wrote:
>>>>>>>> > I have about 300 code files (javascript and embedded code) that I
>>>>>>>> > want to collect some metrics on.  I've written the algorithm using
>>>>>>>> > an imperative style. I actually wrote it first in C# and translated
>>>>>>>> > it to J.
>>>>>>>> >
>>>>>>>> > Here is the code (posted a link for brevity):
>>>>>>>> >
>>>>>>>> > J version:
>>>>>>>> > https://gist.github.com/joebo/936ca5e2017c0a3b5c56
>>>>>>>> >
>>>>>>>> > C# version:
>>>>>>>> > https://gist.github.com/joebo/e7f8e3ca7bd21117e58d
>>>>>>>> >
>>>>>>>> > This is what it outputs
>>>>>>>> >
>>>>>>>> > calc''
>>>>>>>> > blocks 3
>>>>>>>> > max depth 2
>>>>>>>> > max block 113
>>>>>>>> > scripts 2
>>>>>>>> > max script 26
>>>>>>>> >
>>>>>>>> > Any suggestions on how to do it differently in J? I looked into the
>>>>>>>> > sequential machine some but couldn't figure out how to make it work
>>>>>>>> > (if it could) since my approach required knowledge of the brace
>>>>>>>> > depth.
>>>>>>>> > In terms of requirements:
>>>>>>>> > 1. Take a block of text
>>>>>>>> > 2. Identify the code blocks in the file (start with @{ and end with } )
>>>>>>>> > 3. Count the code blocks
>>>>>>>> > 4. Determine the max depth of the code block
>>>>>>>> > 5. Determine the max size of all the code blocks
>>>>>>>> > 6. Count the javascript blocks
>>>>>>>> > 7. Determine the max size of the javascript block
>>>>>>>> >
>>>>>>>> > Thanks for any feedback or input!
>>>>>>>> >
>>>>>>>> > Joe
>>>>>>>> > ----------------------------------------------------------------------
>>>>>>>> > For information about J forums see 
>>>>>>>> > http://www.jsoftware.com/forums.htm
