You may prefer to go with procedural parsing, but I worked up a grammar-driven solution which I've just pasted here: http://scsys.co.uk:8002/370004

It tests OK AFAIK. The strategy was to, first, use the lexer to slurp up "simple comments" -- those with no interior equal signs or hash signs. Second, to make tag names and base declarations also lexemes. LATM guarantees tag names and base declarations will only be accepted in the correct context. Finally, there is a fallback 'ComplexComment' which is lexed character-by-character.

Note this plays along with the lexer's greediness. What long stuff can be safely slurped up in the lexer, I slurp up first. Then I go for the specifics. Finally, I have a fallback which, if nothing else works, will eat up the comment character-by-character. It uses lexemes one character long, guaranteeing that it will only be used as the last resort.

The character-by-character fallback will be slow if used a lot, but it should rarely be used. This does not exhaust the possible tricks, by any means, but I think it parses everything correctly and in what I expect will be reasonable time.
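
For readers without access to the pasted grammar, the longest-match idea described above can be sketched outside Marpa. This is only an illustration in Python, not the pasted grammar: the token names and regular expressions are assumptions made up for the example, and it does not model LATM's context checking at all. It only shows why a one-character fallback ends up as the last resort when the longest match wins.

```python
import re

# Hypothetical token patterns, ordered only to break ties between equal lengths.
# None of these are taken from the pasted grammar.
TOKEN_PATTERNS = [
    ("BaseDecl", re.compile(r"#base=\d+(?:,\d+)*")),
    ("TagStr", re.compile(r"#\w+(?:,\w+)*#")),
    # A "simple" comment runs to the end of the line with no interior '#' or '='.
    ("SimpleComment", re.compile(r"#[^#=]*$")),
    ("Space", re.compile(r"[ \t]+")),
    # One-character fallback: it can only win when nothing longer matches.
    ("CommentChar", re.compile(r".")),
]


def tokenize(line):
    """Split one line's comment region into (name, text) pairs by longest match."""
    pos = 0
    tokens = []
    while pos < len(line):
        best_name, best_len = None, 0
        for name, pattern in TOKEN_PATTERNS:
            match = pattern.match(line, pos)
            if match and len(match.group()) > best_len:
                best_name, best_len = name, len(match.group())
        if best_name is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best_name, line[pos:pos + best_len]))
        pos += best_len
    return tokens


if __name__ == "__main__":
    for sample in ("#HOT# # Not so hot", "###### END OF CHECKERBOARD #####"):
        print(sample, "->", tokenize(sample))
```

In this sketch the first sample splits into a tag, a space, and a simple comment, while the checkerboard line falls through to the one-character tokens for most of its length.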

-- jeffrey

On 05/09/2014 02:40 PM, [email protected] wrote:
As a matter of fact, I cannot. I have seen actual cases of '#' characters in comments. They've confused some other code unrelated to mine as well.

I had hoped the strict format of a tag string or the embedded base number string would be sufficient to differentiate them from the random comments. It's appearing that's not really the case. Perhaps my only real choice is to pause when I find the initial '#' character and parse the comment myself. Telling a tag string or an embedded base number from a comment is trivial but I don't know how you determine context in that case as both are location sensitive. As a matter of fact, a real comment embedded in one of the files looks like this:

###### END OF CHECKERBOARD #####

On Friday, May 9, 2014 1:41:14 PM UTC-7, Jeffrey Kegler wrote:

    One question: can you rely on a non-tag comment not containing a
    hash?  That is, can you rely on there being nothing like
    PList file1.plist:plist3; # An extra hash # as if life was not
    already too difficult

    in the data?  If so, you can treat a hash ('#') as something that
    ends a comment, in addition to newlines, and that will be a big
    step forward.

    -- jeffrey

    On 05/09/2014 01:08 PM, [email protected] wrote:
    Yeah, that line is definitely the problematic line.  It's also
    the reason I'm rebuilding the parser from my current line by line
    methodology.  Or attempting to :)  I actually wrote this grammar
    up in Regexp::Grammars first, but the resource requirements were
    far too high.  I figured I'd take the time to learn Marpa as the
    capabilities and performance seem more in line with what I needed.

    I believe event parsing the comments myself might be the way to
    go.  I was also reading ranking documentation this morning, but I
    didn't get a good handle on it at all.  Maybe I'll play with it
    and see what happens.

    Thanks for your time and insight here Jeffrey, I appreciate it :)

    On Friday, May 9, 2014 12:55:07 PM UTC-7, Jeffrey Kegler wrote:

        I just took a second look at this one:

        GlobalPList plist4 { Pat n8000000g0000008; #KEEP# } }

        Ouch!  The solution in the face of stuff like this may be to
        not treat comments at the lexical level, but at the G1
        level.  That is, treat the '#',  ',', tags, etc. as lexemes
        and parse comments as if they were statements.  In your
        situation, that seems in effect to be the case.  Your
        comments seem to have more structure and variety than some of
        the "statements".  They are not just whitespace equivalents.

        At the G1 level you can use the rule "rank" adverb
        (https://metacpan.org/pod/distribution/Marpa-R2/pod/Scanless/DSL.pod#rank),
        and Marpa can help with the internal semantics of the comments, etc.

        I notice, by the way, that my documentation of the "rank"
        adverb could be improved.

        -- jeffrey

        On 05/09/2014 12:09 PM, [email protected] wrote:
        You have the right idea. Unfortunately, I do not get to
        dictate the syntax of this file I get to parse and there is
        considerable ambiguity in comments.  There are essentially
        three forms of a comment.  Two forms of this comment include
        information I need to parse.  One form (non-information
        comment) does not contain useful information.

        1) embedded base number --> Matches OptEmbeddedBase -->
        Actual information I need.  Discernible from a
        non-information comment by its location immediately after
        the opening of a pattern list brace and that it must contain
        '#base=<list>', where <list> is a comma delimited list of
        integers.

        2) tag string --> Matches TagStr --> Again, information I
        need.  Discernible from a non-information comment by
        location after a pattern declaration and by the fact that it
        is bookended by '#' symbols and can only contain a comma
        delimited list of word (\w) characters. Technically,
        whitespace is not allowed inside these strings either.  I
        figured I'd sort that out once I had it matching as is.

        3) Non-information comment --> Matches COMMENT --> Can be
        discarded.  This is any comment that does not match one of
        the first two forms.

        Hopefully that's helpful.  When you say that you'd 'simply
        say that in the grammar', I'm confused.  Is this not what
        I'm saying in the grammar in the TagStr rule by setting '#'
        characters before and after the TagList rule? Is there a
        better way to resolve this ambiguity?
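
The three comment forms described in this message can be sketched as a small context-aware classifier. This is a rough Python illustration, not part of the grammar under discussion: the context labels ("after_open_brace", "after_pattern"), the function name, and the exact regular expressions are assumptions. In particular it assumes the base declaration or tag string sits at the very start of the comment text and is followed by whitespace or the end of the line.

```python
import re

# Assumed shapes for the two informative comment forms.
BASE_RE = re.compile(r"#base=(\d+(?:,\d+)*)(?=\s|$)")
TAG_RE = re.compile(r"#(\w+(?:,\w+)*)#(?=\s|$)")


def classify_comment(text, context):
    """Classify comment text given where it appears.

    context is a hypothetical label supplied by the caller:
    "after_open_brace", "after_pattern", or anything else.
    """
    text = text.strip()
    if context == "after_open_brace":
        match = BASE_RE.match(text)
        if match:
            return ("base", [int(n) for n in match.group(1).split(",")])
    elif context == "after_pattern":
        match = TAG_RE.match(text)
        if match:
            return ("tags", match.group(1).split(","))
    # Anything else is a non-information comment and can be discarded.
    return ("comment", None)


if __name__ == "__main__":
    print(classify_comment("#base=10,20", "after_open_brace"))
    print(classify_comment("#HOT# # Not so hot", "after_pattern"))
    print(classify_comment("#KEEP#", "elsewhere"))
```

The point of passing the context in explicitly is that the same text, such as "#KEEP#", is a tag string in one location and an ordinary comment in another.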

        On Friday, May 9, 2014 11:46:16 AM UTC-7, Jeffrey Kegler wrote:

            Trying to get the idea, is it that tags use '#' as a
            delimiter, much in the same way that strings use quotes?
            And that it's a comment if there's a '#' that is not
            matched before the newline?  That is, that in

                 Pat n2000000g0000002; #HOT# # Not so hot

            "#HOT#" is a tag, and "# Not so hot" is a comment?

            If that's the case, I'd simply say that in the grammar.
            I'd give more detail, but I'm not 100% clear on the intent
            at this point.

            -- jeffrey

-- You received this message because you are subscribed to the
        Google Groups "marpa parser" group.
        To unsubscribe from this group and stop receiving emails
        from it, send an email to [email protected].
        For more options, visit https://groups.google.com/d/optout
        <https://groups.google.com/d/optout>.
