Re: pygments: get regex patterns for particular token types

Tim Hatch Sat, 30 Jul 2011 15:45:20 -0700

On 7/20/11 9:10 AM, Adam wrote:
> I am trying to use the lexers in pygments to extract comments in code
> files. I can filter out just the comments I want successfully, but the
> comment tokens always include the respective language's comment syntax
> (eg for Python, I'm looking for hash-prefix comments and doc strings).
> I would ideally like to be able to reverse lookup the regex pattern
> for a given token definition so that I can use this regex to strip the
> comment syntax from the comment tokens (so that I'm left with only the
> comment text itself).


If you're trying to make this a general solution, you also need to
handle more complex comment punctuation, e.g. Javadoc

/**
 ** foo
 **/

and OCaml's nested comments

(* comment
   (* ex *)
*)
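
To illustrate why nesting needs more than one flat pattern, here is a
minimal sketch in plain Python (not Pygments) that tracks comment depth
by hand. The function name is made up for this example, and it
deliberately ignores string literals and other OCaml lexical details:

```python
def outer_ocaml_comments(source):
    """Return the bodies of top-level (* ... *) comments.

    Nested comments stay inside the body of their enclosing comment.
    """
    bodies = []
    depth = 0      # how many "(*" are currently open
    start = None   # offset just past the outermost "(*"
    i = 0
    while i < len(source):
        pair = source[i:i + 2]
        if pair == '(*':
            if depth == 0:
                start = i + 2
            depth += 1
            i += 2
        elif pair == '*)' and depth > 0:
            depth -= 1
            if depth == 0:
                bodies.append(source[start:i])
            i += 2
        else:
            i += 1
    return bodies
```

A non-greedy regex such as `\(\*.*?\*\)` would stop at the first `*)`,
which for the example above is the end of the inner comment.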

> I have tried writing a function to iterate over the token definitions
> to look up this information. I first managed to work around (I think)
> the include capabilities of token definitions, by diving into a
> recursive lookup when an include type is found. However I am now
> stumped by the callback functions (using, bygroups). I could probably
> figure out a way to overcome this hurdle but I'm also guessing there
> should be a way to get all the contents of the token definitions for a
> lexer as evaluated tuples.

For languages whose comments Pygments matches all at once, the simplest
thing is to modify the pattern to use groups.  Picking on C-style:

(r'(/\*)(.*?)(\*/)',
 bygroups(Comment.Prelude, Comment.Data, Comment.Postlude)),

Then just check for the type you want, Comment.Data:

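from pygments.lexers import PythonLexer
from pygments.token import Comment

# `text` is the source code to scan.  get_tokens_unprocessed is an
# instance method and yields (index, tokentype, value) triples.
for index, t, data in PythonLexer().get_tokens_unprocessed(text):
    if t in Comment.Data:
        print("Got comment", repr(data))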
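
Putting the group-based pattern and the loop together, a self-contained
sketch might look like the following. The lexer class and helper name
are hypothetical, it only knows about /* ... */ comments, and
Comment.Prelude, Comment.Data and Comment.Postlude are ad hoc subtypes
created on first access rather than standard Pygments token types:

```python
import re

from pygments.lexer import RegexLexer, bygroups
from pygments.token import Comment, Text


class CommentSplitLexer(RegexLexer):
    """Toy lexer (hypothetical): splits /* ... */ into three tokens."""
    name = 'CommentSplit'
    # DOTALL lets '.' cross newlines so multi-line comments match.
    flags = re.MULTILINE | re.DOTALL

    tokens = {
        'root': [
            (r'(/\*)(.*?)(\*/)',
             bygroups(Comment.Prelude, Comment.Data, Comment.Postlude)),
            (r'[^/]+', Text),  # anything that cannot start a comment
            (r'/', Text),      # a lone slash
        ],
    }


def comment_bodies(text):
    """Return only the text between the comment delimiters."""
    lexer = CommentSplitLexer()
    return [value
            for _, ttype, value in lexer.get_tokens_unprocessed(text)
            if ttype in Comment.Data]
```

For a real language you would make the same change to a copy or
subclass of the existing lexer instead of writing one from scratch.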

For more complex comments that involve states, you should be able to
adapt them in a similar way, something like

  (r'/\*', Comment.Prelude, 'comment'),
...
'comment': [
  (r'\*/', Comment.Postlude, '#pop'),
  (r'[^\*]+', Comment.Data),
  (r'\*', Comment.Data),
],
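
As a runnable sketch of the state-based variant, assuming a toy lexer
that only understands /* ... */ (the class and helper names are made up
for the example, and the Comment subtypes are again ad hoc):

```python
from pygments.lexer import RegexLexer
from pygments.token import Comment, Text


class StatefulCommentLexer(RegexLexer):
    """Toy lexer (hypothetical): comment body handled in its own state."""
    name = 'StatefulComment'

    tokens = {
        'root': [
            (r'/\*', Comment.Prelude, 'comment'),  # enter the comment state
            (r'[^/]+', Text),
            (r'/', Text),
        ],
        'comment': [
            (r'\*/', Comment.Postlude, '#pop'),    # leave the comment state
            (r'[^*]+', Comment.Data),
            (r'\*', Comment.Data),                 # a star that is not '*/'
        ],
    }


def comment_fragments(text):
    """Return the Comment.Data values in the order they are emitted."""
    lexer = StatefulCommentLexer()
    return [value
            for _, ttype, value in lexer.get_tokens_unprocessed(text)
            if ttype in Comment.Data]
```

With this lexer, comment_fragments("a /* x * y */ b") gives
[' x ', '*', ' y ']: the body arrives in pieces, one per rule match.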

I don't remember off the top of my head whether get_tokens_unprocessed
will coalesce adjacent tokens of the same type, but that's not too
difficult to do in your loop (especially if you're already keeping track
of the position there).
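
One way to do that merging, sketched with itertools.groupby from the
standard library. The helper name is hypothetical; it accepts any
iterable of (index, tokentype, value) triples, and it assumes that
adjacent tokens of the same type really belong together:

```python
from itertools import groupby


def coalesce(tokens):
    """Merge runs of adjacent tokens that share a token type.

    Yields (index, tokentype, value) triples, where index is the
    position of the first token in each run.
    """
    for ttype, run in groupby(tokens, key=lambda tok: tok[1]):
        run = list(run)
        yield run[0][0], ttype, ''.join(value for _, _, value in run)
```

Usage would be along the lines of
coalesce(lexer.get_tokens_unprocessed(text)), filtering for the type
you want afterwards.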

If this doesn't help, perhaps more info on your end goal would be useful.

Tim
