branch: externals/matlab-mode
commit d09c0bd8760a8758c68669dece654455d41bf7fb
Author: John Ciolfi <john.ciolfi...@gmail.com>
Commit: John Ciolfi <john.ciolfi...@gmail.com>

    treesit-mode-how-to.org: updated
---
 contributing/treesit-mode-how-to.org | 132 +++++++++++++++++++----------------
 1 file changed, 70 insertions(+), 62 deletions(-)

diff --git a/contributing/treesit-mode-how-to.org 
b/contributing/treesit-mode-how-to.org
index 8c9a31272e..47f3038be4 100644
--- a/contributing/treesit-mode-how-to.org
+++ b/contributing/treesit-mode-how-to.org
@@ -35,6 +35,14 @@
 #+author: John Ciolfi
 #+date: Sep-5-2025
 
+I created this guide while developing matlab-ts-mode, a [[https://tree-sitter.github.io/tree-sitter/][tree-sitter]] powered mode for [[https://www.mathworks.com][MATLAB]]. I
+tried to make this guide general so it could be reused for development of other languages. Perhaps
+the guide could be integrated into the Emacs documentation?
+
+I developed matlab-ts-mode using Emacs 30. The more I learned about tree-sitter, the more I liked
+it. I was very impressed with the quality of tree-sitter itself, and the integration of
+tree-sitter in Emacs is exceptional.
+
 * What does tree-sitter provide?
 
 Tree-sitter provides a parse tree for your language in real-time. The 
tree-sitter parser for your
@@ -58,21 +66,29 @@ languages like C/C++, LSP parses the include headers so it 
can provide go-to def
 references, diagnostics warning and error messages, and similar capabilities. 
These LSP capabilities
 are not provided by tree-sitter, nor does it make sense for tree-sitter to 
provide them. It makes
 perfect sense that Emacs provides both tree-sitter and LSP because they both 
provide complementary
-capabilities for coding. There is a little overlap between LSP and tree-sitter 
in that both can
-provide indentation (code formatting) and semantic coloring. The advantage of 
tree-sitter is that it
-is faster, more accurate in context of syntax errors, and works without 
requiring the concept of a
-project. You can open a source file from anywhere and tree-sitter can 
semantically color it, indent
-it, etc.
+capabilities for coding.
+
+There is a small amount of overlap between LSP and tree-sitter in that both can provide indentation
+(code formatting) and semantic highlighting. The advantage of tree-sitter is that it is faster and
+gives more accurate indentation as you type. Another bonus is that tree-sitter works without requiring a
+project or other setup to get things going. LSP typically requires the concept of a project
+so it can parse your code. With tree-sitter, you can open a source file from anywhere and
+tree-sitter can semantically color it, indent it, etc.
+
+Try using LSP for syntax highlighting or code indentation on a large file where you type at a
+productive speed of 40-75 words per minute. The experience will be less than ideal. Now try that
+where syntax highlighting and code indentation are powered by tree-sitter. You'll be pleasantly
+surprised how good tree-sitter is. The editor will be much smoother with higher-quality syntax
+highlighting and code indentation. You will spend much less time having to adjust whitespace to
+make your code look good because the indentation as you type is much better.
 
 * Guide to building a tree-sitter mode
 
-This guide to building a *LANGUAGE-ts-mode* for /file.lang/ files was written 
using Emacs 30.1.
-
-In creating a tree-sitter mode for a programming language, you have two 
options. You can leverage an
-old-style existing mode via =(define-derived-mode LANGUAGE-ts-mode 
OLD-LANGUAGE-mode "LANGUAGE"
-...)= and then override items such as font-lock and indent. The other approach 
is to create a new
-LANGUAGE-ts-mode based on prog-mode which we recommend. Taking this approch 
eliminates unnecessary
-coupling between the old-style mode and the new tree-sitter mode.
+In creating a tree-sitter mode, *LANGUAGE-ts-mode* for /file.lang/ files, you 
have two options. You
+can leverage an old-style existing mode via =(define-derived-mode 
LANGUAGE-ts-mode OLD-LANGUAGE-mode
+"LANGUAGE" ...)= and then override items such as font-lock and indent. The 
other approach is to
+create a new LANGUAGE-ts-mode based on prog-mode which we recommend. Taking 
this approach eliminates
+unnecessary coupling between the old-style mode and the new tree-sitter mode.
 
 #+begin_src emacs-lisp
  (define-derived-mode LANGUAGE-ts-mode prog-mode "LANGUAGE" ...)
@@ -106,7 +122,7 @@ example, when writing a font-lock test, you provide the 
=file.lang= and run the
 see there is no expected baseline to compare against, so it will generate one 
for you and ask you to
 validate it. The expected baseline for =file.lang= is =file_expected.txt= and the contents of the
 =file_expected.txt= is of the same length as =file.lang=, where each character's face is encoded in a
-single character. This makes it very easy to lock down the behavior of 
font-lock without having to
+single character. This makes it very easy to lock down the behaviour of 
font-lock without having to
 write lisp code to add the expected results of the test. The same test 
strategy is used for other
 aspects of our =LANGUAGE-ts-mode=.
 
@@ -322,7 +338,7 @@ This will display messages of the following form which can 
be helpful in debuggi
 
  : Fontifying text from START-POINT to END-POINT, Face: FACE, Node: TYPE
 
-Another debugging tip, is to use the =%S= format specifier in calls to message 
which displays the
+Another debugging tip is to use the =%S= format specifier in calls to message 
which displays the
 lisp object representation.  For example, in our defun 
LANGUAGE-ts-mode--comment-to-do-capture, we
 could add =(message "debug comment-node: %S" comment-node)= which will show 
what it's processing.
 Using EDebug on font-lock functions can be tricky because they get called on 
display updates.
@@ -426,7 +442,7 @@ a unique string to start the comments, so they are 
searchable.
 The =treesit-font-lock-feature-list= contains four sublists where the first 
sublist is font-lock
 level 1, and so on. Each sublist contains a set of feature names that correspond to the =:feature
 'NAME= entries in =LANGUAGE-ts-mode--font-lock-settings=.  For example, 
='comment= for comments,
-='definition= for function and other definitions, ='keyword= for language 
keywords, etc. Font-lock
+='definition= for function and similar definitions, ='keyword= for language keywords, etc. Font-lock
 applies the faces defined in each sublist up to and including 
`treesit-font-lock-level', which
 defaults to 3. If you'd like to have your font-lock default to level 4, add:
 
@@ -669,7 +685,7 @@ If you look at the definition of parent-is, you'll see it 
leverages =string-matc
 matching against =(treesit-node-type parent-node)=.  Therefore, to be precise, 
we match using the
 start of the string, =bos=, and end of string, =eos=.  If your nodes are 
unique enough, you can
 leave off the =bos= and =eos=, but that could be troublesome if the grammar is 
updated. For example,
-suppose you have a "function" node and you match using =(parent-is 
"function")=, then the grammar is
+suppose you have a "function" node, and you match using =(parent-is 
"function")=, then the grammar is
 updated to have regular "function" nodes and "function2" nodes where you want a different font for
 "function2".  The =(parent-is "function")= will match both. Therefore, we 
recommend being precise
 when matching which will also give a slight boost in performance.
@@ -766,7 +782,7 @@ the rules, it is good to lock down expected behavior with 
tests.
 
 *** Setup: Indent Considerations
 
-1. Indent rules maybe easy to define using the treesit package pre-defined 
matchers and anchors
+1. Indent rules may be easy to define using the treesit package pre-defined 
matchers and anchors
    when there are no syntax errors.
 
 2. It is a good idea to ensure that indent works well when there are syntax errors thus giving
@@ -961,8 +977,8 @@ The commands are executed and recorded. The recorded 
results are compared agains
 
  : =./tests/test-matlab-ts-mode-indent-xr-files/indent_cell1_expected.org=
 
-If the baseline doesn't exist or result doesn't match the baseline, the test 
fails and
-the following tilde file is created:
+If the baseline doesn't exist or the result doesn't match the baseline, the 
test fails, and the
+following tilde file is created:
 
  : =./tests/test-matlab-ts-mode-indent-xr-files/indent_cell1_expected.org~=
 
@@ -970,40 +986,39 @@ You can then rename the tilde file to 
=indent_cell1_expected.org= or fix the cod
 
 ** Sweep test: Indent
 
-We define a sweep test to be a test that tries an action on a large number of 
files and reports
-issues it finds.  Sweep tests differ from classic baseline tests such as the 
above where we run
-functions and check the result for correctness.  A sweep test of indent on 
many thousands of
-LANGUAGE files cannot check the result of each individual indent because there 
is no baseline
-results for each file. However, a sweep test can check for asserts, unexpected 
errors, and slow
-indents. It can also check for invalid parse trees reported by the LANGUAGE 
tree-sitter if you have
-an external command that can check for syntax errors in your LANGUAGE files.
+We define a sweep test to be a test that tries an action on many files and 
reports issues it finds.
+Sweep tests differ from classic baseline tests such as the above where we run 
functions and check
+the result for correctness.  A sweep test of indent on many thousands of LANGUAGE files cannot check
+the result of each individual indent because there are no baseline results for each file. However, a
+sweep test can check for asserts, unexpected errors, and slow indents. It can 
also check for invalid
+parse trees reported by the LANGUAGE tree-sitter if you have an external 
command that can check for
+syntax errors in your LANGUAGE files.
 
 Our indent sweep test takes a directory and runs indent-region on all LANGUAGE files under the
 directory recursively.
 
- - If the parse tree indicates an error, we call the external syntax checker 
to double
-   check that the file does indeed have a syntax error. If the external 
checker says the
-   file does not have a syntax error, we report the file and this is likely a 
bug in
-   the LANGUAGE tree-sitter parser.
+ - If the parse tree indicates an error, we call the external syntax checker 
to double check that
+   the file does indeed have a syntax error. If the external checker says the 
file does not have a
+   syntax error, we report the file, and this is likely a bug in the LANGUAGE 
tree-sitter parser.
 
- - If check-valid-parse below is t the test will call syntax checker on all 
files being
-   processed to verify that the a successful tree-sitter parse also has no 
errors according to
-   syntax checker. Any inconsistent parses are reported which is likely a bug 
in the
-   tree-sitter parser.
+ - If check-valid-parse below is t, the test will call the syntax checker on all files being processed to
+   verify that a successful tree-sitter parse also has no errors according to the
+   syntax checker. Any inconsistent parses are reported, which likely indicates a bug in the tree-sitter
+   parser.
 
- - Next, =indent-region= is run on the file in a temporary buffer. The time it 
takes is
-   recorded in a table.  The slowest indents are reported.  If you see slow 
indents, there
-   could be bugs in your tree-sitter parser.
+ - Next, =indent-region= is run on the file in a temporary buffer. The time it 
takes is recorded and
+   the slowest indents are reported.  If you see slow indents, there could be 
bugs in your
+   tree-sitter parser.
 
- - If =indent-region= errors out, then that is also reported.  For example, 
suppose we write a
+ - If =indent-region= generates errors, then they are also reported.  For example, suppose we write a
    lambda indent MATCHER that contains
 
     : (string-match-p my-node-regexp (treesit-node-type (treesit-node-prev-sibling parent)))
 
    In our classic test things work fine because our test has a parent with a 
previous
-   sibling. However, we may have missed that parent may not have a previous 
sibling. A sweep of a
-   large number of LANGUAGE files has good probability of hitting this. If 
parent doesn't have a
-   previous sibling, we'll get "error (void-function string-match-p)."
+   sibling. However, we may have missed that parent may not have a previous 
sibling. A sweep of many
+   LANGUAGE files has a good probability of hitting this. If parent doesn't 
have a previous sibling,
+   we'll get "error (void-function string-match-p)."
 
 Our indent sweep test:
 
@@ -1268,7 +1283,7 @@ Syntactic expressions, s-expressions, or simply sexp 
commands operate on /balanc
 expressions/. Strings are naturally balanced expressions because they start 
and end with some type
 of quote character. Likewise brackets =[ items ]= and braces ={ items }= are 
typically balanced
 expressions because they have open and close characters. Some languages have 
keywords expressions
-that have a starting keyword and an ending keyword. For example "if" could be 
paired with a closing
+that have a starting keyword and an ending keyword. For example, "if" could be 
paired with a closing
 "end" keyword. s-expressions can span multiple lines. s-expressions can be 
nested. These commands
 leverage ='sexp= and ='text= things:
 
@@ -1369,7 +1384,7 @@ behavior because one can then fix the syntax behaviors by 
adding appropriate str
 continuations. There's no way to alter the string filling behavior besides 
using defadvice, which
 you should not do.
 
-If your syntax table correctly identifies comments and strings, then it M-q 
just works, though you
+If your syntax table correctly identifies comments and strings, then =M-q= 
just works, though you
 should still add tests to validate it works.  If you'd like tree-sitter nodes 
other than comments
 and strings to be filled like plain text, you should add a =text= entry to 
=treesit-thing-settings=,
 e.g. if nodeName1 and nodeName2 should be filled like plain text, use:
@@ -1546,8 +1561,8 @@ the mode line. You can view imenu in a sidebar window, 
using, [[https://github.c
 
 To populate imenu, in LANGUAGE-ts-mode, we setup 
=treesit-simple-imenu-settings=, where each element
 is of form =(category regexp pred name-fn)=, but for many languages, you only need to specify the
-first two elements.  When name-fcn is nil the imenu names are generated the
-=treesit-defun-name-function= which we already setup.
+first two elements.  When name-fn is nil, the imenu names are generated by the
+=treesit-defun-name-function= which we already set up.
 
 #+begin_src emacs-lisp
   (defvar LANGUAGE-ts-mode--imenu-settings
@@ -1576,8 +1591,8 @@ patterns.
 
 * Setup: Outline, treesit-outline-predicate
 
-This needs to be setup if treesit-simple-imenu-settings isn't set and you are 
using a custom
-imenu-create-index-function as we did above.
+This needs to be set up if =treesit-simple-imenu-settings= has not been set 
and you are using a
+custom =imenu-create-index-function= as we did above.
 
 #+begin_src emacs-lisp
   (defun LANGUAGE-ts-mode--outline-predicate (node)
@@ -1604,7 +1619,7 @@ and
 
 ** Test: Outline
 
-To add tests, we follow similar pattern to our other tests above and leverage
+To add tests, we follow a similar pattern to our other tests above and leverage
 =t-utils-test-outline-search-function=.
 
 * Setup: Electric Pair, electric-pair-mode
@@ -1995,7 +2010,7 @@ version and learn from it.
 
 Tree-sitter powered modes provide highly accurate syntax coloring, 
indentation, and other features.
 In addition, tree-sitter modes are generally much more performant than the 
older-style regular
-expression based modes, especially for a reasonably complex programming 
language.
+expression-based modes, especially for a reasonably complex programming 
language.
 
 A downside of a tree-sitter mode is that the necessary 
=libtree-sitter-LANGUAGE.SLIB= shared library
 files are not provided with the =NAME-ts-mode='s that are shipped with Emacs. 
For =NAME-ts-mode='s
@@ -2097,7 +2112,7 @@ Install, using default branch
 
   If you use prev-line on the blank-line immediately after "b = 2;", you'll 
get the expected point
   below "b". If you use prev-line on the second blank line after "b = 2;", the point moves to the
-  first blank line after the "b = 2;" statuement which may not be what you 
want. Prehaps prev-real
+  first blank line after the "b = 2;" statement which may not be what you 
want. Perhaps prev-real
   should look backwards to the first prior line with non-whitespace. If 
there's concern about
   compatibility, treesit could be updated to have:
 
@@ -2140,7 +2155,7 @@ Example:
 #+end_example
 
 Note the build of the dll from 
https://github.com/emacs-tree-sitter/tree-sitter-langs is good.
-Perhaps, Visual Studio is needed and =M-x treesit-install-language-grammar= 
should look for
+Perhaps, Visual Studio is needed, and =M-x treesit-install-language-grammar= 
should look for
 that?
 
 ** =M-x treesit-install-language-grammar= doesn't check the ABI version.
@@ -2158,23 +2173,16 @@ If tree-sitter isn't found, it should offer to download 
it.
 ** M-q (prog-fill-reindent-defun) splits strings
 
 When the point is in a string and you type M-q it will split long strings into multiple lines which
-results in syntax errors in some languages, e.g. C.
-
-: char * str = "a very long string a very long string a very long string a 
very long string a very long string a very long string a very long string a 
very long string ";
-
-results in:
-
-Would like an option to have M-q indent or fill comments. When in a string it 
should do nothing
-if it can't guarantee the syntax will be correct. Ideally, we'd have a way to 
fill strings
-by using the appropriate string concatenation characters.
+results in syntax errors in some languages. It would be nice to either fix 
this or have an option
+that instructs M-q to indent or fill comments, but never split strings. When 
in a string it
+should do nothing if it can't guarantee the syntax will be correct. Ideally, 
we'd have a way to fill
+strings by using the appropriate string concatenation characters.
 
 ** Doc for treesit-thing-settings is misleading.
 
 It mentions a "comment" thing, but that is not used by treesit. Also looking 
at the
 setting for C/C++, what's written
 
-  : Here's an example treesit-thing-settings for C and C++:
-  :
   : ((c
   :   (defun "function_definition")
   :   (sexp (not "[](),[{}]"))
