ptmcg opened a new issue, #2860:
URL: https://github.com/apache/iceberg-python/issues/2860
### Feature Request / Improvement
I'm happy to see that pyparsing is part of the pyiceberg project. I've
attached two files that are railroad diagrams for this parser:
- pyiceberg_diag.html - browseable railroad diagram (SVG), non-terrminal
elements are links to subdiagrams
- pyiceberg_parser.png - static PNG version of the same diagram
Overall I think the parser is well-structured, and makes good use of the
`infix_notation`, `one_of`, and `DelimitedList` helpers.
My suggestions are largely cosmetic, but may add some parse-time performance
improvements.
Recommendations to use `set_name` are either for better diagramming, better
exception messages, or both.
## `set_results_name()` vs. `set_name()`
I'd like to point out the distinction between `set_name` and
`set_results_name`. `set_name` is most useful to describe the expression
itself, in abstract. `set_results_name` should be used to mark expressions with
some contextual name, so that their parsed value can be retrieved by that name.
I usually think of `set_name` describes what the expression _is_, and
`set_results_name` is how the expression _is used_.
For example, these expressions should probably use `set_name`, not
`set_results_name`:
```python
boolean = one_of(["true", "false"],
caseless=True).set_results_name("boolean")
string = sgl_quoted_string.set_results_name("raw_quoted_string")
decimal = common.real().set_results_name("decimal")
integer = common.signed_integer().set_results_name("integer")
literal = Group(string | decimal | integer |
boolean).set_results_name("literal")
literal_set = Group(
DelimitedList(string) | DelimitedList(decimal) | DelimitedList(integer)
| DelimitedList(boolean)
).set_results_name("literal_set")
```
You'll also find that exception messages are less cryptic with more use of
`set_name`.
```python
Word(nums).parse_string("NAN")
# raises ParseException: Expected W:(0-9), found 'NAN'
Word(nums).set_name("integer").parse_string("NAN")
# raises ParseException: Expected integer, found 'NAN'
```
There are 11 calls to set_results_name.
Many of the `set_results_name` calls are probably better as `set_name`
calls. I try to reserve `set_results_name` for expressions that need to be
referenced in a parse action, or other post-parsing code. "op" is a good
example of using `set_results_name`.
"column" is an excellent example of being an exception to this guideline. It
is used in many places, but always as the subject of some larger expression.
"column" should be defined using both `set_results_name` and `set_name`.
Lastly, you can keep your set_name and set_results_name calls to a minimum:
- insert a call to pyparsing.autoname_elements(), which will call set_name
on all locally defined ParserElements that have not already been given a name
(making this call before calling the `infix_notation` function will use the
names as part of that functions internal expression building)
- use the `expr("results_name")` short-cut for
`expr.set_results_name("results_name")`
## Collapse IS/IS NOT calls
You can reduce the number of terms in your parser by merging IS and IS NOT
expressions (fewer terms can translate into faster parsing). Here you define
separate `is_null` and `not_null` expressions in building null_check:
```python
is_null = column + IS + NULL
not_null = column + IS + NOT + NULL
null_check = is_null | not_null
```
I propose defining an IS_NOT expression, so that null_check can simplify
down (and also reduce 2 parse action functions down to one):
```python
IS_NOT = (IS + NOT).add_parse_action(lambda: "IS_NOT")
null_check = column + (IS_NOT | IS)("op") + NULL
@null_check.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
expr_class = IsNull if result.op == "IS" else NotNull
return expr_class(result.column)
```
Similar treatment can be given to IN/NOT IN and LIKE/NOT LIKE.
## Creating the diagram
I wrote this little script to build the attached diagrams:
```python
import pyiceberg.expressions.parser as ibp
ibp.boolean_expression.create_diagram("pyiceberg_parser_diag.html")
```
Run this before making any changes, and you'll get a diagram that is
difficult to navigate, probably even difficult to view! Make a few `set_name`
calls (like adding `set_name("column")` to `column`) and regenerate the diagram
and see how the structure becomes clearer. The non-terminals in the diagram are
actually links to their corresponding sub-diagrams, so navigation in a large
diagram is much easier. I've gone back and refactored a number of the pyparsing
examples, after generating their diagrams.
## Other notes
Not really a pyparsing note, but just offering this alternative style for
your if-elif-else chains:
```python
@left_ref.set_parse_action
def _(result: ParseResults) -> BooleanExpression:
op_classes = {
">": GreaterThan,
">=": GreaterThanOrEqual,
"<": LessThan,
"<=": LessThanOrEqual,
"=": EqualTo,
"==": EqualTo,
"!=": NotEqualTo,
"<>": NotEqualTo,
}
op_class = op_classes.get(result.op)
if op_class is not None:
return op_class(result.column, result.literal)
raise ValueError(f"Unsupported operation type: {result.op!r}")
```
I encourage people at work, when echoing erroneous values in an exception,
to add the "!r" format marker. This encloses the value in single quotes _and_
expands any non-printing characters to hex notation. A big help for when the
parser accidentally includes trailing space in the operator string, and so it
doesn't match any of the expected patterns.
## Finally
Again, your parser is fine as-is, these suggestions just may make it a bit
easier to maintain, and even a bit faster.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]