Re: [ovs-dev] [PATCH] json: Improve json parsing

Rosemarie O'Riorden Tue, 22 Mar 2022 13:17:50 -0700

Hi Dumitru,

Thank you for the review! I'm going to send a v2 shortly with some changes.


> I'm not sure what the benefit is to have 'start' as 'const unsigned char
> *'; it could easily be 'const char *' and we wouldn't have to cast later
> on.  Also, a comment might be useful, describing that this points to the
> start of the input that has been preprocessed but not yet copied to
> 'buffer'.

I agree with this and have updated 'start' to be a const char *, which
took away the need for all those casts. I have also added a comment to
explain what p->start is.


> On the other hand, json_lex_keyword(p) calls json_parser_input(p, ..)
> which clears p->buffer at the end.
>
> So I thought that we should set p->start = NULL when we clear p->buffer.
>
> But json_parser_input() changes the lexer state back to JSON_LEX_START
> so p->start will be reset when json_lex_input() is called next.
>
> I don't really have a good suggestion about how to improve this, it just
> took me a bit to figure out why it's working correctly now.

Maybe I can add a few comments to better explain what is going on.
SInce p->start is updated at the next json_lex_input call, I guess
this logic doesn't need to be changed, but if you think it does, let
me know.


> I keep wondering if there's a way to avoid the p->start all together.
> What if we instead pass 'start' as an argument to json_lex_input() along
> with an offset to indicate the next character to process?

I thought about this for a while. I originally had this idea too, but
I found that it made less sense to me. If I removed start and replaced
it with an offset count, I'd need to track the offset with a pointer
that needs to be incremented for every character that is processed.
The way I have it now, p->start is only assigned once at every state
change, so it makes it quite simple in my opinion. If there is more
input on how I can improve this, please let me know.


> > -        json_lex_input(p, ' ');
> > +        p->start = space;
> > +        json_lex_input(p, space);
>
> This can be:
>
> p->start = " ";
> json_lex_input(p, " ");
>

I've updated this the way you suggested.


Thank you,
Rosemarie O'Riorden


On Thu, Mar 3, 2022 at 11:14 AM Dumitru Ceara <[email protected]> wrote:
>
> Hi Rosemarie,
>
> On 2/28/22 16:39, Rosemarie O'Riorden wrote:
> > To parse a json string prior to this change, json_lex_input is called
> > with each character of the string. If the character needs to be copied
> > to the buffer, it is copied individually. This is an expensive
> > operation, as often there are multiple characters in a row
> > that need to be copied, and copying memory in blocks is more efficient
> > than byte by byte. To improve this, the string is now copied
> > in blocks. As long as the parser state stays the same, no copy is
> > performed. As soon as the state is about to change (for example,
> > reaching the end of a string), the last n characters are copied to the
> > buffer.
> >
> > There is also a conditional that checks if the current character is a
> > new line, which is quite unlikely. When this was examined with perf, the
> > comparison had a very high CPU cycle usage. To improve this, the
> > OVS_UNLIKELY macro was used, which forces the compiler to switch the
> > order of the instructions.
> >
> > The json_lex_input function is called for each character of a string,
> > and calling the function so many times uses a lot of CPU cycles.
> > Making it an inline function greatly reduced the percentage of cycles
> > used by the call.
> >
> > Here is the improvement seen in the json-string-benchmark test:
> >
> >   SIZE      Q  S         BEFORE       AFTER      CHANGE
> > --------------------------------------------------------
> > 100000      0  0 :      0.842 ms     0.511 ms   -39.3 %
> > 100000      2  1 :      0.917 ms     0.561 ms   -38.8 %
> > 100000      10 1 :      1.063 ms     0.678 ms   -36.2 %
> > 10000000    0  0 :     85.328 ms    51.998 ms   -39.1 %
> > 10000000    2  1 :     92.555 ms    57.353 ms   -38.0 %
> > 10000000    10 1 :    106.728 ms    69.288 ms   -35.1 %
> > 100000000   0  0 :    955.375 ms   632.300 ms   -33.8 %
> > 100000000   2  1 :   1031.700 ms   689.375 ms   -33.2 %
> > 100000000   10 1 :   1189.300 ms   821.850 ms   -30.0 %
> >
>
> Nice!  I can reproduce similar results on my machine when compiling with
> -O2.
>
> I think your patch works but I have some comments below.
>
> > Here Q is probability (%) for a character to be a '\"' and
> > S is probability (%) to be a special character ( < 32).
> >
> > Signed-off-by: Rosemarie O'Riorden <[email protected]>
> > ---
> >  lib/json.c | 28 ++++++++++++++++++++++------
> >  1 file changed, 22 insertions(+), 6 deletions(-)
> >
> > diff --git a/lib/json.c b/lib/json.c
> > index 720c73d94..f48cb3241 100644
> > --- a/lib/json.c
> > +++ b/lib/json.c
> > @@ -101,6 +101,7 @@ struct json_parser {
> >      int line_number;
> >      int column_number;
> >      int byte_number;
> > +    const unsigned char *start;
> >
>
> I'm not sure what the benefit is to have 'start' as 'const unsigned char
> *'; it could easily be 'const char *' and we wouldn't have to cast later
> on.  Also, a comment might be useful, describing that this points to the
> start of the input that has been preprocessed but not yet copied to
> 'buffer'.
>
> >      /* Parsing. */
> >      enum json_parse_state parse_state;
> > @@ -976,16 +977,18 @@ json_lex_string(struct json_parser *p)
> >      }
> >  }
> >
> > -static bool
> > -json_lex_input(struct json_parser *p, unsigned char c)
> > +static inline ALWAYS_INLINE bool
> > +json_lex_input(struct json_parser *p, const unsigned char *ch)
>
> This can be 'const char *ch'.
>
> >  {
> >      struct json_token token;
> > +    unsigned char c = *ch;
> >
> >      switch (p->lex_state) {
> >      case JSON_LEX_START:
> >          switch (c) {
> >          case ' ': case '\t': case '\n': case '\r':
> >              /* Nothing to do. */
> > +            p->start = ch + 1;
>
> This shouldn't really be an issue but it feels a bit awkward that 'ch +
> 1' can actually point immediately after the end of the memory zone 'ch'
> points to.
>
> >              return true;
> >
> >          case 'a': case 'b': case 'c': case 'd': case 'e':
> > @@ -995,21 +998,25 @@ json_lex_input(struct json_parser *p, unsigned char c)
> >          case 'u': case 'v': case 'w': case 'x': case 'y':
> >          case 'z':
> >              p->lex_state = JSON_LEX_KEYWORD;
> > +            p->start = ch;
> >              break;
> >
> >          case '[': case '{': case ']': case '}': case ':': case ',':
> >              token.type = c;
> >              json_parser_input(p, &token);
> > +            p->start = ch + 1;
> >              return true;
> >
> >          case '-':
> >          case '0': case '1': case '2': case '3': case '4':
> >          case '5': case '6': case '7': case '8': case '9':
> >              p->lex_state = JSON_LEX_NUMBER;
> > +            p->start = ch;
> >              break;
> >
> >          case '"':
> >              p->lex_state = JSON_LEX_STRING;
> > +            p->start = ch + 1;
> >              return true;
> >
> >          default:
> > @@ -1024,6 +1031,7 @@ json_lex_input(struct json_parser *p, unsigned char c)
> >
> >      case JSON_LEX_KEYWORD:
> >          if (!isalpha((unsigned char) c)) {
> > +            ds_put_buffer(&p->buffer, (const char *) p->start, ch - 
> > p->start);
>
> If 'start' is of type 'const char *' we don't need the cast anymore.
>
> Shouldn't we reset p->start here?
>
> >              json_lex_keyword(p);
>
> On the other hand, json_lex_keyword(p) calls json_parser_input(p, ..)
> which clears p->buffer at the end.
>
> So I thought that we should set p->start = NULL when we clear p->buffer.
>
> But json_parser_input() changes the lexer state back to JSON_LEX_START
> so p->start will be reset when json_lex_input() is called next.
>
> I don't really have a good suggestion about how to improve this, it just
> took me a bit to figure out why it's working correctly now.
>
> I keep wondering if there's a way to avoid the p->start all together.
> What if we instead pass 'start' as an argument to json_lex_input() along
> with an offset to indicate the next character to process?
>
> >              return false;
> >          }
> > @@ -1031,6 +1039,7 @@ json_lex_input(struct json_parser *p, unsigned char c)
> >
> >      case JSON_LEX_NUMBER:
> >          if (!strchr(".0123456789eE-+", c)) {
> > +            ds_put_buffer(&p->buffer, (const char *) p->start, ch - 
> > p->start);
>
> Same comment about the cast here.
>
> >              json_lex_number(p);
> >              return false;
> >          }
> > @@ -1040,6 +1049,7 @@ json_lex_input(struct json_parser *p, unsigned char c)
> >          if (c == '\\') {
> >              p->lex_state = JSON_LEX_ESCAPE;
> >          } else if (c == '"') {
> > +            ds_put_buffer(&p->buffer, (const char *) p->start, ch - 
> > p->start);
>
> Here too.
>
> >              json_lex_string(p);
> >              return true;
> >          } else if (c < 0x20) {
> > @@ -1055,7 +1065,6 @@ json_lex_input(struct json_parser *p, unsigned char c)
> >      default:
> >          abort();
> >      }
> > -    ds_put_char(&p->buffer, c);
> >      return true;
> >  }
> >
> > @@ -1164,10 +1173,11 @@ size_t
> >  json_parser_feed(struct json_parser *p, const char *input, size_t n)
> >  {
> >      size_t i;
> > +    p->start = (const unsigned char *) input;
> >      for (i = 0; !p->done && i < n; ) {
> > -        if (json_lex_input(p, input[i])) {
> > +        if (json_lex_input(p, (unsigned const char *) &input[i])) {
>
> If we change json_lex_input() to accept 'const char *' instead, we don't
> need the cast.
>
> >              p->byte_number++;
> > -            if (input[i] == '\n') {
> > +            if (OVS_UNLIKELY(input[i] == '\n')) {
> >                  p->column_number = 0;
> >                  p->line_number++;
> >              } else {
> > @@ -1176,6 +1186,10 @@ json_parser_feed(struct json_parser *p, const char 
> > *input, size_t n)
> >              i++;
> >          }
> >      }
> > +    if (!p->done) {
> > +        ds_put_buffer(&p->buffer, (const char *) p->start,
> > +                (const unsigned char *) &input[i] - p->start);
>
> Same here, cast can be avoided.
>
> > +    }
> >      return i;
> >  }
> >
> > @@ -1189,6 +1203,7 @@ struct json *
> >  json_parser_finish(struct json_parser *p)
> >  {
> >      struct json *json;
> > +    const unsigned char *space = (const unsigned char *) " ";
> >
> >      switch (p->lex_state) {
> >      case JSON_LEX_START:
> > @@ -1201,7 +1216,8 @@ json_parser_finish(struct json_parser *p)
> >
> >      case JSON_LEX_NUMBER:
> >      case JSON_LEX_KEYWORD:
> > -        json_lex_input(p, ' ');
> > +        p->start = space;
> > +        json_lex_input(p, space);
>
> This can be:
>
> p->start = " ";
> json_lex_input(p, " ");
>
> >          break;
> >      }
> >
>
> Thanks,
> Dumitru
>

_______________________________________________
dev mailing list
[email protected]
https://mail.openvswitch.org/mailman/listinfo/ovs-dev

Re: [ovs-dev] [PATCH] json: Improve json parsing

Reply via email to