Re: _changes line breaks (was Re: _changes resource)

Matt Goodall Thu, 09 Jul 2009 03:04:31 -0700

2009/7/7 Matt Goodall <[email protected]>:
> Splitting the discussion of line breaks in the _changes document into
> separate email thread ...
>
> 2009/7/6 Chris Anderson <[email protected]>:
>> On Mon, Jul 6, 2009 at 5:50 AM, Matt Goodall<[email protected]> wrote:
>
>>> == Line Breaks ==
>>>
>>> If each results item is sent with its ending newline (the "," is sent
>>> with the next item) it would make clients much easier and correct to
>>> write, i.e. buffer bytes until a newline is received, split the
>>> buffer, process the row, repeat. You've still got to remove the ","
>>> from all but the first line but it's in a predictable place. Actually,
>>> I don't believe TCP provides any guarantees that bytes sent are
>>> received in the same chunks so relying on anything other than the
>>> newline is probably flawed.
>>>
>>> It's a trivial change, patch attached.
>>
>> There's a certain elegance to the current system. So far I've been
>> testing in the browser and it works fine. If there's demonstrated
>> problems for a client then we shouldn't hesitate to change it.
>
> Agreed, a comma at the end of the line is much prettier.
>
> The 'changes' tests are very unlikely to highlight any problem because
> there's such a small amount of data being sent (well below the MTU of
> the network device) and a sleep(100) is almost certainly enough to
> allow the data to arrive in the browser. If the tests caused lots of
> data to be sent and the browser was listening for data using an
> onreadystatechange callback we may be lucky enough to hit the problem.
> However, "almost" and "may" are not good words when it comes to
> testing ;-).
>
> Anyway, from experience I believe the only way to prove this is to
> explicitly have the bytes arrive slowly so, when I get a couple of
> minutes, I'll write something simple that will hopefully demonstrate
> the value of the newline terminator.
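
To make the chunking concern above concrete, here is a minimal sketch (Python 3, no CouchDB needed). The function names and the sample rows are made up for illustration. It feeds the same bytes to two readers using two different segmentations: one reader assumes every read delivers exactly one row, the other buffers until a newline.

```python
import json

def rows_per_read(chunks):
    """Naive reader: assumes each received chunk is exactly one row."""
    rows = []
    for chunk in chunks:
        try:
            rows.append(json.loads(chunk.strip(',\n')))
        except ValueError:
            # The transport split or merged rows, so this chunk is not a row.
            rows.append(None)
    return rows

def rows_per_line(chunks):
    """Newline-framed reader: only parses text that is followed by '\\n'."""
    rows, buffer = [], ''
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the tail.
        *complete, buffer = buffer.split('\n')
        rows.extend(json.loads(line.lstrip(',')) for line in complete if line)
    return rows

# The same bytes, delivered with two different (made-up) segmentations.
as_sent = ['{"seq":1}\n', ',{"seq":2}\n']
resegmented = ['{"seq":1}\n,{"se', 'q":2}\n']

print(rows_per_read(as_sent), rows_per_read(resegmented))
print(rows_per_line(as_sent), rows_per_line(resegmented))
```

The naive reader only works when the chunks happen to line up with the rows; the newline-framed reader returns the same rows for both segmentations.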


Attached is a quick and dirty Python script with two versions of
handling a continuous _changes stream:

    * changes_comma_eol works with CouchDB trunk.
    * changes_eol_comma works with a patched CouchDB.
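
For reference, the two response layouts those functions expect look roughly like this. This is a sketch in Python 3: the layout is inferred from the lines the script skips and strips, and the row contents are placeholders rather than real CouchDB output.

```python
import json

# Patched CouchDB (assumed layout): every row is sent with its own newline
# and the separating comma travels at the start of the next row.
EOL_COMMA = (
    '{"results":[\n'
    '{"seq":1,"id":"a"}\n'
    ',{"seq":2,"id":"b"}\n'
    '],\n'
    '"last_seq":2}\n'
)

# CouchDB trunk (assumed layout): the comma ends the line, so a row's line
# is only terminated once the next row is about to be sent.
COMMA_EOL = (
    '{"results":[\n'
    '{"seq":1,"id":"a"},\n'
    '{"seq":2,"id":"b"}\n'
    '],\n'
    '"last_seq":2}\n'
)

# Taken as a whole, both bodies are the same JSON document.
print(json.loads(EOL_COMMA) == json.loads(COMMA_EOL))
```

The difference only matters to a client that wants to process rows as they arrive instead of waiting for the whole body.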

I haven't tested them extensively, but I think both are correct,
although I wouldn't be surprised if I've missed some edge cases,
especially in the changes_comma_eol version. I think it's reasonably
clear which version is simpler, and therefore less error-prone, for a
client to implement.

There are definitely some things that could be done to improve the
comma_eol version a little but I wanted to keep the code as simple as
possible and I don't think there's any way of completely avoiding some
unnecessary JSON parsing.
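
To give a rough idea of the unnecessary parsing, here is a small sketch (Python 3; parse_attempts is a hypothetical helper and the row is a made-up sample). It counts how many times json.loads has to be called when a row arrives one character at a time and there is no newline to say the row is complete.

```python
import json

def parse_attempts(row_text):
    """Count json.loads calls needed to recognise a row that arrives one
    character at a time without a terminating newline."""
    buffer = ''
    attempts = 0
    for ch in row_text:
        buffer += ch
        attempts += 1
        try:
            row = json.loads(buffer)
        except ValueError:
            # Not a complete JSON value yet, wait for more characters.
            continue
        return attempts, row
    return attempts, None

row_text = '{"seq":1,"id":"a"}'
attempts, row = parse_attempts(row_text)
print(attempts, len(row_text), row)
```

With one-byte reads every prefix of the row costs a failed parse. A client could reduce that, for example by only trying to parse when the buffer ends with "}", but it cannot know the row is complete without trying.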

Hope that's useful. I'll create a ticket with the patch and the
example so it doesn't get lost.

- Matt

# Python 2 script: httplib and the print statement do not exist in Python 3.
import httplib
import json

def GET(db):
    conn = httplib.HTTPConnection('localhost', 5984)
    conn.request('GET', '/%s/_changes?continuous=true&timeout=2000'%db)
    response = conn.getresponse()
    def gen_data():
        while True:
            data = response.read(1)
            if not data:
                break
            yield data
        conn.close()
    return response.getheaders(), gen_data()

def changes_eol_comma(db, callback):
    headers, stream = GET(db)
    buffer = ''
    for data in stream:
        buffer += data
        lines = buffer.split('\n')
        # Iterate only the lines we know are complete.
        for line in lines[:-1]:
            # Skip uninteresting lines.
            if line == '{"results":[' or line == '],' or line == '':
                continue
            # Handle "last_seq" line to get final seq.
            if line.startswith('"last_seq"'):
                return int(line[11:-1])
            # We have a changes row left. It may include a leading comma.
            if line[0] == ',':
                line = line[1:]
            callback(json.loads(line))
        # Buffer remaining bytes.
        buffer = lines[-1]

def changes_comma_eol(db, callback):
    headers, stream = GET(db)
    buffer = ''
    for data in stream:
        buffer += data
        lines = buffer.split('\n')
        # Iterate all lines even though the last one may not be complete yet.
        for line in lines:
            # Skip uninteresting lines.
            if line == '{"results":[' or line == '],' or line == '':
                continue
            # Handle "last_seq" line to get final seq.
            if line.startswith('"last_seq"'):
                # But only if we've received the whole line.
                if line[-1] != '}':
                    buffer = line
                    continue
                return int(line[11:-1])
            # The line is now either a full change row or part of a line that
            # makes up some part of the document but is not enough to identify
            # it yet.
            try:
                # There may be a leading or trailing comma on the line
                # depending on whether a previous line's ending comma had
                # arrived when the line was parsed. Just in case, strip from
                # both ends.
                callback(json.loads(line.strip(',')))
            except ValueError:
                # Couldn't parse it so leave until we have more bytes.
                buffer = line
            else:
                buffer = ''

def changed(row):
    print "** change:", row

# Uncomment the version that matches CouchDB.
#last_seq = changes_eol_comma('test', changed)
last_seq = changes_comma_eol('test', changed)
print "** last_seq:", last_seq
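
If you want to try the newline-terminated approach without a running CouchDB, something like the following should do. It is only a sketch, written for Python 3 rather than the Python 2 of the script above. changes_eol_comma_from, fake_stream and FAKE_BODY are made-up names, and the response body is an assumed example of what a patched server would send.

```python
import json

def changes_eol_comma_from(chunks, callback):
    """Same logic as changes_eol_comma, but reads from any iterable of text
    chunks instead of an HTTP response."""
    buffer = ''
    for data in chunks:
        buffer += data
        lines = buffer.split('\n')
        # Only the lines followed by a newline are known to be complete.
        for line in lines[:-1]:
            if line in ('{"results":[', '],', ''):
                continue
            if line.startswith('"last_seq"'):
                return int(line[11:-1])
            # A row, possibly with the leading comma of the patched format.
            callback(json.loads(line.lstrip(',')))
        buffer = lines[-1]
    return None

# Assumed example body, not captured from a real server.
FAKE_BODY = (
    '{"results":[\n'
    '{"seq":1,"id":"a"}\n'
    ',{"seq":2,"id":"b"}\n'
    '],\n'
    '"last_seq":2}\n'
)

def fake_stream(text, size):
    """Yield text in pieces of `size` characters to imitate slow arrival."""
    for i in range(0, len(text), size):
        yield text[i:i + size]

rows = []
last_seq = changes_eol_comma_from(fake_stream(FAKE_BODY, 3), rows.append)
print(rows, last_seq)
```

Changing the chunk size passed to fake_stream should not change the rows or the final sequence number.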
