On Wed, Dec 14, 2016 at 10:43 PM, Paul Moore <p.f.mo...@gmail.com> wrote:
> This is a messy format to parse, but it's manageable. However, there's a 
> catch. Because the logging software involved is broken, I can occasionally 
> get a log record prematurely terminated with a new record starting 
> mid-stream. So something like the following:
>
> [2016-11-30T20:04:08.000+00:00] [Component] 
> [le[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description 
> of the issue goes here
>
> I'm struggling to find a "clean" way to parse this. I've managed a clumsy 
> approach by splitting the file contents on the pattern 
> [dddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where 
> this gets truncated) and then treating each entry as a record and parsing it 
> individually. But the resulting code isn't exactly maintainable, and I'm 
> looking for something cleaner.
>
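
The splitting approach you describe might look something like this (a
minimal sketch - the timestamp regex is inferred from your sample
lines, so treat the exact pattern as an assumption):

import re

# Inferred from the samples: [2016-11-30T20:04:08.000+00:00]
TIMESTAMP = re.compile(
    r'\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}\]')

def split_records(text):
    """Split raw log text into candidate records, one per timestamp."""
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    if not starts:
        return []  # no timestamps at all - nothing recognisable
    return [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]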

Is the "[Component]" section something you could verify? (That is - is
there a known list of components?) If so, I would include that as a
secondary check. Ditto anything else you can check (I'm guessing the
[level] is one of a small set of values too.) The logic would be
something like this:

Read line from file.
Verify line as a potential record:
    Assert that line begins with timestamp.
    Verify as many fields as possible (component, level, etc)
    Search line for additional timestamp.
    If additional timestamp found:
        Recurse. If verification fails, assume we didn't really have
        a corrupted line.
        (Process the partial line? Or discard it?)
    If "[[" in line:
        Until line is "]]":
            Read line from file, append to description
            If timestamp found:
                Recurse. If verification succeeds, break out of loop.
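
In (untested) Python, that might come out something like this - a
rough sketch only. KNOWN_COMPONENTS and KNOWN_LEVELS are hypothetical
whitelists you'd fill in from the real logs, and the "[[" multi-line
continuation handling is omitted for brevity:

import re

TIMESTAMP = re.compile(
    r'\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}\+\d{2}:\d{2}\]')
KNOWN_COMPONENTS = {'Component'}          # hypothetical whitelist
KNOWN_LEVELS = {'INFO', 'WARN', 'ERROR'}  # hypothetical whitelist

def looks_valid(record):
    """Secondary checks: timestamp up front, then known field values."""
    if not TIMESTAMP.match(record):
        return False
    fields = re.findall(r'\[([^\]]*)\]', record)
    # fields[0] is the timestamp; [1] component, [2] level in the sample.
    return (len(fields) >= 3 and
            fields[1] in KNOWN_COMPONENTS and
            fields[2] in KNOWN_LEVELS)

def parse_line(line):
    """Return the plausible records hiding in one physical line."""
    starts = [m.start() for m in TIMESTAMP.finditer(line)]
    if len(starts) <= 1:
        return [line]
    # A second timestamp *may* mean a truncated record followed by a
    # new one - but only split if the tail verifies as a real record.
    tail = line[starts[1]:]
    if looks_valid(tail):
        head = line[:starts[1]]           # the truncated fragment
        return [head] + parse_line(tail)  # recurse on the remainder
    return [line]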

Unfortunately it's still not really clean, but that's the nature of
working with messy data. Coping with ambiguity is *hard*.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list
