On 2016-12-14 11:43, Paul Moore wrote:
I'm looking for a reasonably "clean" way to parse a log file that potentially 
has incomplete records in it.

The basic structure of the file is a set of multi-line records. Each record starts with a 
series of fields delimited by [...] (the first of which is always a date), optionally 
separated by whitespace. Then there's a trailing "free text" field, optionally 
followed by a multi-line field delimited by [[...]]

So, example records might be

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of 
the issue goes here

(a record delimited by the end of the line)

or

[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of 
the issue goes here [[Additional
data, potentially multiple lines

including blank lines
goes here
]]

The terminating ]] is on a line of its own.
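
Putting that together, a single well-formed record can be matched with
something like this (a sketch only - the field count and the exact
timestamp format are just what the examples above show):

import re

RECORD_RE = re.compile(
    r"^\[(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\]"
    r"\s*(?P<fields>(?:\[[^][]*\]\s*){5})"       # the five remaining [...] fields
    r"(?P<description>[^\n]*?)"                  # free text up to '[[' or end of line
    r"(?:\[\[(?P<additional>.*?)^\]\][ \t]*$)?"  # optional multi-line [[...]] block
    r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

RECORD_RE.finditer(log_text) walks the complete records well enough.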

This is a messy format to parse, but it's manageable. However, there's a catch. 
Because the logging software involved is broken, I can occasionally get a log 
record prematurely terminated with a new record starting mid-stream. So 
something like the following:

[2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00] 
[Component] [level] [] [] [id] Description of the issue goes here

I'm struggling to find a "clean" way to parse this. I've managed a clumsy
approach by splitting the file contents on the pattern
[dddd-dd-ddTdd:dd:dd.ddd+dd:dd] (the timestamp - I've never seen a case where
the timestamp itself gets truncated) and then treating each entry as a record
and parsing it individually. But the resulting code isn't exactly
maintainable, and I'm looking for something cleaner.
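
Roughly, the splitting step is something like this (a sketch rather than the
actual code; the per-chunk parsing that follows it is the part that got
messy, so it's elided here):

import re

TIMESTAMP_RE = re.compile(
    r"(\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\])"
)

def split_records(log_text):
    parts = TIMESTAMP_RE.split(log_text)
    # With a capturing group, re.split keeps the timestamps in the result:
    # parts[0] is anything before the first timestamp, and after that the
    # list alternates timestamp, body, timestamp, body, ...
    return [ts + body for ts, body in zip(parts[1::2], parts[2::2])]

A truncated record simply ends where the next timestamp begins, which is why
this copes with the broken logging - it's the per-record parsing afterwards
that turned into a mess.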

Does anyone have any suggestions for a good way to parse this data?

I think I'd do something like this:

# The parse_* helpers and process_record are left to be written; they work
# against an input object that can push text back (see the sketch below).
class TruncatedError(Exception):
    pass

while have_more(input):
    # At the start of a record.
    timestamp = parse_timestamp(input)

    fields = []
    description = None
    additional = None

    try:
        for _ in range(5):
            # A field shouldn't contain a '[', so if it sees one, it'll
            # push it back and return True for truncated.
            field, truncated = parse_field(input)
            fields.append(field)

            if truncated:
                raise TruncatedError()

        # The description shouldn't contain a timestamp, but if it does,
        # it'll push it back from that point and return True for truncated.
        description, truncated = parse_description(input)

        if truncated:
            raise TruncatedError()

        # The additional information shouldn't contain a timestamp, but if
        # it does, it'll push it back from that point and return True for
        # truncated.
        additional, truncated = parse_additional_information(input)

        if truncated:
            raise TruncatedError()
    except TruncatedError:
        process_record(timestamp, fields, description, additional,
                       truncated=True)
    else:
        process_record(timestamp, fields, description, additional)
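
None of the helpers above exist yet, of course; the key ingredient is an
input object that can hand unread text back. A rough sketch of that idea
(every name here is made up for illustration, not an existing API):

import re

TIMESTAMP_RE = re.compile(
    r"\s*\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\]"
)
FIELD_RE = re.compile(r"\s*\[([^][\n]*)\]")
OPEN_BRACKET_RE = re.compile(r"\[")

class PushbackInput:
    """A cursor over the log text; whatever isn't consumed stays 'pushed back'."""

    def __init__(self, text):
        self._text = text
        self._pos = 0

    def at_end(self):
        return self._pos >= len(self._text)

    def read_match(self, pattern):
        # Consume a regex match at the cursor and return it, or None.
        m = pattern.match(self._text, self._pos)
        if m is not None:
            self._pos = m.end()
        return m

    def read_up_to(self, pattern):
        # Consume and return text up to the next match of pattern (or EOF),
        # leaving the match itself unconsumed.
        m = pattern.search(self._text, self._pos)
        end = m.start() if m else len(self._text)
        chunk = self._text[self._pos:end]
        self._pos = end
        return chunk

def have_more(input):
    # Matches the call style used in the loop above.
    return not input.at_end()

def parse_timestamp(input):
    m = input.read_match(TIMESTAMP_RE)
    return m.group(1) if m else None

def parse_field(input):
    # Well-formed case: a complete [...] field with no '[' inside.
    m = input.read_match(FIELD_RE)
    if m is not None:
        return m.group(1), False
    # Otherwise a new record has started mid-field: consume the field's
    # opening '[' and the partial body, stopping at the intruding '[' so
    # the next iteration of the outer loop resumes at the new timestamp.
    input.read_match(OPEN_BRACKET_RE)
    partial = input.read_up_to(OPEN_BRACKET_RE)
    return partial, True

parse_description and parse_additional_information would follow the same
pattern, using read_up_to with a timestamp pattern (and ']]' for the
additional data) to spot a record that was cut short.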
