On 2016-12-14 11:43, Paul Moore wrote:
I'm looking for a reasonably "clean" way to parse a log file that potentially
has incomplete records in it.
The basic structure of the file is a set of multi-line records. Each record starts with a
series of fields delimited by [...] (the first of which is always a date), optionally
separated by whitespace. Then there's a trailing "free text" field, optionally
followed by a multi-line field delimited by [[...]]
So, example records might be
[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of
the issue goes here
(a record delimited by the end of the line)
or
[2016-11-30T20:04:08.000+00:00] [Component] [level] [] [] [id] Description of
the issue goes here [[Additional
data, potentially multiple lines
including blank lines
goes here
]]
The terminating ]] is on a line of its own.
This is a messy format to parse, but it's manageable. However, there's a catch.
Because the logging software involved is broken, I can occasionally get a log
record prematurely terminated with a new record starting mid-stream. So
something like the following:
[2016-11-30T20:04:08.000+00:00] [Component] [le[2016-11-30T20:04:08.000+00:00]
[Component] [level] [] [] [id] Description of the issue goes here
I'm struggling to find a "clean" way to parse this. I've managed a clumsy
approach, by splitting the file contents on the pattern [dddd-dd-ddTdd:dd:dd.ddd+dd:dd]
(the timestamp - I've never seen a case where this gets truncated) and then treating each
entry as a record and parsing it individually. But the resulting code isn't exactly
maintainable, and I'm looking for something cleaner.
Does anyone have any suggestions for a good way to parse this data?
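For reference, the splitting approach described above could be sketched roughly as follows. This is only an illustration, not the original poster's code: split_records and the TIMESTAMP regex are hypothetical names, and the regex assumes the timestamp always has exactly the shape shown in the examples and is never truncated.

```python
import re

# Assumed timestamp shape: [YYYY-MM-DDTHH:MM:SS.mmm+HH:MM]
TIMESTAMP = re.compile(
    r"\[\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}\]"
)


def split_records(text):
    """Split log text into chunks, each beginning at a timestamp.

    Any text before the first timestamp is dropped.
    """
    starts = [m.start() for m in TIMESTAMP.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]
```

Each chunk then holds at most one record, complete or truncated, and can be parsed on its own.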
I think I'd do something like this:
while have_more(input):
    # At the start of a record.
    timestamp = parse_timestamp(input)
    fields = []
    description = None
    additional = None

    try:
        for i in range(5):
            # A field shouldn't contain a '[', so if it sees one, it'll
            # push it back and return True for truncated.
            field, truncated = parse_field(input)
            fields.append(field)
            if truncated:
                raise TruncatedError()

        # The description shouldn't contain a timestamp, but if it does,
        # it'll push it back from that point and return True for truncated.
        description, truncated = parse_description(input)
        if truncated:
            raise TruncatedError()

        # The additional information shouldn't contain a timestamp, but if
        # it does, it'll push it back from that point and return True for
        # truncated.
        additional, truncated = parse_additional_information(input)
        if truncated:
            raise TruncatedError()
    except TruncatedError:
        process_record(timestamp, fields, description, additional,
                       truncated=True)
    else:
        process_record(timestamp, fields, description, additional)
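
To make that outline concrete, here is a minimal runnable sketch of the same idea. It is not a definitive implementation. Instead of pushing characters back onto the input, it keeps a position in the text and never reads past the start of the next timestamp, which has the same effect. The names parse_records, TIMESTAMP, FIELD and N_FIELDS are my own, and it assumes that fields never contain brackets, that the description never contains "[[", and that a complete record ends at a newline.

```python
import re

# Assumed timestamp shape: [YYYY-MM-DDTHH:MM:SS.mmm+HH:MM]
TIMESTAMP = re.compile(
    r"\[(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2})\]"
)
# A complete [field], optionally preceded by whitespace.
FIELD = re.compile(r"\s*\[([^\[\]]*)\]")
N_FIELDS = 5  # bracketed fields that follow the timestamp


def parse_records(text):
    """Yield (timestamp, fields, description, additional, truncated)."""
    pos = 0
    while True:
        start = TIMESTAMP.search(text, pos)
        if start is None:
            return
        timestamp = start.group(1)
        pos = start.end()

        # A record can extend no further than the next timestamp.
        nxt = TIMESTAMP.search(text, pos)
        limit = nxt.start() if nxt else len(text)

        fields, description, additional, truncated = [], None, None, False
        for _ in range(N_FIELDS):
            m = FIELD.match(text, pos, limit)
            if m is None:
                # Ran into the next record before the field was closed.
                truncated = True
                break
            fields.append(m.group(1))
            pos = m.end()

        if not truncated:
            rest = text[pos:limit]
            head, opener, tail = rest.partition("[[")
            description = head.strip()
            if opener:
                # The closing ]] is expected on a line of its own.
                body, closer, _ = tail.partition("\n]]")
                additional = body
                truncated = not closer
            elif nxt is not None and not rest.endswith("\n"):
                # The next record started in the middle of a line.
                truncated = True

        yield timestamp, fields, description, additional, truncated
        pos = limit
```

Each yielded tuple can be passed straight to something like process_record. A description that is cut off exactly at a line break can't be distinguished from a complete one, so that case is reported as not truncated.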