Github user vanzin commented on the pull request:
https://github.com/apache/spark/pull/4821#issuecomment-76780546
@pwendell the header is needed because it contains potentially useful
information for the code parsing the logs. For example, now it contains the
Spark version, which might be needed to tell the parsing code which properties
to expect in the logs.
The original version of the change (the one that got rid of the directories
and used a single file) encoded all metadata in the file name. The feedback was
that it was ugly (long, cryptic file names) and brittle, since if you change
the file name, you lose that information. I agree with that and thus the header
was born.
Now we're back to encoding metadata in the file name. A simple extension is
not too bad, though, especially since you can probably figure out the compression
codec by looking at the first few bytes of the file. But the header still
provides useful information.
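
To illustrate the "first few bytes" idea, here is a rough sketch in Python, not what the patch does. The magic prefixes are my assumption about what the LZF, Snappy and LZ4 stream implementations write at the start of their output and should be checked against the actual library versions; `sniff_codec` and `sniff_codec_from_file` are hypothetical helper names.

```python
# Assumed stream prefixes; verify against the codec libraries actually in use.
MAGIC_PREFIXES = {
    "lzf": b"ZV",
    "snappy": b"\x82SNAPPY\x00",
    "lz4": b"LZ4Block",
}


def sniff_codec(first_bytes):
    """Return the codec whose assumed magic prefix matches, or None."""
    for name, magic in MAGIC_PREFIXES.items():
        if first_bytes.startswith(magic):
            return name
    # No match: treat the file as uncompressed or unknown.
    return None


def sniff_codec_from_file(path):
    """Read just enough bytes from the file to compare against every prefix."""
    longest = max(len(magic) for magic in MAGIC_PREFIXES.values())
    with open(path, "rb") as f:
        return sniff_codec(f.read(longest))
```

This only recovers the codec. It says nothing about the Spark version, which is why the header is still useful.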
So I'm a little worried that the latest patch removes the metadata
completely. Especially since it's common for the first event of the log to
*not* be the one that contains the Spark version
(`SparkListenerEnvironmentUpdate`?), and instead be
`SparkListenerBlockManagerAdded`.
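
Without a header, the parsing code would have to scan the events until it finds the one it needs. A minimal sketch of that scan in Python, assuming each line of the log is a JSON object with an `Event` field naming the listener event; `find_first_event` is a hypothetical helper, not something in Spark:

```python
import json


def find_first_event(lines, event_name):
    """Return the first parsed event whose "Event" field equals event_name."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except ValueError:
            # Skip lines that are not valid JSON (for example, a truncated tail).
            continue
        if isinstance(event, dict) and event.get("Event") == event_name:
            return event
    return None
```

The cost is that the reader has to parse an unknown number of events before it learns anything about the file, and gets nothing back if the event it wants is missing.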