GitHub user andrewor14 opened a pull request:

    https://github.com/apache/spark/pull/4821

    [SPARK-6066] Make event log format easier to parse

    The previous event log format was difficult to parse:
    ```
    sparkVersion = 1.3.0
    compressionCodec = org.apache.spark.io.LZFCompressionCodec
    === LOG_HEADER_END ===
    // actual events, could be compressed bytes
    ```
    When compression is turned on, for instance, the metadata is not
    compressed while the rest of the log is. Note that we can't compress the
    metadata because it contains the name of the compression codec, which we
    need to even open the log in the first place.
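
To illustrate why this layout was awkward, here is a minimal Python sketch of a reader for the old format. It assumes the header is UTF-8 text with one `key = value` pair per line, as in the example above; `split_old_event_log` is a hypothetical helper name, not something from Spark. The reader has to work on raw bytes, because everything after the marker may be compressed and cannot be treated as text.

```python
import io

# Marker line that separates the plain-text header from the event payload.
HEADER_END = b"=== LOG_HEADER_END ==="


def split_old_event_log(raw: bytes):
    """Split an old-format log into (metadata dict, remaining payload bytes).

    Assumption: header lines are UTF-8 "key = value" pairs. The payload is
    returned untouched; it may still be compressed with the codec named in
    the metadata.
    """
    stream = io.BytesIO(raw)
    metadata = {}
    for line in stream:
        line = line.rstrip(b"\r\n")
        if line == HEADER_END:
            break
        key, sep, value = line.decode("utf-8").partition("=")
        if sep:
            metadata[key.strip()] = value.strip()
    else:
        # The loop ran to the end of the input without seeing the marker.
        raise ValueError("header end marker not found")
    return metadata, stream.read()
```

A caller would still need to look up `compressionCodec` in the returned metadata and decompress the payload before reading any events, which is the two-stage handling the new format avoids.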
    
    The new format puts the compression codec and the Spark version in the
    log file name instead. It also represents the metadata in the first line
    of the event log as JSON, which is easy for third-party applications to
    parse:
    ```
    {"Event": "SparkListenerMetadataIdentifier", "SPARK_VERSION":"1.3.0", "COMPRESSION_CODEC":"..."}
    // actual events. If compression is turned on, the whole file, including the metadata, is compressed.
    ```
    and the file name looks something like:
    ```
    EVENT_LOG_app_123_SPARK_VERSION_1.3.1
    EVENT_LOG_app_123_SPARK_VERSION_1.3.1_COMPRESSION_CODEC_{...}
    ```
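
As a rough sketch of how a third-party tool might consume the new format, the Python below parses a file name of the shape shown above and validates the JSON metadata line. The regular expression is inferred only from the two example names, and `parse_log_name` and `parse_metadata_line` are hypothetical helper names, not part of Spark. The sketch does no decompression: for a compressed log, the first line is only readable after the file has been decompressed with the codec taken from the file name.

```python
import json
import re

# Pattern inferred from the example names; the codec suffix is optional.
LOG_NAME = re.compile(
    r"^EVENT_LOG_(?P<app_id>.+?)"
    r"_SPARK_VERSION_(?P<spark_version>.+?)"
    r"(?:_COMPRESSION_CODEC_(?P<codec>.+))?$"
)


def parse_log_name(name: str) -> dict:
    """Return app_id, spark_version and codec (None if uncompressed)."""
    match = LOG_NAME.match(name)
    if match is None:
        raise ValueError(f"not an event log file name: {name!r}")
    return match.groupdict()


def parse_metadata_line(line: str) -> dict:
    """Parse the first (already decompressed) line of a new-format log."""
    record = json.loads(line)
    if record.get("Event") != "SparkListenerMetadataIdentifier":
        raise ValueError("first line is not the metadata record")
    return record


# Usage; the codec value "lzf" is a made-up placeholder for illustration.
info = parse_log_name("EVENT_LOG_app_123_SPARK_VERSION_1.3.1_COMPRESSION_CODEC_lzf")
print(info["app_id"], info["spark_version"], info["codec"])
```

Keeping the codec in the file name means a reader can choose a decompressor before opening the file, and everything inside the file can then be handled uniformly.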
    
    I tested this with and without compression, using different compression
    codecs and event logging directories. I verified that both the `Master`
    and the `HistoryServer` can render both compressed and uncompressed logs
    as before.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/andrewor14/spark event-log-format

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4821.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4821
    
----
commit 8db5a06d108d8a2ddb8460e48e3509f46cc4fc2f
Author: Andrew Or <[email protected]>
Date:   2015-02-27T23:29:26Z

    Embed metadata in the event log file name instead
    
    This makes the event logs much easier to parse than before.
    As of this commit the whole file is either entirely compressed
    or not compressed, but not somewhere in between.

----

