[
https://issues.apache.org/jira/browse/SPARK-6066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14340958#comment-14340958
]
Marcelo Vanzin edited comment on SPARK-6066 at 2/27/15 10:43 PM:
-----------------------------------------------------------------
Nothing wrong, I just don't see how it's much better. The user trying to read
it externally still needs to know that a given file extension means he needs
to use a particular compression codec. And he still needs to understand that
the first line, even though it's JSON, is not actually an event, but a header,
and needs to understand the contents of that header. (Right now I don't think
there's anything particularly interesting there, but at some point there might
- e.g. the Spark version might become important to help understand the rest of
the file.)
A library would make all that transparent to this user. Basically something
like "java.util.zip.ZipFile", where instead of bytes you have a collection of
"ZipEntries" (here you'd have a collection of "SparkListenerEvent").
No strong opinion one way or another, I just think the library is nicer for the
end user and more flexible in the long run.
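
To make the library idea concrete, here is a minimal sketch in Python of a reader that hides both the codec choice and the header line, so callers only iterate over events. Everything in it is an assumption for illustration: the class name EventLogReader, the extension-to-codec mapping, and the layout (a fully compressed file whose first line is a JSON header and whose remaining lines are one JSON event each). gzip and bz2 are stand-ins from the Python standard library, not the codecs Spark itself uses, and events come back as plain dicts rather than SparkListenerEvent objects.

```python
import bz2
import gzip
import json
import os
from typing import Iterator

# Hypothetical extension-to-opener map. gzip and bz2 are stand-ins from the
# Python standard library, not the codecs Spark itself uses.
_OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


class EventLogReader:
    """Hypothetical reader that hides the codec and the header from callers."""

    def __init__(self, path: str) -> None:
        ext = os.path.splitext(path)[1]
        opener = _OPENERS.get(ext, open)  # unknown extension: read as plain text
        self._file = opener(path, "rt", encoding="utf-8")
        # Assumption: the first line is a JSON header, not an event.
        first = self._file.readline()
        self.header = json.loads(first) if first.strip() else {}

    def __iter__(self) -> Iterator[dict]:
        # Assumption: every remaining non-empty line is one JSON event.
        for line in self._file:
            if line.strip():
                yield json.loads(line)

    def close(self) -> None:
        self._file.close()

    def __enter__(self) -> "EventLogReader":
        return self

    def __exit__(self, *exc_info) -> None:
        self.close()
```

A caller would write something like "with EventLogReader(path) as log:" and then loop over log, reading log.header only if it cares about the metadata. If the header format or the set of codecs changes later, only the reader has to change, which is the flexibility argument made above.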
> Metadata in event log makes it very difficult for external libraries to parse
> event log
> ---------------------------------------------------------------------------------------
>
> Key: SPARK-6066
> URL: https://issues.apache.org/jira/browse/SPARK-6066
> Project: Spark
> Issue Type: Bug
> Affects Versions: 1.3.0
> Reporter: Kay Ousterhout
> Assignee: Andrew Or
> Priority: Blocker
>
> The fix for SPARK-2261 added a line at the beginning of the event log that
> encodes metadata. This line makes it much more difficult to parse the event
> logs from external libraries (like
> https://github.com/kayousterhout/trace-analysis, which is used by folks at
> Berkeley) because:
> (1) The metadata is not written as JSON, unlike the rest of the file
> (2) More annoyingly, if the file is compressed, the metadata is not
> compressed. This has a few side-effects: first, someone can't just use the
> command line to uncompress the file and then look at the logs, because the
> file is in this weird half-compressed format; and second, now external tools
> that parse these logs also need to deal with this weird format.
> We should fix this before the 1.3 release, because otherwise we'll have to
> add a bunch more backward-compatibility code to handle this weird format!