Jungtaek Lim created SPARK-30946:
------------------------------------
Summary: FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow
type to serialize/deserialize entry
Key: SPARK-30946
URL: https://issues.apache.org/jira/browse/SPARK-30946
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim
HDFSMetadataLog and its descendants normally use JSON to serialize and
deserialize entries.
While JSON is good for backward compatibility (like field addition and field
deletion), it carries a lot of overhead: it repeats field names in every
record and has to render all data types as strings (at least when being
written to file), which works badly for some kinds of fields - e.g. timestamp.
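To make the overhead concrete, here is a small illustration (plain Python, not Spark code) comparing one sink-log-style entry serialized as JSON against the same values in a fixed binary layout. The field set is an illustrative subset; the path and sizes are made up:

```python
import json
import struct

# One metadata entry roughly as a JSON-based log would store it: every record
# repeats the field names, and the long timestamp becomes a decimal string.
entry = {
    "path": "hdfs://nn/data/part-00000",   # hypothetical file path
    "size": 4096,
    "modificationTime": 1582934400000,
}
json_bytes = json.dumps(entry).encode("utf-8")

# The same values in a fixed binary layout: each long takes exactly 8 bytes
# and the field names disappear entirely from the on-disk representation.
binary_bytes = entry["path"].encode("utf-8") + struct.pack(
    "<qq", entry["size"], entry["modificationTime"]
)

# The binary form is substantially smaller, and the gap grows with every
# entry because the JSON field names are repeated per record.
assert len(binary_bytes) < len(json_bytes)
```

The saving per entry is modest, but a compact batch rewrites every live entry, so the per-record overhead is multiplied across the whole log.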
This overhead hits CompactibleFileStreamLog especially hard, because the
"compact" operation has to load all entries, apply transformation/filtering
(though I haven't seen any effective operation implemented), and store them
all together in one file. This is one of the major reasons why the metadata
file grows so huge and the "compact" operation incurs unacceptable latency.
Fortunately, the entry classes for both FileStreamSourceLog (FileEntry) and
FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years; the
latest change to either class was in 2016. We can call them "stable" - which
gives us more options to optimize serde.
One easy but quite effective approach to optimizing serde is converting each
entry to UnsafeRow and storing it the same way we do in
HDFSBackedStateStoreProvider, and vice versa. That mechanism has been running
throughout the 2.x versions, so the approach can be considered safe.
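As a rough sketch of the row-based idea (again plain Python, not Spark's actual UnsafeRow or encoder code): a fixed-width region of 8-byte slots holds the numeric fields, while variable-length data such as the path is appended after it, with its offset and length packed into one slot - loosely the layout UnsafeRow uses for variable-length fields. The FileEntry-like field set here is hypothetical:

```python
import struct

FIXED_SLOTS = 3  # one 8-byte slot each for: path (offset/length), timestamp, batchId


def encode_entry(path: str, timestamp: int, batch_id: int) -> bytes:
    """Serialize one entry into a fixed-layout row followed by variable data."""
    path_bytes = path.encode("utf-8")
    offset = FIXED_SLOTS * 8                      # variable region starts here
    path_slot = (offset << 32) | len(path_bytes)  # pack offset+length into one long
    fixed = struct.pack("<qqq", path_slot, timestamp, batch_id)
    return fixed + path_bytes


def decode_entry(buf: bytes):
    """Deserialize without parsing any text: just unpack fixed offsets."""
    path_slot, timestamp, batch_id = struct.unpack_from("<qqq", buf, 0)
    offset, length = path_slot >> 32, path_slot & 0xFFFFFFFF
    path = buf[offset:offset + length].decode("utf-8")
    return path, timestamp, batch_id


# Round trip: decoding recovers exactly what was encoded.
entry = ("hdfs://nn/data/part-00000", 1582934400000, 42)
assert decode_entry(encode_entry(*entry)) == entry
```

Because the layout is positional, this only works while the entry schema stays frozen - which is exactly why the stability of FileEntry and SinkFileStatus since 2016 matters for this proposal.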
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]