Jungtaek Lim created SPARK-30946:
------------------------------------
Summary: FileStreamSourceLog/FileStreamSinkLog: leverage UnsafeRow
type to serialize/deserialize entry
Key: SPARK-30946
URL: https://issues.apache.org/jira/browse/SPARK-30946
Project: Spark
Issue Type: Improvement
Components: Structured Streaming
Affects Versions: 3.0.0
Reporter: Jungtaek Lim
HDFSMetadataLog and its descendants normally use JSON to serialize and
deserialize entries.
While JSON is good for backward compatibility (like field addition and field
deletion), it carries a lot of overhead: it repeats field names in every
record and has to render all data types as strings (at least when being
written to file), which works badly for some kinds of fields - e.g. timestamp.
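To make the overhead concrete, here is a small illustration (plain Python, not Spark code) comparing one sink-log-style entry serialized as JSON against the same values in a fixed binary layout. The field set is an illustrative subset; the path and sizes are made up:

```python
import json
import struct

# One metadata entry roughly as a JSON-based log would store it: every record
# repeats the field names, and the long timestamp becomes a decimal string.
entry = {
    "path": "hdfs://nn/data/part-00000",   # hypothetical file path
    "size": 4096,
    "modificationTime": 1582934400000,
}
json_bytes = json.dumps(entry).encode("utf-8")

# The same values in a fixed binary layout: each long takes exactly 8 bytes
# and the field names disappear entirely from the on-disk representation.
binary_bytes = entry["path"].encode("utf-8") + struct.pack(
    "<qq", entry["size"], entry["modificationTime"]
)

# The binary form is substantially smaller, and the gap grows with every
# entry because the JSON field names are repeated per record.
assert len(binary_bytes) < len(json_bytes)
```

The saving per entry is modest, but a compact batch rewrites every live entry, so the per-record overhead is multiplied across the whole log.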
This overhead hits CompactibleFileStreamLog especially hard, because the
"compact" operation has to load all entries, apply transformation/filtering
(though I haven't seen any effective operation implemented), and store them
all together in one file. This is one of the major reasons why the metadata
file grows so huge and the "compact" operation incurs unacceptable latency.
Fortunately, the entry classes for both FileStreamSourceLog (FileEntry) and
FileStreamSinkLog (SinkFileStatus) haven't been modified for over 3 years; the
latest change to either class was in 2016. We can call them "stable" - which
gives us more options to optimize serde.
One easy but quite effective approach to optimizing serde is converting each
entry to UnsafeRow and storing it the same way we do in
HDFSBackedStateStoreProvider, and vice versa. That mechanism has been running
throughout the 2.x versions, so the approach can be considered safe.
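As a rough sketch of the row-based idea (again plain Python, not Spark's actual UnsafeRow or encoder code): a fixed-width region of 8-byte slots holds the numeric fields, while variable-length data such as the path is appended after it, with its offset and length packed into one slot - loosely the layout UnsafeRow uses for variable-length fields. The FileEntry-like field set here is hypothetical:

```python
import struct

FIXED_SLOTS = 3  # one 8-byte slot each for: path (offset/length), timestamp, batchId


def encode_entry(path: str, timestamp: int, batch_id: int) -> bytes:
    """Serialize one entry into a fixed-layout row followed by variable data."""
    path_bytes = path.encode("utf-8")
    offset = FIXED_SLOTS * 8                      # variable region starts here
    path_slot = (offset << 32) | len(path_bytes)  # pack offset+length into one long
    fixed = struct.pack("<qqq", path_slot, timestamp, batch_id)
    return fixed + path_bytes


def decode_entry(buf: bytes):
    """Deserialize without parsing any text: just unpack fixed offsets."""
    path_slot, timestamp, batch_id = struct.unpack_from("<qqq", buf, 0)
    offset, length = path_slot >> 32, path_slot & 0xFFFFFFFF
    path = buf[offset:offset + length].decode("utf-8")
    return path, timestamp, batch_id


# Round trip: decoding recovers exactly what was encoded.
entry = ("hdfs://nn/data/part-00000", 1582934400000, 42)
assert decode_entry(encode_entry(*entry)) == entry
```

Because the layout is positional, this only works while the entry schema stays frozen - which is exactly why the stability of FileEntry and SinkFileStatus since 2016 matters for this proposal.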
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]