[
https://issues.apache.org/jira/browse/SPARK-15280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Yin Huai updated SPARK-15280:
-----------------------------
Description:
Summary:
This is a proposal to move the ORC serialization logic from OrcOutputWriter into a
new class (OrcSerializer) that can be reused to serialize an InternalRow to a
Writable object, so that rows can be written to an ORC file via a RecordWriter.
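A rough sketch of the proposed split; the signature is illustrative, not final:
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.Writable
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Sketch only: the serialization plumbing that currently lives inside
// OrcOutputWriter would move here, unchanged.
class OrcSerializer(dataSchema: StructType, conf: Configuration) {
  // Convert a Catalyst InternalRow into a Writable that a RecordWriter
  // obtained from OrcOutputFormat can persist.
  def serialize(row: InternalRow): Writable = ??? // body extracted from OrcOutputWriter
}
{code}
OrcOutputWriter would then simply call serialize(row) and hand the result to its
RecordWriter.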
Details:
Since Spark doesn't support sort-merge-bucket (SMB) joins yet, we would like to do
the SMB join on the Spark application side. Using the DataFrame API to read and
write ORC files is easier, but we also wanted to parallelize the work so that
there is one task per Hive bucket. That approach didn't work because nested RDDs
are not supported: a DataFrame cannot be created or read on an executor, as
sketched below.
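For illustration, this is the kind of parallelization that fails (Bucket and
bucketPaths are hypothetical names):
{code:scala}
// Hypothetical bucket descriptor; paths are placeholders.
case class Bucket(inputPath: String, outputPath: String)
val bucketPaths: Seq[Bucket] = ???

// Does not work: the DataFrame API is driver-only, so touching
// spark.read / df.write inside a task fails on the executor.
sc.parallelize(bucketPaths).foreach { bucket =>
  val df = spark.read.orc(bucket.inputPath) // no SparkSession on executors
  df.write.orc(bucket.outputPath)
}
{code}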
The workaround is to create an ORC reader and writer directly on each executor
instead of a DataFrame. For reading an ORC file, OrcFile.createReader works fine.
For writing, OrcOutputFormat().getRecordWriter does the trick. The missing part,
however, is serializing an InternalRow to a Writable object. To make that
serialization reusable, I am proposing to split OrcOutputWriter into
OrcOutputWriter and OrcSerializer, as in the sketch below.
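A minimal sketch of the executor-side workaround, assuming the proposed
OrcSerializer from the summary exists (paths are placeholders):
{code:scala}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.hive.ql.io.orc.{OrcFile, OrcOutputFormat}
import org.apache.hadoop.io.{NullWritable, Writable}
import org.apache.hadoop.mapred.{JobConf, RecordWriter, Reporter}

val conf = new Configuration()

// Reading works today: OrcFile.createReader can run on an executor.
val reader = OrcFile.createReader(new Path("/bucket/in.orc"), OrcFile.readerOptions(conf))
val rows = reader.rows() // iterate with rows.hasNext / rows.next(null)

// Writing: OrcOutputFormat hands back a RecordWriter...
val outPath = "/bucket/out.orc"
val writer = new OrcOutputFormat()
  .getRecordWriter(new Path(outPath).getFileSystem(conf), new JobConf(conf), outPath, Reporter.NULL)
  .asInstanceOf[RecordWriter[NullWritable, Writable]]

// ...but there is no public way to turn an InternalRow into a Writable.
// With the proposed OrcSerializer, the missing step would be:
//   val serializer = new OrcSerializer(dataSchema, conf)
//   writer.write(NullWritable.get(), serializer.serialize(internalRow))
writer.close(Reporter.NULL)
{code}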
> Extract ORC serialization logic from OrcOutputWriter for reusability
> ---------------------------------------------------------------------
>
> Key: SPARK-15280
> URL: https://issues.apache.org/jira/browse/SPARK-15280
> Project: Spark
> Issue Type: Improvement
> Components: Input/Output
> Reporter: Ergin Seyfe
> Assignee: Ergin Seyfe
> Priority: Minor
> Fix For: 2.0.0
>