GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/9305
[SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow
This PR fixes two issues:
1. `PhysicalRDD.outputsUnsafeRows` is always `false`
   Because of this, the planner inserts a redundant `ConvertToUnsafe` operator even when the underlying data source relation already outputs `UnsafeRow`s.
1. Internal/external row conversion for `HadoopFsRelation` is messy
   Currently we use `HadoopFsRelation.needConversion` and a [dirty type erasure hack][1] to indicate whether a relation outputs external or internal rows, and to apply external-to-internal conversion when necessary. All built-in `HadoopFsRelation` data sources (Parquet, JSON, and ORC) output `InternalRow`s, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`s (a sketch of the hack follows this list).
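The sketch below is a hypothetical helper, not actual Spark code; it only shows the shape of the erasure hack referenced in [1], where an `RDD[InternalRow]` is handed back through an API typed as `RDD[Row]`:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// The cast compiles and succeeds at runtime only because of type erasure;
// needConversion = false is the only hint callers get that the "Rows" in
// this RDD are actually InternalRows.
def disguiseAsExternal(internal: RDD[InternalRow]): RDD[Row] =
  internal.asInstanceOf[RDD[Row]]
```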
This PR adds a `private[sql]` interface method, `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts the resulting `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All built-in `HadoopFsRelation` data sources override this method and output `UnsafeRow`s directly. This way, `HadoopFsRelation` always produces `UnsafeRow`s, and `PhysicalRDD.outputsUnsafeRows` can be set properly by checking whether the underlying data source is a `HadoopFsRelation`.
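The default conversion could look roughly like the sketch below. This is not the exact patch: `rowsToUnsafe` is a hypothetical helper, and `CatalystTypeConverters` and `UnsafeProjection` are unstable Spark-internal APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.StructType

// Converts the external Rows produced by buildScan into UnsafeRows.
def rowsToUnsafe(rows: RDD[Row], schema: StructType): RDD[InternalRow] =
  rows.mapPartitions { iter =>
    // Create both converters once per partition and reuse them for every row.
    val toInternal = CatalystTypeConverters.createToCatalystConverter(schema)
    val toUnsafe: InternalRow => InternalRow = UnsafeProjection.create(schema)
    // Caveat: UnsafeProjection reuses its output buffer, so downstream
    // operators that buffer rows must copy them first.
    iter.map(row => toUnsafe(toInternal(row).asInstanceOf[InternalRow]))
  }
```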
A remaining question: can we assume that all non-built-in `HadoopFsRelation` data sources output external rows? At least all well-known ones do. However, it's possible that some users have implemented their own `HadoopFsRelation` data sources that leverage `InternalRow`, and thus depend on those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and clean up some more conversion code (like [here][2] and [here][3]).
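For context, the cleanup is blocked only if a relation like the following (hypothetical) one exists in the wild, declaring via `needConversion` that its `buildScan` output is already internal:

```scala
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical third-party relation. needConversion is defined on
// BaseRelation, which HadoopFsRelation extends; returning false tells Spark
// that buildScan already produces internal rows, so the external-to-internal
// conversion step must be skipped.
abstract class MyInternalRowRelation extends BaseRelation {
  override def needConversion: Boolean = false
}
```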
This PR supersedes #9125.
Follow-ups:
1. Make the JSON and ORC data sources output `UnsafeRow` directly
1. Make `HiveTableScan` output `UnsafeRow` directly
   This is related to item 1, since the ORC data source shares its `Writable` unwrapping code with `HiveTableScan`.
[1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
[2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
[3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-11345.unsafe-hadoop-fs-relation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9305.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9305
----
commit c161bf9c5e6a1c94751f96da99a0df0e44f75e61
Author: Cheng Lian <[email protected]>
Date: 2015-10-27T13:34:08Z
Makes all HadoopFsRelation data sources produce UnsafeRow
----