GitHub user liancheng opened a pull request:
https://github.com/apache/spark/pull/9305
[SPARK-11117] [SPARK-11345] [SQL] Makes all HadoopFsRelation data sources produce UnsafeRow
This PR fixes two issues:
1. `PhysicalRDD.outputsUnsafeRows` is always `false`
   Because of this, the planner inserts a redundant `ConvertToUnsafe` operator even when the underlying data source relation already outputs `UnsafeRow`s.
1. Internal/external row conversion for `HadoopFsRelation` is messy
   Currently we use `HadoopFsRelation.needConversion` and a [dirty type erasure hack][1] to indicate whether a relation outputs external or internal rows, and to apply external-to-internal conversion when necessary. All built-in `HadoopFsRelation` data sources (Parquet, JSON, and ORC) output `InternalRow`s, while typical external `HadoopFsRelation` data sources, e.g. spark-avro and spark-csv, output `Row`s (a sketch of the hack follows this list).
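The sketch below is a hypothetical helper, not actual Spark code; it only shows the shape of the erasure hack referenced in [1], where an `RDD[InternalRow]` is handed back through an API typed as `RDD[Row]`:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.InternalRow

// The cast compiles and succeeds at runtime only because of type erasure;
// needConversion = false is the only hint callers get that the "Rows" in
// this RDD are actually InternalRows.
def disguiseAsExternal(internal: RDD[InternalRow]): RDD[Row] =
  internal.asInstanceOf[RDD[Row]]
```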
This PR adds a `private[sql]` interface method, `HadoopFsRelation.buildInternalScan`, which by default invokes `HadoopFsRelation.buildScan` and converts the resulting `Row`s to `UnsafeRow`s (which are also `InternalRow`s). All built-in `HadoopFsRelation` data sources override this method and output `UnsafeRow`s directly. This way, `HadoopFsRelation` always produces `UnsafeRow`s, and `PhysicalRDD.outputsUnsafeRows` can be set properly by checking whether the underlying data source is a `HadoopFsRelation`.
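The default conversion could look roughly like the sketch below. This is not the exact patch: `rowsToUnsafe` is a hypothetical helper, and `CatalystTypeConverters` and `UnsafeProjection` are unstable Spark-internal APIs.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.{CatalystTypeConverters, InternalRow}
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.StructType

// Converts the external Rows produced by buildScan into UnsafeRows.
def rowsToUnsafe(rows: RDD[Row], schema: StructType): RDD[InternalRow] =
  rows.mapPartitions { iter =>
    // Create both converters once per partition and reuse them for every row.
    val toInternal = CatalystTypeConverters.createToCatalystConverter(schema)
    val toUnsafe: InternalRow => InternalRow = UnsafeProjection.create(schema)
    // Caveat: UnsafeProjection reuses its output buffer, so downstream
    // operators that buffer rows must copy them first.
    iter.map(row => toUnsafe(toInternal(row).asInstanceOf[InternalRow]))
  }
```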
A remaining question: can we assume that all non-built-in `HadoopFsRelation` data sources output external rows? At least all well-known ones do. However, it's possible that some users have implemented their own `HadoopFsRelation` data sources that leverage `InternalRow`, and thus depend on those unstable internal data representations. If this assumption is safe, we can deprecate `HadoopFsRelation.needConversion` and clean up some more conversion code (like [here][2] and [here][3]).
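For context, the cleanup is blocked only if a relation like the following (hypothetical) one exists in the wild, declaring via `needConversion` that its `buildScan` output is already internal:

```scala
import org.apache.spark.sql.sources.BaseRelation

// Hypothetical third-party relation. needConversion is defined on
// BaseRelation, which HadoopFsRelation extends; returning false tells Spark
// that buildScan already produces internal rows, so the external-to-internal
// conversion step must be skipped.
abstract class MyInternalRowRelation extends BaseRelation {
  override def needConversion: Boolean = false
}
```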
This PR supersedes #9125.
Follow-ups:
1. Make the JSON and ORC data sources output `UnsafeRow` directly
1. Make `HiveTableScan` output `UnsafeRow` directly
   This is related to item 1, since the ORC data source shares its `Writable` unwrapping code with `HiveTableScan`.
[1]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala#L353
[2]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala#L331-L335
[3]: https://github.com/apache/spark/blob/v1.5.1/sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala#L630-L669
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/liancheng/spark spark-11345.unsafe-hadoop-fs-relation
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9305.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9305
----
commit c161bf9c5e6a1c94751f96da99a0df0e44f75e61
Author: Cheng Lian <[email protected]>
Date: 2015-10-27T13:34:08Z
Makes all HadoopFsRelation data sources produce UnsafeRow
----