Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21118#discussion_r183380926
  
    --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanUnsafeRow.java ---
    @@ -1,46 +0,0 @@
    -/*
    - * Licensed to the Apache Software Foundation (ASF) under one or more
    - * contributor license agreements.  See the NOTICE file distributed with
    - * this work for additional information regarding copyright ownership.
    - * The ASF licenses this file to You under the Apache License, Version 2.0
    - * (the "License"); you may not use this file except in compliance with
    - * the License.  You may obtain a copy of the License at
    - *
    - *    http://www.apache.org/licenses/LICENSE-2.0
    - *
    - * Unless required by applicable law or agreed to in writing, software
    - * distributed under the License is distributed on an "AS IS" BASIS,
    - * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    - * See the License for the specific language governing permissions and
    - * limitations under the License.
    - */
    -
    -package org.apache.spark.sql.sources.v2.reader;
    -
    -import java.util.List;
    -
    -import org.apache.spark.annotation.InterfaceStability;
    -import org.apache.spark.sql.Row;
    -import org.apache.spark.sql.catalyst.expressions.UnsafeRow;
    -
    -/**
    - * A mix-in interface for {@link DataSourceReader}. Data source readers can implement this
    - * interface to output {@link UnsafeRow} directly and avoid the row copy at Spark side.
    - * This is an experimental and unstable interface, as {@link UnsafeRow} is not public and may get
    - * changed in the future Spark versions.
    - */
    -@InterfaceStability.Unstable
    -public interface SupportsScanUnsafeRow extends DataSourceReader {
    --- End diff --
    
    I think we still need this trait.
    
    In Spark SQL there is a contract that all operators produce `UnsafeRow`, except for some Dataset-related operators. We have operators that assume their input rows are `UnsafeRow` and cast them directly, e.g. operators that use `GenerateUnsafeRowJoiner`.
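    To make the contract concrete, here is a minimal, self-contained sketch. The `InternalRow`/`UnsafeRow` classes below are simplified stand-ins, not the real Spark classes, and `joinRows` is a hypothetical operator in the spirit of code built on `GenerateUnsafeRowJoiner`: it casts its inputs without checking, so a non-unsafe input row fails with a `ClassCastException` at runtime.

```java
// Simplified stand-ins for Spark's InternalRow/UnsafeRow (the real classes
// live in org.apache.spark.sql.catalyst); this only illustrates the
// contract, not the real binary row layout.
class InternalRow {
    final Object[] values;
    InternalRow(Object[] values) { this.values = values; }
}

class UnsafeRow extends InternalRow {
    UnsafeRow(Object[] values) { super(values); }
}

public class ContractDemo {
    // A hypothetical operator relying on the contract: it blindly casts its
    // inputs to UnsafeRow, as operators using GenerateUnsafeRowJoiner assume.
    static Object[] joinRows(InternalRow left, InternalRow right) {
        UnsafeRow l = (UnsafeRow) left;  // throws if the contract is broken
        UnsafeRow r = (UnsafeRow) right;
        Object[] out = new Object[l.values.length + r.values.length];
        System.arraycopy(l.values, 0, out, 0, l.values.length);
        System.arraycopy(r.values, 0, out, l.values.length, r.values.length);
        return out;
    }

    public static void main(String[] args) {
        Object[] joined = joinRows(new UnsafeRow(new Object[]{1}),
                                   new UnsafeRow(new Object[]{2}));
        System.out.println(joined.length); // 2
    }
}
```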
    
    That is to say, the data source scan has to apply an unsafe projection to guarantee it produces unsafe rows. This is wasted work if the data source already produces unsafe rows.
    
    For file-based data sources, we solve this issue by adding a flag `needsUnsafeRowConversion` to decide whether the unsafe projection is needed. Another solution is what @rdblue proposed in the dev list discussion: do an `isInstanceOf[UnsafeRow]` check on each input row and skip the unsafe projection if the row is already unsafe. That may carry a performance penalty because of the per-row check.
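    A minimal sketch of the two options, again with simplified stand-in row classes rather than the real Spark ones (`toUnsafe` is a hypothetical stand-in for an unsafe projection):

```java
// Simplified stand-ins for Spark's InternalRow/UnsafeRow, only to contrast
// the two conversion strategies; names and shapes are assumptions, not the
// real Spark classes.
class InternalRow {
    final Object[] values;
    InternalRow(Object[] values) { this.values = values; }
}

class UnsafeRow extends InternalRow {
    UnsafeRow(Object[] values) { super(values); }
}

public class ScanConversionDemo {
    // Stand-in for an unsafe projection: copies the row into "unsafe" form.
    static UnsafeRow toUnsafe(InternalRow row) {
        return new UnsafeRow(row.values.clone());
    }

    // Flag-based approach (in the spirit of needsUnsafeRowConversion):
    // decided once per scan, so no per-row branch when the source already
    // emits unsafe rows.
    static InternalRow convertWithFlag(InternalRow row, boolean needsConversion) {
        return needsConversion ? toUnsafe(row) : row;
    }

    // Per-row approach: every row pays an instanceof check, even when the
    // source always emits unsafe rows.
    static InternalRow convertPerRow(InternalRow row) {
        return row instanceof UnsafeRow ? row : toUnsafe(row);
    }

    public static void main(String[] args) {
        InternalRow unsafe = new UnsafeRow(new Object[]{1});
        // The flag-based path returns the row as-is, no copy:
        System.out.println(convertWithFlag(unsafe, false) == unsafe); // true
        // The per-row path also avoids the copy, but only after a type test:
        System.out.println(convertPerRow(unsafe) == unsafe); // true
    }
}
```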
    
    So this trait is still useful, at least for the built-in file-based data 
sources.

