Github user hvanhovell commented on the pull request:

    https://github.com/apache/spark/pull/10335#issuecomment-165826708
  
    @gatorsmile I have been playing around with this for a bit. Overall I 
think we should do this. I do have two things for you to consider.
    
    The current approach follows the formal route: it implements a 
logical/physical operator and changes the planner, and - as you stated - it 
reuses quite a bit of the code in ```SparkContext```. The slowness of the 
current ```range``` operator comes from the fact that we create a normal 
```Row``` for each element; this is expensive because it creates 1E9 objects 
and converts each ```Row``` to an internal one. We could also address these 
two issues directly, by wrapping the iterator provided by ```sc.range``` 
differently:
    
        def range(start: Long, end: Long, step: Long, numPartitions: Int): DataFrame = {
          val logicalPlan = LogicalRDD(
            AttributeReference("id", LongType, nullable = false)() :: Nil,
            sparkContext.range(start, end, step, numPartitions).mapPartitions({ i =>
              val unsafeRow = new UnsafeRow
              unsafeRow.pointTo(new Array[Byte](16), 1, 16)
              i.map { id =>
                unsafeRow.setLong(0, id)
                unsafeRow
              }
            }, preservesPartitioning = true))(self)
          DataFrame(this, logicalPlan)
        }
    
    What do you think?
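    The key trick in the snippet above is reusing a single mutable row per 
partition instead of allocating one object per element. A minimal, Spark-free 
sketch of that pattern (```MutableLongRow``` and ```rangeIterator``` are 
hypothetical names, not from this PR) - including the usual caveat that a 
consumer which buffers the iterator sees only the last value written:

        // One mutable buffer is allocated per iterator and re-populated for
        // each element, so producing N values allocates O(1) row objects.
        final class MutableLongRow {
          private var value: Long = 0L
          def setLong(v: Long): Unit = { value = v }
          def getLong: Long = value
        }

        def rangeIterator(start: Long, end: Long): Iterator[MutableLongRow] = {
          val row = new MutableLongRow   // single allocation, reused below
          (start until end).iterator.map { i =>
            row.setLong(i)
            row                          // same instance every time
          }
        }

        // Safe: each value is consumed before the next one overwrites it.
        val sum = rangeIterator(0L, 5L).map(_.getLong).sum          // 10

        // Unsafe: buffering keeps references to the one shared row, so every
        // buffered element reflects only the last value written (here 4).
        val buffered = rangeIterator(0L, 5L).toArray.map(_.getLong)

    This is the same reason downstream operators must copy rows they intend 
to hold on to.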
    
    My second point is about the benchmarking. I have been toying with this PR 
and the benchmarking code, and I am not sure that the current results are as 
revealing as they should be. I think the ```sqlContext.range``` code in this PR 
is nearly as fast as the ```sc.range``` code; the big difference is caused by 
the fact that ```collect()``` involves serialization. Serializing a 
```Long``` is nowhere near as expensive as serializing an ```UnsafeRow```: in 
my benchmark of ```sqlContext.range```, serialization accounts for about 80-90% 
of the execution time (use the Spark Stage Timeline to see this).
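    A rough way to see this outside the Stage Timeline is to time the job with 
and without shipping the rows to the driver, i.e. compare ```count()``` against 
```collect()```. A small, Spark-free timing helper in that spirit (the 
```time``` helper is hypothetical, not part of this PR or of Spark):

        // Times a block and prints the elapsed milliseconds; the difference
        // between a traversal-only run and a traversal-plus-copy run
        // approximates the materialization overhead.
        def time[A](label: String)(body: => A): A = {
          val t0 = System.nanoTime()
          val result = body
          val ms = (System.nanoTime() - t0) / 1e6
          println(f"$label%-8s $ms%.2f ms")
          result
        }

        val n = 1000000L
        // "count()"-like: traverse without keeping the elements.
        val count = time("count")((0L until n).iterator.length)
        // "collect()"-like: traverse and copy every element out.
        val collected = time("collect")((0L until n).toArray)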

