[GitHub] spark pull request: [SPARK-1368][SQL] Optimized HiveTableScan

marmbrus Fri, 16 May 2014 07:03:12 -0700

Github user marmbrus commented on the pull request:

    https://github.com/apache/spark/pull/758#issuecomment-43242728
  
    > @marmbrus I worked around the test failure by adding a SortedOperation 
pattern that conservatively matches some definitely sorted operations (false 
negative rather than false positive). This may slow down the test suite a bit. 
Since most test output are empty or very small, this shouldn't be an issue 
right now.
    
    I think false negatives are the wrong direction to go here.  A false 
negative means that we treat the query as unordered when it is in fact ordered, 
and thus ignore the order of the results when we should be checking it.
    
    Maybe it would be better to recursively walk the tree looking explicitly 
for nodes that do not preserve order (aggregation, join, base relations) and 
then return false.  Sorts would return true.  Thoughts?
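
    A rough sketch of that walk, just to make the idea concrete. The node 
types below are made-up stand-ins, not the real Catalyst classes, and treating 
`Filter` and `Project` as order preserving is an assumption of the sketch:

```scala
object OrderCheckSketch {
  // Hypothetical, simplified plan nodes (not Catalyst's actual operators).
  sealed trait Plan { def children: Seq[Plan] }
  case class Relation(name: String) extends Plan { def children: Seq[Plan] = Nil }
  case class Sort(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Filter(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Project(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Aggregate(child: Plan) extends Plan { def children: Seq[Plan] = Seq(child) }
  case class Join(left: Plan, right: Plan) extends Plan {
    def children: Seq[Plan] = Seq(left, right)
  }

  // Walk the tree from the root: a Sort makes the output ordered, while
  // aggregations, joins and base relations do not preserve order.
  // Any other node is ordered only if everything beneath it is.
  def isSorted(plan: Plan): Boolean = plan match {
    case _: Sort => true
    case _: Aggregate | _: Join | _: Relation => false
    case other => other.children.nonEmpty && other.children.forall(isSorted)
  }
}
```

    With this, `isSorted(Filter(Sort(Relation("t"))))` is true, while a plan 
with an aggregate above the sort comes back false.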
    
    > New micro benchmark data:
    
    Sweet, looks like we shaved off a little bit more, so these optimizations 
were worth it!  It would be good to make notes on which changes led to what 
kind of speedup here.  That way, we can better focus our efforts when we 
optimize in the future.
    
    > As for Hive data unwrapping, I couldn't find a "static" method to 
eliminate right now. Any hints?
    
    My thought was that you would create an `Array` of `Any => Any` functions, 
one per column, that can be applied to the values of that column.  This way you 
only match on the data type once, at the beginning, and then simply index into 
this array instead of matching for each data item.
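
    Something along these lines, as a sketch only. The column types, 
`unwrapperFor` and `unwrapRow` are hypothetical names for illustration, and 
the conversions are placeholders for the real Hive unwrapping logic:

```scala
object UnwrapSketch {
  // Hypothetical column types standing in for the real Hive data types.
  sealed trait ColumnType
  case object StringColumn extends ColumnType
  case object IntColumn extends ColumnType
  case object OtherColumn extends ColumnType

  // Match on the data type once per column and return the function to reuse.
  def unwrapperFor(columnType: ColumnType): Any => Any = columnType match {
    case StringColumn =>
      (v: Any) => if (v == null) null else v.toString
    case IntColumn =>
      (v: Any) => if (v == null) null else v.asInstanceOf[java.lang.Integer].intValue
    case OtherColumn =>
      (v: Any) => v
  }

  // Per row, just index into the prebuilt array; no per-value type match.
  def unwrapRow(raw: Array[Any], unwrappers: Array[Any => Any]): Array[Any] = {
    val out = new Array[Any](raw.length)
    var i = 0
    while (i < raw.length) {
      out(i) = unwrappers(i)(raw(i))
      i += 1
    }
    out
  }

  // Built once, before scanning any rows.
  val schema: Array[ColumnType] = Array(StringColumn, IntColumn)
  val unwrappers: Array[Any => Any] = schema.map(unwrapperFor)
}
```

    The point is that `unwrappers` is computed once per scan, so the inner 
loop in `unwrapRow` is only an array lookup plus a function call per value.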


