GitHub user esoroush opened a pull request:

    https://github.com/apache/spark/pull/456

    Coding Task

    The goal is to improve the performance of the HiveTableScan Operator:
    
As a quick benchmark, run the following code in the Scala interpreter:
    
    scala> :paste
    
    hql("CREATE TABLE IF NOT EXISTS sample (key1 INT, key2 INT, value STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','")
    hql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/sample2.txt' INTO TABLE sample")
    println("Result of SELECT * FROM sample:")
    val start = System.nanoTime
    val recs = hql("FROM sample SELECT key1, key2, value").collect()
    val micros = (System.nanoTime - start) / 1000
    println("%d microseconds".format(micros))
    
    scala> CTRL-D
    
    You can download the test file from here: 
    http://homes.cs.washington.edu/~soroush/sample2.txt
    
    sample2.txt contains about 3.6 million rows. The improved code scans the 
entire table in about 9 seconds, while the original code takes about 22 seconds.
    
    
    Regarding the last item in the task: 
    "Avoid Reading Unneeded Data - Some Hive Serializer/Deserializer (SerDe) 
interfaces support reading only the required columns from the underlying HDFS 
files.  We should use ColumnProjectionUtils to configure these correctly." 
    The approach should be similar to the following code: 
    
    
https://github.com/amplab/shark/blob/master/src/main/scala/shark/execution/TableScanOperator.scala
    
    I tried to take a similar approach, but I am not sure columnar reading is 
working in hiveOperators.scala right now; I need more time to verify that 
last feature. Please note that this was the first time I wrote code in Scala, 
and it took me some time to get comfortable with the language.  
     
    
    


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/esoroush/spark master

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/456.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #456
    
----
commit bd6894171c537627c09d04451ca1725dfc8228ae
Author: Emad Soroush <[email protected]>
Date:   2014-04-19T21:11:08Z

    test code commit

commit ddc1c2398deb56fdc08a36cc032c329ccdedc73b
Author: esoroush <[email protected]>
Date:   2014-04-19T21:30:52Z

    code task 2

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
