Github user chenliang613 commented on the issue:
https://github.com/apache/carbondata/pull/2412
I used the script below to generate the test data:
```scala
import scala.util.Random

val r = new Random()
val df = spark.sparkContext.parallelize(1 to 1000000000)
  .map(x => ("No." + r.nextInt(10000), "country" + x % 8, "city" + x % 50, x % 300))
  .toDF("ID", "country", "city", "population")
```
Two issues:
1. On the Presto client, I ran the same query twice but got different results:
```
presto:default> select country,sum(population) from carbon_table group by country;
country | _col1
----------+-------------
country4 | 18508531250
country2 | 18758431703
country0 | 18508717865
country7 | 18884021774
country1 | 18633160595
country5 | 18633480022
country6 | 18757895175
country3 | 18883151243
(8 rows)
Query 20180630_041406_00004_crn9q, FINISHED, 1 node
Splits: 65 total, 65 done (100.00%)
1:01 [1000M rows, 8.4GB] [16.5M rows/s, 142MB/s]
presto:default> select country,sum(population) from carbon_table group by country;
country | _col1
----------+-------------
country4 | 18500014852
country0 | 18499993972
country5 | 18624989449
country1 | 18625008398
country3 | 18874966666
country6 | 18749995166
country7 | 18874992446
country2 | 18749999687
(8 rows)
Query 20180630_041510_00005_crn9q, FINISHED, 1 node
Splits: 65 total, 65 done (100.00%)
0:59 [1000M rows, 8.4GB] [17M rows/s, 146MB/s]
```
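Note that `population` (`x % 300`) and `country` (`"country" + x % 8`) are deterministic functions of `x`, so the correct per-country sums can be computed analytically and compared against both runs. A minimal sketch (my own check, not part of the original script): both columns repeat with period lcm(8, 300) = 600, so one cycle plus the leftover rows suffices.

```scala
// Analytic per-country sums for the generated data:
// country = "country" + x % 8, population = x % 300, x = 1..n.
// Both columns repeat every lcm(8, 300) = 600 rows, so brute-force one cycle.
def expectedSums(n: Long): IndexedSeq[Long] = {
  val cycle = 600L
  val full  = n / cycle            // number of complete 600-row cycles
  val rem   = n % cycle            // leftover rows; x = full*600 + k has the
                                   // same residues mod 8 and mod 300 as k
  val perCycle = new Array[Long](8)
  val perRem   = new Array[Long](8)
  var x = 1L
  while (x <= cycle) {
    val c = (x % 8).toInt
    perCycle(c) += x % 300
    if (x <= rem) perRem(c) += x % 300
    x += 1
  }
  (0 until 8).map(c => full * perCycle(c) + perRem(c))
}

val totals = expectedSums(1000000000L)
totals.zipWithIndex.foreach { case (s, c) => println(s"country$c: $s") }
// country0: 18499998900
println(s"grand total: ${totals.sum}")
// grand total: 149499990100
```

These expected values match the Spark output shown below exactly, but neither Presto run above (e.g. country0 = 18508717865 and 18499993972 vs. the expected 18499998900), so both Presto results look incorrect, not just inconsistent.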
2. For aggregation over the 1-billion-row data set, Presto performance is much lower than Spark's: Presto takes around 1 minute, while Spark takes around 33 seconds, as shown below:
```
scala> benchmark { carbon.sql("select country,sum(population) from carbon_table group by country").show }
+--------+---------------+
| country|sum(population)|
+--------+---------------+
|country4| 18499998700|
|country1| 18624998800|
|country3| 18874998800|
|country7| 18874998700|
|country2| 18749998800|
|country6| 18749998700|
|country5| 18624998700|
|country0| 18499998900|
+--------+---------------+
33849.999703ms
```
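The `benchmark` helper used in the Spark shell session above is not shown; a common Scala sketch of such a helper (an assumption on my part, not necessarily the exact one used here) times a block and prints the elapsed milliseconds:

```scala
// Hypothetical timing wrapper; the actual `benchmark` definition from the
// session is not shown in this comment.
def benchmark[T](body: => T): T = {
  val start = System.nanoTime()
  val result = body                              // evaluate the by-name block
  val elapsedMs = (System.nanoTime() - start) / 1e6
  println(s"${elapsedMs}ms")                     // e.g. "33849.999703ms"
  result
}
```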
---