[ 
https://issues.apache.org/jira/browse/DRILL-4982?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15713740#comment-15713740
 ] 

ASF GitHub Bot commented on DRILL-4982:
---------------------------------------

Github user bitblender commented on a diff in the pull request:

    https://github.com/apache/drill/pull/638#discussion_r90574793
  
    --- Diff: 
contrib/storage-hive/core/src/main/java/org/apache/drill/exec/store/hive/HiveAbstractReader.java
 ---
    @@ -218,17 +229,18 @@ private void init() throws ExecutionSetupException {
             }
           }
         } catch (Exception e) {
    -      throw new ExecutionSetupException("Failure while initializing 
HiveRecordReader: " + e.getMessage(), e);
    +      throw new ExecutionSetupException("Failure while initializing 
HiveOrcReader: " + e.getMessage(), e);
    --- End diff --
    
    "HiveOrcReader:" in the exception-string should be changed so that it 
prints the actual reader type. 


> Hive Queries degrade when queries switch between different formats
> ------------------------------------------------------------------
>
>                 Key: DRILL-4982
>                 URL: https://issues.apache.org/jira/browse/DRILL-4982
>             Project: Apache Drill
>          Issue Type: Bug
>            Reporter: Chunhui Shi
>            Assignee: Karthikeyan Manivannan
>            Priority: Critical
>
> We have seen degraded performance by doing these steps:
> 1) generate the repro data:
> python script repro.py as below:
> import string
> import random
>  
> for i in range(30000000):
>     x1 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     x2 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     x3 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     x4 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     x5 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     x6 = ''.join(random.choice(string.ascii_uppercase + string.digits) for _ 
> in range(random.randrange(19, 27)))
>     print 
> "{0}".format(x1),"{0}".format(x2),"{0}".format(x3),"{0}".format(x4),"{0}".format(x5),"{0}".format(x6)
> python repro.py > repro.csv
> 2) put these files in a dfs directory e.g. '/tmp/hiveworkspace/plain'. Under 
> hive prompt, use the following sql command to create an external table:
> CREATE EXTERNAL TABLE `hiveworkspace`.`plain` (`id1` string, `id2` string, 
> `id3` string, `id4` string, `id5` string, `id6` string) ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.serde2.OpenCSVSerde' STORED AS TEXTFILE LOCATION 
> '/tmp/hiveworkspace/plain'
> 3) create Hive's table of ORC|PARQUET format:
> CREATE TABLE `hiveworkspace`.`plainorc` STORED AS ORC AS SELECT 
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> CREATE TABLE `hiveworkspace`.`plainparquet` STORED AS PARQUET AS SELECT 
> id1,id2,id3,id4,id5,id6 from `hiveworkspace`.`plain`;
> 4) Query switch between these two tables, then the query time on the same 
> table significantly lengthened. On my setup, for ORC, it was 15sec -> 26secs. 
> Queries on table of other formats, after injecting a query to other formats, 
> all have significant slow down.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to