[jira] [Updated] (HIVE-10746) Hive 1.2.0+Tez produces 1-byte FileSplits from mapred.TextInputFormat

Damien Carol (JIRA) Thu, 18 Jun 2015 13:52:48 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Damien Carol updated HIVE-10746:
--------------------------------
    Description: 
The following query: 
{code:sql}
SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount FROM adw.crc_arsn GROUP 
BY appl_user_id,arsn_cd ORDER BY appl_user_id;
{code}
 runs consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting 
to run this same query against Tez as the execution engine it consistently runs 
for over 300-500 seconds this seems extremely long. This is a basic external 
table delimited by tabs and is a single file in a folder. In Hive 0.13 this 
query with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive 
1.2.0 and there clearly is something going awry with Hive w/Tez as an execution 
engine with Single or small file tables. I can attach further logs if someone 
needs them for deeper analysis.

HDFS Output:
{noformat}
hadoop fs -ls /example_dw/crc/arsn
Found 2 items
-rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
/example_dw/crc/arsn/_SUCCESS
-rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
/example_dw/crc/arsn/part-m-00000
{noformat}

Hive Table Describe:
{noformat}
hive> describe formatted crc_arsn;
OK
# col_name              data_type               comment             
                 
arsn_cd                 string                                      
clmlvl_cd               string                                      
arclss_cd               string                                      
arclssg_cd              string                                      
arsn_prcsr_rmk_ind      string                                      
arsn_mbr_rspns_ind      string                                      
savtyp_cd               string                                      
arsn_eff_dt             string                                      
arsn_exp_dt             string                                      
arsn_pstd_dts           string                                      
arsn_lstupd_dts         string                                      
arsn_updrsn_txt         string                                      
appl_user_id            string                                      
arsntyp_cd              string                                      
pre_d_indicator         string                                      
arsn_display_txt        string                                      
arstat_cd               string                                      
arsn_tracking_no        string                                      
arsn_cstspcfc_ind       string                                      
arsn_mstr_rcrd_ind      string                                      
state_specific_ind      string                                      
region_specific_in      string                                      
arsn_dpndnt_cd          string                                      
unit_adjustment_in      string                                      
arsn_mbr_only_ind       string                                      
arsn_qrmb_ind           string                                      
                 
# Detailed Table Information             
Database:               adw                      
Owner:                  [email protected]   
CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn   
     
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        transient_lastDdlTime   1398706085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
 
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        line.delim              \n                  
        serialization.format    \t                  
Time taken: 1.245 seconds, Fetched: 54 row(s)
{noformat}


Explain Hive 1.2.0 w/Tez:
{noformat}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)


Explain Hive 0.13 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
{noformat}

Results:
{noformat}
        Hive 1.2.0 w/Spark 1.3.1:
                Finished successfully in 7.09 seconds

        Hive 1.2.0 w/Mapreduce:
                Stage 1: 32 Seconds
                Stage 2: 35 Seconds

        Hive 1.2.0 w/Tez 0.5.3:
                Time taken: 565.025 seconds, Fetched: 11516 row(s)
        
        Hive 0.13 w/Tez 0.4.0:
                Time taken: 13.552 seconds, Fetched: 11516 row(s)
{noformat}

And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
Tez it looks to be stuck running the same method over and over again:

{noformat}
8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read 
- 1
2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000

......


2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
closing... 
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440
{noformat}

  was:
The following query: "SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount 
FROM adw.crc_arsn GROUP BY appl_user_id,arsn_cd ORDER BY appl_user_id;" runs 
consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting to run 
this same query against Tez as the execution engine it consistently runs for 
over 300-500 seconds this seems extremely long. This is a basic external table 
delimited by tabs and is a single file in a folder. In Hive 0.13 this query 
with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive 1.2.0 
and there clearly is something going awry with Hive w/Tez as an execution 
engine with Single or small file tables. I can attach further logs if someone 
needs them for deeper analysis.

HDFS Output:
hadoop fs -ls /example_dw/crc/arsn
Found 2 items
-rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
/example_dw/crc/arsn/_SUCCESS
-rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
/example_dw/crc/arsn/part-m-00000


Hive Table Describe:
{code}
hive> describe formatted crc_arsn;
OK
# col_name              data_type               comment             
                 
arsn_cd                 string                                      
clmlvl_cd               string                                      
arclss_cd               string                                      
arclssg_cd              string                                      
arsn_prcsr_rmk_ind      string                                      
arsn_mbr_rspns_ind      string                                      
savtyp_cd               string                                      
arsn_eff_dt             string                                      
arsn_exp_dt             string                                      
arsn_pstd_dts           string                                      
arsn_lstupd_dts         string                                      
arsn_updrsn_txt         string                                      
appl_user_id            string                                      
arsntyp_cd              string                                      
pre_d_indicator         string                                      
arsn_display_txt        string                                      
arstat_cd               string                                      
arsn_tracking_no        string                                      
arsn_cstspcfc_ind       string                                      
arsn_mstr_rcrd_ind      string                                      
state_specific_ind      string                                      
region_specific_in      string                                      
arsn_dpndnt_cd          string                                      
unit_adjustment_in      string                                      
arsn_mbr_only_ind       string                                      
arsn_qrmb_ind           string                                      
                 
# Detailed Table Information             
Database:               adw                      
Owner:                  [email protected]   
CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn   
     
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        transient_lastDdlTime   1398706085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
 
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        line.delim              \n                  
        serialization.format    \t                  
Time taken: 1.245 seconds, Fetched: 54 row(s)

{code}


Explain Hive 1.2.0 w/Tez:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)


Explain Hive 0.13 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
{code}

Results:
{code}
        Hive 1.2.0 w/Spark 1.3.1:
                Finished successfully in 7.09 seconds

        Hive 1.2.0 w/Mapreduce:
                Stage 1: 32 Seconds
                Stage 2: 35 Seconds

        Hive 1.2.0 w/Tez 0.5.3:
                Time taken: 565.025 seconds, Fetched: 11516 row(s)
        
        Hive 0.13 w/Tez 0.4.0:
                Time taken: 13.552 seconds, Fetched: 11516 row(s)
{code}

And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
Tez it looks to be stuck running the same method over and over again:

{code}
8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read 
- 1
2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000

......


2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
closing... 
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440
{code}


>  Hive 1.2.0+Tez produces 1-byte FileSplits from mapred.TextInputFormat
> ----------------------------------------------------------------------
>
>                 Key: HIVE-10746
>                 URL: https://issues.apache.org/jira/browse/HIVE-10746
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Tez
>    Affects Versions: 0.14.0, 0.14.1, 1.2.0, 1.1.0, 1.1.1
>            Reporter: Greg Senia
>            Assignee: Gopal V
>            Priority: Critical
>             Fix For: 1.2.1
>
>         Attachments: HIVE-10746.1.patch, HIVE-10746.2.patch, 
> slow_query_output.zip
>
>
> The following query: 
> {code:sql}
> SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount FROM adw.crc_arsn GROUP 
> BY appl_user_id,arsn_cd ORDER BY appl_user_id;
> {code}
>  runs consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting 
> to run this same query against Tez as the execution engine it consistently 
> runs for over 300-500 seconds this seems extremely long. This is a basic 
> external table delimited by tabs and is a single file in a folder. In Hive 
> 0.13 this query with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 
> and now Hive 1.2.0 and there clearly is something going awry with Hive w/Tez 
> as an execution engine with Single or small file tables. I can attach further 
> logs if someone needs them for deeper analysis.
> HDFS Output:
> {noformat}
> hadoop fs -ls /example_dw/crc/arsn
> Found 2 items
> -rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
> /example_dw/crc/arsn/_SUCCESS
> -rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
> /example_dw/crc/arsn/part-m-00000
> {noformat}
> Hive Table Describe:
> {noformat}
> hive> describe formatted crc_arsn;
> OK
> # col_name              data_type               comment             
>                  
> arsn_cd                 string                                      
> clmlvl_cd               string                                      
> arclss_cd               string                                      
> arclssg_cd              string                                      
> arsn_prcsr_rmk_ind      string                                      
> arsn_mbr_rspns_ind      string                                      
> savtyp_cd               string                                      
> arsn_eff_dt             string                                      
> arsn_exp_dt             string                                      
> arsn_pstd_dts           string                                      
> arsn_lstupd_dts         string                                      
> arsn_updrsn_txt         string                                      
> appl_user_id            string                                      
> arsntyp_cd              string                                      
> pre_d_indicator         string                                      
> arsn_display_txt        string                                      
> arstat_cd               string                                      
> arsn_tracking_no        string                                      
> arsn_cstspcfc_ind       string                                      
> arsn_mstr_rcrd_ind      string                                      
> state_specific_ind      string                                      
> region_specific_in      string                                      
> arsn_dpndnt_cd          string                                      
> unit_adjustment_in      string                                      
> arsn_mbr_only_ind       string                                      
> arsn_qrmb_ind           string                                      
>                  
> # Detailed Table Information             
> Database:               adw                      
> Owner:                  [email protected]   
> CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
> LastAccessTime:         UNKNOWN                  
> Protect Mode:           None                     
> Retention:              0                        
> Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn 
>        
> Table Type:             EXTERNAL_TABLE           
> Table Parameters:                
>         EXTERNAL                TRUE                
>         transient_lastDdlTime   1398706085          
>                  
> # Storage Information            
> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe    
>    
> InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
> OutputFormat:           
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
> Compressed:             No                       
> Num Buckets:            -1                       
> Bucket Columns:         []                       
> Sort Columns:           []                       
> Storage Desc Params:             
>         field.delim             \t                  
>         line.delim              \n                  
>         serialization.format    \t                  
> Time taken: 1.245 seconds, Fetched: 54 row(s)
> {noformat}
> Explain Hive 1.2.0 w/Tez:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> Explain Hive 0.13 w/Tez:
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
> {noformat}
> Results:
> {noformat}
>       Hive 1.2.0 w/Spark 1.3.1:
>               Finished successfully in 7.09 seconds
>       Hive 1.2.0 w/Mapreduce:
>               Stage 1: 32 Seconds
>               Stage 2: 35 Seconds
>       Hive 1.2.0 w/Tez 0.5.3:
>               Time taken: 565.025 seconds, Fetched: 11516 row(s)
>       
>       Hive 0.13 w/Tez 0.4.0:
>               Time taken: 13.552 seconds, Fetched: 11516 row(s)
> {noformat}
> And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
> Tez it looks to be stuck running the same method over and over again:
> {noformat}
> 8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
> hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
> 2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records 
> read - 1
> 2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> ......
> 2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
> method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
> from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
> closing... 
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 
> RECORDS_IN_Map_1:13440
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-10746) Hive 1.2.0+Tez produces 1-byte FileSplits from mapred.TextInputFormat

Reply via email to