Tez 0.6.0 produces 1-byte FileSplits from TextInputFormat

Gopal V (JIRA) Thu, 18 Jun 2015 10:05:16 -0700

     [ 
https://issues.apache.org/jira/browse/HIVE-10746?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Gopal V updated HIVE-10746:
---------------------------
    Description: 
The following query: "SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount 
FROM adw.crc_arsn GROUP BY appl_user_id,arsn_cd ORDER BY appl_user_id;" runs 
consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting to run 
this same query against Tez as the execution engine it consistently runs for 
over 300-500 seconds this seems extremely long. This is a basic external table 
delimited by tabs and is a single file in a folder. In Hive 0.13 this query 
with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive 1.2.0 
and there clearly is something going awry with Hive w/Tez as an execution 
engine with Single or small file tables. I can attach further logs if someone 
needs them for deeper analysis.

HDFS Output:
hadoop fs -ls /example_dw/crc/arsn
Found 2 items
-rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
/example_dw/crc/arsn/_SUCCESS
-rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
/example_dw/crc/arsn/part-m-00000


Hive Table Describe:
{code}
hive> describe formatted crc_arsn;
OK
# col_name              data_type               comment             
                 
arsn_cd                 string                                      
clmlvl_cd               string                                      
arclss_cd               string                                      
arclssg_cd              string                                      
arsn_prcsr_rmk_ind      string                                      
arsn_mbr_rspns_ind      string                                      
savtyp_cd               string                                      
arsn_eff_dt             string                                      
arsn_exp_dt             string                                      
arsn_pstd_dts           string                                      
arsn_lstupd_dts         string                                      
arsn_updrsn_txt         string                                      
appl_user_id            string                                      
arsntyp_cd              string                                      
pre_d_indicator         string                                      
arsn_display_txt        string                                      
arstat_cd               string                                      
arsn_tracking_no        string                                      
arsn_cstspcfc_ind       string                                      
arsn_mstr_rcrd_ind      string                                      
state_specific_ind      string                                      
region_specific_in      string                                      
arsn_dpndnt_cd          string                                      
unit_adjustment_in      string                                      
arsn_mbr_only_ind       string                                      
arsn_qrmb_ind           string                                      
                 
# Detailed Table Information             
Database:               adw                      
Owner:                  [email protected]   
CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn   
     
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        transient_lastDdlTime   1398706085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
 
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        line.delim              \n                  
        serialization.format    \t                  
Time taken: 1.245 seconds, Fetched: 54 row(s)

{code}


Explain Hive 1.2.0 w/Tez:
{code}
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)


Explain Hive 0.13 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
{code}

Results:
{code}
        Hive 1.2.0 w/Spark 1.3.1:
                Finished successfully in 7.09 seconds

        Hive 1.2.0 w/Mapreduce:
                Stage 1: 32 Seconds
                Stage 2: 35 Seconds

        Hive 1.2.0 w/Tez 0.5.3:
                Time taken: 565.025 seconds, Fetched: 11516 row(s)
        
        Hive 0.13 w/Tez 0.4.0:
                Time taken: 13.552 seconds, Fetched: 11516 row(s)
{code}

And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
Tez it looks to be stuck running the same method over and over again:

{code}
8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read 
- 1
2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000

......


2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
closing... 
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440
{code}

  was:
The following query: "SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount 
FROM adw.crc_arsn GROUP BY appl_user_id,arsn_cd ORDER BY appl_user_id;" runs 
consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting to run 
this same query against Tez as the execution engine it consistently runs for 
over 300-500 seconds this seems extremely long. This is a basic external table 
delimited by tabs and is a single file in a folder. In Hive 0.13 this query 
with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now Hive 1.2.0 
and there clearly is something going awry with Hive w/Tez as an execution 
engine with Single or small file tables. I can attach further logs if someone 
needs them for deeper analysis.

HDFS Output:
hadoop fs -ls /example_dw/crc/arsn
Found 2 items
-rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
/example_dw/crc/arsn/_SUCCESS
-rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
/example_dw/crc/arsn/part-m-00000


Hive Table Describe:
hive> describe formatted crc_arsn;
OK
# col_name              data_type               comment             
                 
arsn_cd                 string                                      
clmlvl_cd               string                                      
arclss_cd               string                                      
arclssg_cd              string                                      
arsn_prcsr_rmk_ind      string                                      
arsn_mbr_rspns_ind      string                                      
savtyp_cd               string                                      
arsn_eff_dt             string                                      
arsn_exp_dt             string                                      
arsn_pstd_dts           string                                      
arsn_lstupd_dts         string                                      
arsn_updrsn_txt         string                                      
appl_user_id            string                                      
arsntyp_cd              string                                      
pre_d_indicator         string                                      
arsn_display_txt        string                                      
arstat_cd               string                                      
arsn_tracking_no        string                                      
arsn_cstspcfc_ind       string                                      
arsn_mstr_rcrd_ind      string                                      
state_specific_ind      string                                      
region_specific_in      string                                      
arsn_dpndnt_cd          string                                      
unit_adjustment_in      string                                      
arsn_mbr_only_ind       string                                      
arsn_qrmb_ind           string                                      
                 
# Detailed Table Information             
Database:               adw                      
Owner:                  [email protected]   
CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn   
     
Table Type:             EXTERNAL_TABLE           
Table Parameters:                
        EXTERNAL                TRUE                
        transient_lastDdlTime   1398706085          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe      
 
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        line.delim              \n                  
        serialization.format    \t                  
Time taken: 1.245 seconds, Fetched: 54 row(s)




Explain Hive 1.2.0 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)


Explain Hive 0.13 w/Tez:
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 is a root stage

STAGE PLANS:
  Stage: Stage-1
    Tez
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 

Results:
        Hive 1.2.0 w/Spark 1.3.1:
                Finished successfully in 7.09 seconds

        Hive 1.2.0 w/Mapreduce:
                Stage 1: 32 Seconds
                Stage 2: 35 Seconds

        Hive 1.2.0 w/Tez 0.5.3:
                Time taken: 565.025 seconds, Fetched: 11516 row(s)
        
        Hive 0.13 w/Tez 0.4.0:
                Time taken: 13.552 seconds, Fetched: 11516 row(s)

And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
Tez it looks to be stuck running the same method over and over again:

8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records read 
- 1
2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000

......


2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
Processing file 
hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
closing... 
2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: RECORDS_IN_Map_1:13440


>  Hive 1.2.0 w/ Tez 0.5.3/Tez 0.6.0 produces 1-byte FileSplits from 
> TextInputFormat
> ----------------------------------------------------------------------------------
>
>                 Key: HIVE-10746
>                 URL: https://issues.apache.org/jira/browse/HIVE-10746
>             Project: Hive
>          Issue Type: Bug
>          Components: Hive, Tez
>    Affects Versions: 0.14.0, 0.14.1, 1.2.0, 1.1.0, 1.1.1
>            Reporter: Greg Senia
>            Assignee: Gopal V
>            Priority: Critical
>         Attachments: HIVE-10746.1.patch, HIVE-10746.2.patch, 
> slow_query_output.zip
>
>
> The following query: "SELECT appl_user_id, arsn_cd, COUNT(*) as RecordCount 
> FROM adw.crc_arsn GROUP BY appl_user_id,arsn_cd ORDER BY appl_user_id;" runs 
> consistently fast in Spark and Mapreduce on Hive 1.2.0. When attempting to 
> run this same query against Tez as the execution engine it consistently runs 
> for over 300-500 seconds this seems extremely long. This is a basic external 
> table delimited by tabs and is a single file in a folder. In Hive 0.13 this 
> query with Tez runs fast and I tested with Hive 0.14, 0.14.1/1.0.0 and now 
> Hive 1.2.0 and there clearly is something going awry with Hive w/Tez as an 
> execution engine with Single or small file tables. I can attach further logs 
> if someone needs them for deeper analysis.
> HDFS Output:
> hadoop fs -ls /example_dw/crc/arsn
> Found 2 items
> -rwxr-x---   6 loaduser hadoopusers          0 2015-05-17 20:03 
> /example_dw/crc/arsn/_SUCCESS
> -rwxr-x---   6 loaduser hadoopusers    3883880 2015-05-17 20:03 
> /example_dw/crc/arsn/part-m-00000
> Hive Table Describe:
> {code}
> hive> describe formatted crc_arsn;
> OK
> # col_name              data_type               comment             
>                  
> arsn_cd                 string                                      
> clmlvl_cd               string                                      
> arclss_cd               string                                      
> arclssg_cd              string                                      
> arsn_prcsr_rmk_ind      string                                      
> arsn_mbr_rspns_ind      string                                      
> savtyp_cd               string                                      
> arsn_eff_dt             string                                      
> arsn_exp_dt             string                                      
> arsn_pstd_dts           string                                      
> arsn_lstupd_dts         string                                      
> arsn_updrsn_txt         string                                      
> appl_user_id            string                                      
> arsntyp_cd              string                                      
> pre_d_indicator         string                                      
> arsn_display_txt        string                                      
> arstat_cd               string                                      
> arsn_tracking_no        string                                      
> arsn_cstspcfc_ind       string                                      
> arsn_mstr_rcrd_ind      string                                      
> state_specific_ind      string                                      
> region_specific_in      string                                      
> arsn_dpndnt_cd          string                                      
> unit_adjustment_in      string                                      
> arsn_mbr_only_ind       string                                      
> arsn_qrmb_ind           string                                      
>                  
> # Detailed Table Information             
> Database:               adw                      
> Owner:                  [email protected]   
> CreateTime:             Mon Apr 28 13:28:05 EDT 2014     
> LastAccessTime:         UNKNOWN                  
> Protect Mode:           None                     
> Retention:              0                        
> Location:               hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn 
>        
> Table Type:             EXTERNAL_TABLE           
> Table Parameters:                
>         EXTERNAL                TRUE                
>         transient_lastDdlTime   1398706085          
>                  
> # Storage Information            
> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe    
>    
> InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
> OutputFormat:           
> org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
> Compressed:             No                       
> Num Buckets:            -1                       
> Bucket Columns:         []                       
> Sort Columns:           []                       
> Storage Desc Params:             
>         field.delim             \t                  
>         line.delim              \n                  
>         serialization.format    \t                  
> Time taken: 1.245 seconds, Fetched: 54 row(s)
> {code}
> Explain Hive 1.2.0 w/Tez:
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
> Explain Hive 0.13 w/Tez:
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE) 
> {code}
> Results:
> {code}
>       Hive 1.2.0 w/Spark 1.3.1:
>               Finished successfully in 7.09 seconds
>       Hive 1.2.0 w/Mapreduce:
>               Stage 1: 32 Seconds
>               Stage 2: 35 Seconds
>       Hive 1.2.0 w/Tez 0.5.3:
>               Time taken: 565.025 seconds, Fetched: 11516 row(s)
>       
>       Hive 0.13 w/Tez 0.4.0:
>               Time taken: 13.552 seconds, Fetched: 11516 row(s)
> {code}
> And finally looking at the Dag Attempt that is stuck for 500 seconds or so in 
> Tez it looks to be stuck running the same method over and over again:
> {code}
> 8 duration=2561 from=org.apache.hadoop.hive.ql.exec.tez.RecordProcessor>
> 2015-05-18 19:58:41,719 INFO [TezChild] exec.Utilities: PLAN PATH = 
> hdfs://xhadnnm1p.example.com:8020/tmp/hive/gss2002/dbc4b0b5-7859-4487-a56d-969440bc5e90/hive_2015-05-18_19-58-25_951_5497535752804149087-1/gss2002/_tez_scratch_dir/4e635121-c4cd-4e3f-b96b-9f08a6a7bf5d/map.xml
> 2015-05-18 19:58:41,822 INFO [TezChild] exec.MapOperator: MAP[4]: records 
> read - 1
> 2015-05-18 19:58:41,835 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 19:58:41,848 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> ......
> 2015-05-18 20:07:46,560 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,574 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,587 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] io.HiveContextAwareRecordReader: 
> Processing file 
> hdfs://xhadnnm1p.example.com:8020/example_dw/crc/arsn/part-m-00000
> 2015-05-18 20:07:46,603 INFO [TezChild] log.PerfLogger: </PERFLOG 
> method=TezRunProcessor start=1431993518764 end=1431994066603 duration=547839 
> from=org.apache.hadoop.hive.ql.exec.tez.TezProcessor>
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 4 finished. 
> closing... 
> 2015-05-18 20:07:46,603 INFO [TezChild] exec.MapOperator: 
> RECORDS_IN_Map_1:13440
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Updated] (HIVE-10746) Hive 1.2.0 w/ Tez 0.5.3/Tez 0.6.0 produces 1-byte FileSplits from TextInputFormat

Reply via email to