[ https://issues.apache.org/jira/browse/HIVE-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395861#comment-16395861 ]
BELUGA BEHR commented on HIVE-5795:
-----------------------------------

This is an oldie, but a goodie. I was looking at this ticket and trying to decide whether to propose that we drop this feature from Hive 3.0. I saw that there are some reports here of issues, and a suggestion that this should be an "experimental" feature. How long should that go on? I ran some sample data through this feature using Google's [NGram|http://storage.googleapis.com/books/ngrams/books/datasetsv2.html] data. I converted it from tab-separated (TSV) to comma-separated (CSV) before loading it into HDFS. I also added a header to each file in the Hive table: {{ngram,year,match_count,volume_count}}.

{code:sql}
-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram1 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram1';

-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram2 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram2'
TBLPROPERTIES ("skip.header.line.count"="1");

-- 5 Files / Each file has 1 header + 9 rows of data (10 total rows)
CREATE EXTERNAL TABLE ngram3 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram3'
TBLPROPERTIES ("skip.header.line.count"="1");
{code}

{code:java|title=select count(1) from ngram1}
INFO : Compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 0.047 seconds
INFO : Executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO : Query ID = hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:9
INFO : Submitting tokens for job: job_1520871396884_0011
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0011/
INFO : Starting Job = job_1520871396884_0011, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0011/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0011
INFO : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO : 2018-03-12 13:27:32,281 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:27:51,424 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 8.69 sec
INFO : 2018-03-12 13:27:53,520 Stage-1 map = 70%, reduce = 0%, Cumulative CPU 91.66 sec
INFO : 2018-03-12 13:27:54,553 Stage-1 map = 85%, reduce = 0%, Cumulative CPU 93.15 sec
INFO : 2018-03-12 13:27:55,586 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 97.5 sec
INFO : 2018-03-12 13:28:01,793 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 100.0 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 40 seconds 0 msec
INFO : Ended Job = job_1520871396884_0011
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 9 Reduce: 1 Cumulative CPU: 100.0 sec HDFS Read: 2317486328 HDFS Write: 10 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 minutes 40 seconds 0 msec
INFO : Completed executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 36.666 seconds
INFO : OK
--
Result: 110075425
{code}

{code:java|title=select count(1) from ngram2}
INFO : Compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 0.046 seconds
INFO : Executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO : Query ID = hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:9
INFO : Submitting tokens for job: job_1520871396884_0012
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0012/
INFO : Starting Job = job_1520871396884_0012, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0012/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0012
INFO : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO : 2018-03-12 13:30:17,376 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:30:36,432 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 8.83 sec
INFO : 2018-03-12 13:30:37,478 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 39.6 sec
INFO : 2018-03-12 13:30:38,513 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 75.89 sec
INFO : 2018-03-12 13:30:39,543 Stage-1 map = 85%, reduce = 0%, Cumulative CPU 87.82 sec
INFO : 2018-03-12 13:30:40,573 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 89.99 sec
INFO : 2018-03-12 13:30:41,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 91.79 sec
INFO : 2018-03-12 13:30:46,771 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 94.58 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 34 seconds 580 msec
INFO : Ended Job = job_1520871396884_0012
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 9 Reduce: 1 Cumulative CPU: 94.58 sec HDFS Read: 2317486734 HDFS Write: 10 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 minutes 34 seconds 580 msec
INFO : Completed executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 35.149 seconds
INFO : OK
--
Result: 110075424
{code}

{code:java|title=select count(1) from ngram3}
INFO : Executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e): select count(1) from ngram3
INFO : Query ID = hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1520871396884_0013
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0013/
INFO : Starting Job = job_1520871396884_0013, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0013/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0013
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2018-03-12 13:32:21,292 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:32:26,519 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.44 sec
INFO : 2018-03-12 13:32:32,755 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.0 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 0 msec
INFO : Ended Job = job_1520871396884_0013
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.0 sec HDFS Read: 9229 HDFS Write: 3 SUCCESS
INFO : Total MapReduce CPU Time Spent: 4 seconds 0 msec
INFO : Completed executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e); Time taken: 17.337 seconds
INFO : OK
--
Result: 45
{code}

From this little test, we can see that *skip.header.line.count* seems to be working correctly. For a large file with 9 splits/mappers, it correctly removed only 1 row from the 'count' results; Hive did not process the entire file in one mapper, it correctly processed it in parallel. For a data set with 5 files (10 rows each), Hive correctly trimmed one row off each file, for a total of 45 rows of content, across multiple files in the context of a single mapper. And yet, I do not see any [documentation|https://www.google.com/search?q=site:apache.org+%22skip.header.line.count%22] covering this feature, except in some [Impala documentation|https://impala.apache.org/docs/build/html/topics/impala_txtfile.html]. Is this feature still useful, or should it be dropped, with the guidance being to do some sort of pre-processing before loading into HDFS? If this feature is being kept, it should be better documented.
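As a sanity check on the arithmetic above, here is a small Python sketch (not Hive code; the file layout and row values are made up to mirror the ngram3 table) showing that skipping one header line per file, independent of how files map to splits, yields the observed count:

```python
# Illustrative stand-in for the ngram3 layout: 5 files, each with
# 1 header line followed by 9 data rows (10 lines per file).
files = [
    ["ngram,year,match_count,volume_count"]
    + [f"word_{i},{1900 + i},1,1" for i in range(9)]
    for _ in range(5)
]

# Mirrors tblproperties ("skip.header.line.count"="1").
SKIP_HEADER_LINE_COUNT = 1

def count_rows(file_lines, skip_header):
    """Count data rows in one file, dropping the first `skip_header` lines."""
    return len(file_lines[skip_header:])

total = sum(count_rows(f, SKIP_HEADER_LINE_COUNT) for f in files)
print(total)  # 5 files x (10 - 1) rows = 45, matching the ngram3 result
```

The key property being tested in the comment is that the header is dropped once per *file*, not once per split, which is why ngram2 (one file, nine splits) loses exactly one row and ngram3 (five files, one split) loses five.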
> Hive should be able to skip header and footer rows when reading data file for a table
> -------------------------------------------------------------------------------------
>
>          Key: HIVE-5795
>          URL: https://issues.apache.org/jira/browse/HIVE-5795
>      Project: Hive
>   Issue Type: New Feature
>     Reporter: Shuaishuai Nie
>     Assignee: Shuaishuai Nie
>     Priority: Major
>       Labels: TODOC13
>      Fix For: 0.13.0
>
>  Attachments: HIVE-5795.1.patch, HIVE-5795.2.patch, HIVE-5795.3.patch, HIVE-5795.4.patch, HIVE-5795.5.patch
>
>
> Hive should be able to skip header and footer lines when reading a data file for a table. This way, users don't need to preprocess data generated by other applications with a header or footer, and can use the file directly for table operations.
> To implement this, the idea is to add new properties to the table description that define the number of header and footer lines, and to skip those lines when reading records from the record reader. A DDL example for creating a table with a header and footer:
> {code}
> CREATE EXTERNAL TABLE testtable (name string, message string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> LOCATION '/testtable'
> TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="2");
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
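The per-file behavior proposed in the quoted description can be sketched in Python as follows. This is a simplified model, not Hive's actual record reader: the header lines are dropped up front, and the footer lines are dropped by buffering the last N lines so the file length need not be known in advance. The sample lines are illustrative.

```python
from collections import deque

def read_records(lines, skip_header=1, skip_footer=2):
    """Yield data lines from one file, dropping the first `skip_header`
    and last `skip_footer` lines, streaming with a small buffer."""
    it = iter(lines)
    # Drop the header lines up front.
    for _ in range(skip_header):
        next(it, None)
    # Hold back `skip_footer` lines; a line is emitted only once we know
    # it cannot be one of the last `skip_footer` lines of the file.
    buf = deque()
    for line in it:
        buf.append(line)
        if len(buf) > skip_footer:
            yield buf.popleft()

file_lines = ["HEADER", "a\t1", "b\t2", "c\t3", "FOOTER1", "FOOTER2"]
print(list(read_records(file_lines)))  # ['a\t1', 'b\t2', 'c\t3']
```

Note that this buffering only works when each reader sees a whole file; as the earlier comment's ngram2 test suggests, handling footers in a file that is divided into multiple splits is the harder part of the design.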