[ https://issues.apache.org/jira/browse/HIVE-5795?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16395861#comment-16395861 ]
BELUGA BEHR commented on HIVE-5795:
-----------------------------------

This is an oldie, but a goodie. I was looking at this ticket and trying to decide whether to propose that we drop this feature from Hive 3.0. I saw that there are some reports here of issues, and a suggestion that this should be an "experimental" feature. How long should that go on? I ran some sample data through this feature using Google's [NGram|http://storage.googleapis.com/books/ngrams/books/datasetsv2.html] data. I converted it from tab-separated (TSV) to comma-separated (CSV) before loading it into HDFS. I also added a header to each file in the Hive table: {{ngram,year,match_count,volume_count}}.

{code:sql}
-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram1 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram1';

-- 2.2 GB / 1 File
CREATE EXTERNAL TABLE ngram2 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram2'
TBLPROPERTIES ("skip.header.line.count"="1");

-- 5 Files / Each file has 1 header + 9 rows of data (10 total rows)
CREATE EXTERNAL TABLE ngram3 (
  ngram string,
  year int,
  match_count int,
  volume_count int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/user/admin/ngram3'
TBLPROPERTIES ("skip.header.line.count"="1");
{code}

{code:java|title=select count(1) from ngram1}
INFO : Compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 0.047 seconds
INFO : Executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969): select count(1) from ngram1
INFO : Query ID = hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:9
INFO : Submitting tokens for job: job_1520871396884_0011
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0011/
INFO : Starting Job = job_1520871396884_0011, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0011/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0011
INFO : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO : 2018-03-12 13:27:32,281 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:27:51,424 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 8.69 sec
INFO : 2018-03-12 13:27:53,520 Stage-1 map = 70%, reduce = 0%, Cumulative CPU 91.66 sec
INFO : 2018-03-12 13:27:54,553 Stage-1 map = 85%, reduce = 0%, Cumulative CPU 93.15 sec
INFO : 2018-03-12 13:27:55,586 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 97.5 sec
INFO : 2018-03-12 13:28:01,793 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 100.0 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 40 seconds 0 msec
INFO : Ended Job = job_1520871396884_0011
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 9 Reduce: 1 Cumulative CPU: 100.0 sec HDFS Read: 2317486328 HDFS Write: 10 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 minutes 40 seconds 0 msec
INFO : Completed executing command(queryId=hive_20180312132727_5a75ce57-53e5-4ab2-9b97-31f8b3502969); Time taken: 36.666 seconds
INFO : OK
--
Result: 110075425
{code}

{code:java|title=select count(1) from ngram2}
INFO : Compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO : Semantic Analysis Completed
INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:_c0, type:bigint, comment:null)], properties:null)
INFO : Completed compiling command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 0.046 seconds
INFO : Executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d): select count(1) from ngram2
INFO : Query ID = hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:9
INFO : Submitting tokens for job: job_1520871396884_0012
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0012/
INFO : Starting Job = job_1520871396884_0012, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0012/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0012
INFO : Hadoop job information for Stage-1: number of mappers: 9; number of reducers: 1
INFO : 2018-03-12 13:30:17,376 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:30:36,432 Stage-1 map = 11%, reduce = 0%, Cumulative CPU 8.83 sec
INFO : 2018-03-12 13:30:37,478 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 39.6 sec
INFO : 2018-03-12 13:30:38,513 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 75.89 sec
INFO : 2018-03-12 13:30:39,543 Stage-1 map = 85%, reduce = 0%, Cumulative CPU 87.82 sec
INFO : 2018-03-12 13:30:40,573 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 89.99 sec
INFO : 2018-03-12 13:30:41,604 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 91.79 sec
INFO : 2018-03-12 13:30:46,771 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 94.58 sec
INFO : MapReduce Total cumulative CPU time: 1 minutes 34 seconds 580 msec
INFO : Ended Job = job_1520871396884_0012
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 9 Reduce: 1 Cumulative CPU: 94.58 sec HDFS Read: 2317486734 HDFS Write: 10 SUCCESS
INFO : Total MapReduce CPU Time Spent: 1 minutes 34 seconds 580 msec
INFO : Completed executing command(queryId=hive_20180312133030_804a096b-f5d8-46c8-a255-4a958e6d2f0d); Time taken: 35.149 seconds
INFO : OK
--
Result: 110075424
{code}

{code:java|title=select count(1) from ngram3}
INFO : Executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e): select count(1) from ngram3
INFO : Query ID = hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e
INFO : Total jobs = 1
INFO : Launching Job 1 out of 1
INFO : Starting task [Stage-1:MAPRED] in serial mode
INFO : Number of reduce tasks determined at compile time: 1
INFO : In order to change the average load for a reducer (in bytes):
INFO : set hive.exec.reducers.bytes.per.reducer=<number>
INFO : In order to limit the maximum number of reducers:
INFO : set hive.exec.reducers.max=<number>
INFO : In order to set a constant number of reducers:
INFO : set mapreduce.job.reduces=<number>
INFO : number of splits:1
INFO : Submitting tokens for job: job_1520871396884_0013
INFO : The url to track the job: http://myhost:8088/proxy/application_1520871396884_0013/
INFO : Starting Job = job_1520871396884_0013, Tracking URL = http://myhost:8088/proxy/application_1520871396884_0013/
INFO : Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1520871396884_0013
INFO : Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
INFO : 2018-03-12 13:32:21,292 Stage-1 map = 0%, reduce = 0%
INFO : 2018-03-12 13:32:26,519 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.44 sec
INFO : 2018-03-12 13:32:32,755 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 4.0 sec
INFO : MapReduce Total cumulative CPU time: 4 seconds 0 msec
INFO : Ended Job = job_1520871396884_0013
INFO : MapReduce Jobs Launched:
INFO : Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 4.0 sec HDFS Read: 9229 HDFS Write: 3 SUCCESS
INFO : Total MapReduce CPU Time Spent: 4 seconds 0 msec
INFO : Completed executing command(queryId=hive_20180312133232_a10fcc3c-d8d6-4377-86b5-c76e9cdc9c8e); Time taken: 17.337 seconds
INFO : OK
--
Result: 45
{code}

From this little test, we can see that *skip.header.line.count* seems to be working correctly. For a large file with 9 splits/mappers, it correctly removed only 1 row from the 'count' results; Hive did not process the entire file in one mapper, it correctly processed it in parallel. For a data set with 5 files (10 rows each), Hive correctly trimmed one row off each file, for a total of 45 rows of content, across multiple files in the context of a single mapper. And yet, I do not see any [documentation|https://www.google.com/search?q=site:apache.org+%22skip.header.line.count%22] covering this feature, except in some [Impala documentation|https://impala.apache.org/docs/build/html/topics/impala_txtfile.html]. Is this feature still useful, or should it be dropped, with the guidance being to do some sort of pre-processing before loading into HDFS? If this feature is being kept, it should be better documented.
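As a sanity check on the arithmetic above, here is a small Python sketch (not Hive code; the file layout and row values are made up to mirror the ngram3 table) showing that skipping one header line per file, independent of how files map to splits, yields the observed count:

```python
# Illustrative stand-in for the ngram3 layout: 5 files, each with
# 1 header line followed by 9 data rows (10 lines per file).
files = [
    ["ngram,year,match_count,volume_count"]
    + [f"word_{i},{1900 + i},1,1" for i in range(9)]
    for _ in range(5)
]

# Mirrors tblproperties ("skip.header.line.count"="1").
SKIP_HEADER_LINE_COUNT = 1

def count_rows(file_lines, skip_header):
    """Count data rows in one file, dropping the first `skip_header` lines."""
    return len(file_lines[skip_header:])

total = sum(count_rows(f, SKIP_HEADER_LINE_COUNT) for f in files)
print(total)  # 5 files x (10 - 1) rows = 45, matching the ngram3 result
```

The key property being tested in the comment is that the header is dropped once per *file*, not once per split, which is why ngram2 (one file, nine splits) loses exactly one row and ngram3 (five files, one split) loses five.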
> Hive should be able to skip header and footer rows when reading data file for a table
> -------------------------------------------------------------------------------------
>
>          Key: HIVE-5795
>          URL: https://issues.apache.org/jira/browse/HIVE-5795
>      Project: Hive
>   Issue Type: New Feature
>     Reporter: Shuaishuai Nie
>     Assignee: Shuaishuai Nie
>     Priority: Major
>       Labels: TODOC13
>      Fix For: 0.13.0
>
>  Attachments: HIVE-5795.1.patch, HIVE-5795.2.patch, HIVE-5795.3.patch, HIVE-5795.4.patch, HIVE-5795.5.patch
>
>
> Hive should be able to skip header and footer lines when reading a data file for a table. This way, users don't need to preprocess data generated by other applications with a header or footer, and can use the file directly for table operations.
> To implement this, the idea is to add new properties to the table description that define the number of header and footer lines, and to skip those lines when reading records from the record reader. A DDL example for creating a table with a header and footer:
> {code}
> CREATE EXTERNAL TABLE testtable (name string, message string)
> ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
> LINES TERMINATED BY '\n'
> LOCATION '/testtable'
> TBLPROPERTIES ("skip.header.line.count"="1", "skip.footer.line.count"="2");
> {code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
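The per-file behavior proposed in the quoted description can be sketched in Python as follows. This is a simplified model, not Hive's actual record reader: the header lines are dropped up front, and the footer lines are dropped by buffering the last N lines so the file length need not be known in advance. The sample lines are illustrative.

```python
from collections import deque

def read_records(lines, skip_header=1, skip_footer=2):
    """Yield data lines from one file, dropping the first `skip_header`
    and last `skip_footer` lines, streaming with a small buffer."""
    it = iter(lines)
    # Drop the header lines up front.
    for _ in range(skip_header):
        next(it, None)
    # Hold back `skip_footer` lines; a line is emitted only once we know
    # it cannot be one of the last `skip_footer` lines of the file.
    buf = deque()
    for line in it:
        buf.append(line)
        if len(buf) > skip_footer:
            yield buf.popleft()

file_lines = ["HEADER", "a\t1", "b\t2", "c\t3", "FOOTER1", "FOOTER2"]
print(list(read_records(file_lines)))  # ['a\t1', 'b\t2', 'c\t3']
```

Note that this buffering only works when each reader sees a whole file; as the earlier comment's ngram2 test suggests, handling footers in a file that is divided into multiple splits is the harder part of the design.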