[
https://issues.apache.org/jira/browse/HIVE-12860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Pengcheng Xiong updated HIVE-12860:
-----------------------------------
Target Version/s: 3.0.0 (was: 2.2.0)
> Add WITH HEADER option to INSERT OVERWRITE DIRECTORY
> ----------------------------------------------------
>
> Key: HIVE-12860
> URL: https://issues.apache.org/jira/browse/HIVE-12860
> Project: Hive
> Issue Type: New Feature
> Components: Hive
> Reporter: Elliot West
> Assignee: Elliot West
>
> _As a Hive user_
> _I'd like the option to seamlessly write out a header row to
> file-system-based result sets_
> _So that I can generate reports with a specification that mandates a header
> row._
> h3. Motivations
> There is a significant use-case where Hive is used to construct a scheduled
> data processing pipeline that generates a report in HDFS for consumption by
> some third party (internal or external). This report may then be transferred
> out of the system for consumption by other tools or processes. It is not
> uncommon for the third party to specify that the report includes a header row
> at the start of the file. The current options for adding headers are
> difficult to use effectively and elegantly.
> h3. Acceptance criteria
> * {{INSERT OVERWRITE DIRECTORY}} commands can be invoked with an option to
> include a header row at the start of the result set file.
> * The header row will contain the column names derived from the accompanying
> {{SELECT}} query.
> * Multiple tasks will likely be writing the final file of the query result
> set. In this event, only the task writing the first chunk of the file should
> emit the header row.
> h3. Proposed HQL changes
> {code}
> 1. INSERT OVERWRITE [LOCAL] DIRECTORY directory1
> 2. [ROW FORMAT row_format] [STORED AS file_format]
> 3. [WITH HEADER]
> 4. SELECT ... FROM ...
> {code}
> It is proposed that the {{WITH HEADER}} stanza at line 3 be introduced to
> enable this feature.
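> For illustration, a report query using the proposed syntax might read as
> follows. This is only a sketch: the directory path, the {{sales}} table and
> its columns are hypothetical, and {{WITH HEADER}} is the stanza proposed in
> this issue, not existing Hive syntax.
> {code}
> -- Hypothetical report: path, table and column names are assumptions.
> INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> STORED AS TEXTFILE
> WITH HEADER  -- proposed stanza, not yet part of Hive
> SELECT id, amount
> FROM sales
> {code}
> Under the acceptance criteria above, the first line of the result would be
> expected to carry the column names {{id}} and {{amount}}, written once by the
> task that produces the first chunk of the file.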
> h3. Current workarounds
> * It is usually suggested that users set the CLI option
> {{hive.cli.print.header=true}} and capture the result set from standard output.
> However, this does not work well in scheduled, headless environments such as
> the Oozie Hive action. This can also push the file handling into shell
> scripts and complicate the process of getting the report into HDFS.
> * To keep report processing entirely within the domain of Hive, some users
> {{UNION}} the result of their query with a tiny table of a single row
> containing the header names. A synthesised rank column is used with an
> {{ORDER BY}} to ensure that the header is written to the very start of the
> file. See [this example on Stack
> Overflow|http://stackoverflow.com/questions/15139561/adding-column-headers-to-hive-result-set/25214480#25214480].
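> A minimal sketch of the {{UNION}} workaround is given here to show why it is
> awkward. The {{sales}} table, its columns and the output path are
> hypothetical, and this is not the exact query from the linked answer.
> Depending on the Hive version, the {{ORDER BY}} column may also need to
> appear in the outer select list, and {{ROW FORMAT}} may only be accepted for
> {{LOCAL}} directories.
> {code}
> -- Hypothetical names throughout; adjust to the real report.
> INSERT OVERWRITE DIRECTORY '/reports/daily_sales'
> ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
> SELECT col_a, col_b
> FROM (
>   -- Single literal row holding the header names; rank 0 sorts it first.
>   SELECT 0 AS sort_rank, 'id' AS col_a, 'amount' AS col_b
>   UNION ALL
>   -- Data rows are cast to STRING so both branches have matching types.
>   SELECT 1 AS sort_rank,
>          CAST(id AS STRING) AS col_a,
>          CAST(amount AS STRING) AS col_b
>   FROM sales
> ) report
> ORDER BY sort_rank
> {code}
> The global {{ORDER BY}} funnels all rows through a single reducer, which is
> what places the header at the start of the file. It also means the data
> columns lose their native types and the final stage cannot be parallelised.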
> h3. References
> * HIVE-138: Original request for header functionality.
> * [Hive Wiki: writing data into the file system from
> queries|https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-Writingdataintothefilesystemfromqueries].