[
https://issues.apache.org/jira/browse/PHOENIX-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16830387#comment-16830387
]
Ashu Pachauri commented on PHOENIX-5258:
----------------------------------------
[~elserj] Prashant is a colleague of mine. To make the use case clearer: we
often receive data from multiple sources that we ingest into the same table.
Not all CSVs contain the same set of columns, but each always contains a
subset of the columns present in the table schema.
One way to handle this is to pass a different list of inputcolumns to the bulk
load tool for each source. Another, much cleaner way is to keep the header with
the data. A single run of the tool will still expect a consistent header across
files, but you don't have to reconcile the command line params for each file
separately because the header sits with the data itself.
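
To illustrate the idea, here is a rough Python sketch (not Phoenix code) of
deriving the input column list from a file's own header and checking that it is
a subset of the table schema. The schema in TABLE_COLUMNS and the function name
are made up for the example.

```python
import csv
import io

# Hypothetical table schema; in practice this would come from the target table.
TABLE_COLUMNS = ["ID", "NAME", "CITY", "SCORE"]


def input_columns_from_header(csv_text, table_columns):
    """Return the column names in the first CSV line.

    Raises ValueError if the input is empty or if the header names a
    column that the table does not have.
    """
    first_row = next(csv.reader(io.StringIO(csv_text)), None)
    if first_row is None:
        raise ValueError("empty input: no header line")
    header = [name.strip() for name in first_row]
    unknown = [name for name in header if name not in table_columns]
    if unknown:
        raise ValueError(f"columns not in table schema: {unknown}")
    return header


# Two sources carrying different subsets of the same table's columns.
source_a = "ID,CITY\n1,Paris\n"
source_b = "ID,SCORE,NAME\n2,9.5,Ada\n"
print(input_columns_from_header(source_a, TABLE_COLUMNS))
print(input_columns_from_header(source_b, TABLE_COLUMNS))
```

Each source describes its own columns, so nothing per-file has to be kept in
sync on the command line.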
> Add support to parse header from the input CSV file as input columns for
> CsvBulkLoadTool
> ----------------------------------------------------------------------------------------
>
> Key: PHOENIX-5258
> URL: https://issues.apache.org/jira/browse/PHOENIX-5258
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Prashant Vithani
> Priority: Minor
>
> Currently, CsvBulkLoadTool does not support reading a header from the input
> CSV and expects the content of the CSV to match the table schema. Header
> support can be added to dynamically map the header columns to the table
> schema.
> The proposed solution is to introduce another option for the tool, `--header`.
> If this option is passed, the input columns list is constructed by reading
> the first line of the input CSV file.
> * If there is only one file, read the header from the first line and
> generate the `ColumnInfo` list.
> * If there are multiple files, read the header from all the files, and throw
> an error if the headers across files do not match.
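
The two cases in the proposal above can be sketched in a few lines of Python.
This is only an illustration of the check, not the tool's implementation: the
function names are hypothetical, and the sketch assumes headers must match
exactly, including column order.

```python
import csv
import io


def read_header(stream):
    """Read the first CSV line of a text stream as a list of column names."""
    first_row = next(csv.reader(stream), None)
    if first_row is None:
        raise ValueError("empty input: no header line")
    return [name.strip() for name in first_row]


def resolve_common_header(named_streams):
    """Return the header shared by all inputs.

    named_streams is an iterable of (name, text_stream) pairs. With one
    input, its header is returned. With several, every header must equal
    the first one, otherwise ValueError is raised.
    """
    expected = None
    expected_name = None
    for name, stream in named_streams:
        header = read_header(stream)
        if expected is None:
            expected, expected_name = header, name
        elif header != expected:
            raise ValueError(
                f"header mismatch: {name} has {header}, "
                f"but {expected_name} has {expected}"
            )
    if expected is None:
        raise ValueError("no input files given")
    return expected


files = [
    ("a.csv", io.StringIO("ID,NAME\n1,Ada\n")),
    ("b.csv", io.StringIO("ID,NAME\n2,Bob\n")),
]
print(resolve_common_header(files))
```

Whether a different column order should count as a mismatch is a design choice
the proposal leaves open; the sketch takes the strict reading.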