[ https://issues.apache.org/jira/browse/HIVE-21924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mustafa Iman updated HIVE-21924:
--------------------------------
Attachment: HIVE-21924.6.patch
Status: Patch Available (was: In Progress)
> Split text files even if header/footer exists
> ---------------------------------------------
>
> Key: HIVE-21924
> URL: https://issues.apache.org/jira/browse/HIVE-21924
> Project: Hive
> Issue Type: Improvement
> Components: File Formats
> Affects Versions: 2.4.0, 4.0.0, 3.2.0
> Reporter: Prasanth Jayachandran
> Assignee: Mustafa Iman
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-21924.2.patch, HIVE-21924.3.patch,
> HIVE-21924.4.patch, HIVE-21924.5.patch, HIVE-21924.6.patch, HIVE-21924.patch
>
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> https://github.com/apache/hive/blob/967a1cc98beede8e6568ce750ebeb6e0d048b8ea/ql/src/java/org/apache/hadoop/hive/ql/io/HiveInputFormat.java#L494-L503
>
> {code}
> int headerCount = 0;
> int footerCount = 0;
> if (table != null) {
>   headerCount = Utilities.getHeaderCount(table);
>   footerCount = Utilities.getFooterCount(table, conf);
>   if (headerCount != 0 || footerCount != 0) {
>     // Input file has header or footer, cannot be splitted.
>     HiveConf.setLongVar(conf, ConfVars.MAPREDMINSPLITSIZE,
>         Long.MAX_VALUE);
>   }
> }
> {code}
> This piece of code makes CSV files (or any text files with a header/footer)
> non-splittable if a header or footer is present.
> If only a header is present, we can find the offset after the first line break
> and use that to split. Similarly, for a footer, we could read a few KB of data
> at the end and find the last line break offset, then use that to determine the
> data range that can be split. A few reads during split generation are
> cheaper than not splitting the file at all.
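
The offset idea described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch and not Hive code: the class HeaderFooterOffsets, its two methods, and the tailBytes parameter are hypothetical names, and lines are assumed to end in '\n' (which also covers "\r\n"). It uses plain java.io rather than the Hadoop FileSystem API.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Path;

// Hypothetical helper (not part of Hive): finds the byte range of the data rows
// in a text file whose lines end in '\n'.
final class HeaderFooterOffsets {

  /** Offset of the first byte after the first headerCount lines. */
  static long offsetAfterHeader(Path file, int headerCount) throws IOException {
    if (headerCount <= 0) {
      return 0L;
    }
    try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
      byte[] buf = new byte[4096];
      long pos = 0;   // file offset of buf[0]
      int seen = 0;   // line breaks seen so far
      int n;
      while ((n = in.read(buf)) > 0) {
        for (int i = 0; i < n; i++) {
          if (buf[i] == '\n' && ++seen == headerCount) {
            return pos + i + 1;
          }
        }
        pos += n;
      }
      return pos; // fewer lines than headerCount: the whole file is header
    }
  }

  /** Offset where the last footerCount lines begin; reads only the final tailBytes bytes. */
  static long offsetBeforeFooter(Path file, int footerCount, int tailBytes) throws IOException {
    try (RandomAccessFile in = new RandomAccessFile(file.toFile(), "r")) {
      long len = in.length();
      if (footerCount <= 0 || len == 0) {
        return len;
      }
      int size = (int) Math.min(len, (long) tailBytes);
      long start = len - size;
      byte[] buf = new byte[size];
      in.seek(start);
      in.readFully(buf);
      int i = size - 1;
      if (buf[i] == '\n') {
        i--; // skip the line break that terminates the last footer line
      }
      int seen = 0;
      for (; i >= 0; i--) {
        if (buf[i] == '\n' && ++seen == footerCount) {
          return start + i + 1;
        }
      }
      if (start == 0) {
        return 0L; // the whole file was scanned: everything is footer
      }
      throw new IOException("footer start not found in the last " + size + " bytes");
    }
  }
}
```

Splits would then be generated only over the range [offsetAfterHeader, offsetBeforeFooter). If the footer does not fit in the tail window, a caller could retry with a larger tailBytes or fall back to not splitting the file.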