[
https://issues.apache.org/jira/browse/HIVE-13985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Prasanth Jayachandran updated HIVE-13985:
-----------------------------------------
Attachment: HIVE-13985-branch-1.patch
> ORC improvements for reducing the file system calls in task side
> ----------------------------------------------------------------
>
> Key: HIVE-13985
> URL: https://issues.apache.org/jira/browse/HIVE-13985
> Project: Hive
> Issue Type: Bug
> Components: ORC
> Affects Versions: 2.2.0
> Reporter: Prasanth Jayachandran
> Assignee: Prasanth Jayachandran
> Attachments: HIVE-13985-branch-1.patch, HIVE-13985-branch-2.1.patch,
> HIVE-13985.1.patch, HIVE-13985.2.patch
>
>
> HIVE-13840 fixed some issues with additional file system invocations during
> split generation. Similarly, this JIRA will fix issues with additional file
> system invocations on the task side. To avoid reading footers on the task
> side, users can set hive.orc.splits.include.file.footer to true, which
> serializes the ORC footers into the splits. However, this also serializes
> unneeded information, such as column statistics and other metadata, that is
> not required for reading an ORC split on the task side. We can reduce the
> payload of the ORC splits by serializing only the minimum required
> information (stripe information, types, compression details). The smaller
> payload can potentially avoid OOMs in the
> application master (AM) during split generation. This JIRA also addresses other
> issues concerning the AM cache. The local cache used by the AM is a soft
> reference cache, which can introduce unpredictability across multiple runs of
> the same query. We can instead cache the serialized footer in the local cache
> and use a strong reference cache, which should avoid memory pressure and give
> better predictability.
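
A minimal sketch of the strong reference cache idea described above, assuming a
bounded LRU map that holds serialized footer bytes. The class name
SerializedFooterCache, the string file key, and the entry limit are hypothetical
and are not taken from the patch; Hive's actual cache may be built differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical illustration only, not Hive's cache class.
 * Entries are held by strong references, so the garbage collector never
 * drops them silently; eviction happens only when the size bound is hit.
 */
final class SerializedFooterCache {
    private final LinkedHashMap<String, byte[]> entries;

    SerializedFooterCache(final int maxEntries) {
        // accessOrder=true keeps the least recently used entry first.
        this.entries = new LinkedHashMap<String, byte[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, byte[]> eldest) {
                // Evict deterministically once the bound is exceeded.
                return size() > maxEntries;
            }
        };
    }

    /** Stores the serialized (not deserialized) footer for a file key. */
    synchronized void put(String fileKey, byte[] serializedFooter) {
        entries.put(fileKey, serializedFooter);
    }

    /** Returns the cached bytes, or null when the key is absent or evicted. */
    synchronized byte[] get(String fileKey) {
        return entries.get(fileKey);
    }

    synchronized int size() {
        return entries.size();
    }
}
```

With a fixed bound, which entries survive depends only on the access pattern and
not on garbage collection timing, which is the predictability argument made
above. Storing serialized bytes rather than deserialized objects keeps the
per-entry footprint small.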
> One other improvement: when hive.orc.splits.include.file.footer is set to
> false, the task side makes one additional file system call to determine the
> size of the file. If we serialize the file length in the ORC split, this call
> can be avoided.
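
A rough sketch of the file length idea above, assuming the split carries an
extra long field and the task side falls back to a file system lookup only when
that field is missing. The class SplitWithLength, its fields, and the lookup
callback are hypothetical stand-ins and do not reflect the actual OrcSplit
layout in the patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.function.ToLongFunction;

/** Hypothetical split that carries the file length alongside the byte range. */
final class SplitWithLength {
    final String path;
    final long start;
    final long length;
    final long fileLength; // -1 means "not serialized with the split"

    SplitWithLength(String path, long start, long length, long fileLength) {
        this.path = path;
        this.start = start;
        this.length = length;
        this.fileLength = fileLength;
    }

    /** Writes the split, including the file length, to a byte array. */
    byte[] serialize() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(path);
            out.writeLong(start);
            out.writeLong(length);
            out.writeLong(fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Reads the fields back in the same order they were written. */
    static SplitWithLength deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            String path = in.readUTF();
            long start = in.readLong();
            long length = in.readLong();
            long fileLength = in.readLong();
            return new SplitWithLength(path, start, length, fileLength);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns the file length on the task side. The lookup (a stand-in for a
     * file system call) is invoked only when the split did not carry it.
     */
    long resolveFileLength(ToLongFunction<String> fileSystemLookup) {
        return fileLength >= 0 ? fileLength : fileSystemLookup.applyAsLong(path);
    }
}
```

In a real task the lookup would be the file status call against the file
system; the point of the sketch is only that the call is skipped whenever the
split already knows the length.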