[
https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229239#comment-15229239
]
Owen O'Malley commented on HIVE-13345:
--------------------------------------
The current leaking of the OrcProto objects outside of the reader
implementation is problematic and should be fixed.
For fast loading, we should create a ReaderImpl constructor that takes a
serialized file tail. The C++ implementation uses:
// The contents of the file tail that must be serialized.
message FileTail {
  optional PostScript postscript = 1;
  optional Footer footer = 2;
  optional uint64 fileLength = 3;
  optional uint64 postscriptLength = 4;
}
I assume you aren't proposing hand-rolled serialization, which would be
very error-prone. If I'd seen FlatBuffers before I started ORC, I would have
been tempted to go that way. Now it would be too much pain for too little gain.
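
To make the serialized-tail idea concrete, here is a minimal sketch of a cache
that holds only the serialized bytes and parses them on demand. Everything in
it is hypothetical: the class name SerializedTailCache, the string file key,
and the pluggable parser function are illustration only and are not part of
ORC or Hive. In a real reader the parser would presumably be the
protobuf-generated parse method for the FileTail message; the demo uses a
UTF-8 string decode as a stand-in so the example runs without any dependencies.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical cache that stores serialized tails and parses them lazily. */
class SerializedTailCache<T> {
    // Only the compact serialized form is retained between lookups.
    private final Map<String, byte[]> serialized = new ConcurrentHashMap<>();
    // Assumption: the caller supplies the deserializer (e.g. a protobuf parser).
    private final Function<byte[], T> parser;

    SerializedTailCache(Function<byte[], T> parser) {
        this.parser = parser;
    }

    /** Stores a defensive copy of the serialized tail for the given file. */
    void put(String fileKey, byte[] tailBytes) {
        serialized.put(fileKey, tailBytes.clone());
    }

    /** Parses the cached bytes into a fresh object, or returns null if absent. */
    T get(String fileKey) {
        byte[] bytes = serialized.get(fileKey);
        return bytes == null ? null : parser.apply(bytes);
    }

    /** Payload bytes held by the cache, excluding map and key overhead. */
    long cachedBytes() {
        long total = 0;
        for (byte[] b : serialized.values()) {
            total += b.length;
        }
        return total;
    }

    public static void main(String[] args) {
        // Stand-in parser: decode bytes as UTF-8 instead of parsing protobuf.
        SerializedTailCache<String> cache =
            new SerializedTailCache<>(b -> new String(b, StandardCharsets.UTF_8));
        cache.put("file-1", "tail".getBytes(StandardCharsets.UTF_8));
        System.out.println(cache.get("file-1") + " / " + cache.cachedBytes() + " bytes");
    }
}
```

The trade-off is CPU for memory: each get() pays a parse, but the cache never
pins the parsed object graph.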
> LLAP: metadata cache takes too much space, esp. with bloom filters, due to
> Java/protobuf overhead
> -------------------------------------------------------------------------------------------------
>
> Key: HIVE-13345
> URL: https://issues.apache.org/jira/browse/HIVE-13345
> Project: Hive
> Issue Type: Bug
> Reporter: Sergey Shelukhin
>
> We currently cache Java objects; these have high overhead. Average stripe
> metadata takes 200-500 KB on real files, and bloom filters blow up by more
> than 5x because they are stored as lists of Longs, reaching up to 5 MB per
> stripe. That is undesirable.
> We should either create better objects for ORC (might be good in general) or
> store serialized metadata and deserialize when needed.
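
As a rough illustration of the "better objects" option for bloom filters, the
sketch below copies a List<Long> into a primitive long[] and answers bit
queries from the array. Each boxed Long carries an object header and a
reference in addition to its 8-byte value, whereas a long[] stores 8 bytes per
word. The class name PackedBloomBits and the bit layout (bit i lives in word
i / 64 at position i % 64) are assumptions made for this example, not ORC's
actual bloom filter format.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical compact holder for bloom filter words. */
class PackedBloomBits {
    // Primitive array: no per-element object header or reference.
    private final long[] words;

    PackedBloomBits(List<Long> boxedWords) {
        words = new long[boxedWords.size()];
        for (int i = 0; i < words.length; i++) {
            words[i] = boxedWords.get(i); // auto-unboxing
        }
    }

    /** Returns true if the given bit is set (assumed layout, see lead-in). */
    boolean testBit(int bitIndex) {
        return (words[bitIndex >>> 6] & (1L << (bitIndex & 63))) != 0;
    }

    int wordCount() {
        return words.length;
    }

    public static void main(String[] args) {
        // 5L has bits 0 and 2 set in the first word.
        PackedBloomBits bits = new PackedBloomBits(Arrays.asList(5L, 0L));
        System.out.println(bits.testBit(0) + " " + bits.testBit(1) + " " + bits.testBit(2));
    }
}
```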