[ 
https://issues.apache.org/jira/browse/HIVE-13345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15229239#comment-15229239
 ] 

Owen O'Malley commented on HIVE-13345:
--------------------------------------

OrcProto objects currently leak outside of the reader implementation. That 
is problematic and should be fixed.

For fast loading, we should create a ReaderImpl constructor that takes a 
serialized file tail. The C++ implementation uses:

// The contents of the file tail that must be serialized.
message FileTail {
  optional PostScript postscript = 1;
  optional Footer footer = 2;
  optional uint64 fileLength = 3;
  optional uint64 postscriptLength = 4;
}

I assume you aren't proposing hand-rolled serialization, which would be 
very error-prone. If I'd seen FlatBuffers before I started ORC, I would have 
been tempted to go that way. Now it would be too much pain for too little gain.
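
To make the serialized-tail idea concrete, here is a minimal sketch of a cache that keeps only the serialized bytes and parses them on demand. The class name SerializedTailCache and the parser function are hypothetical and are not part of ORC or Hive. In real use the parser would presumably be the protobuf-generated parse method for the FileTail message, so no hand-rolled format is involved.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Hypothetical cache that stores metadata as serialized bytes and
 * deserializes only when a caller asks for it. T stands in for the
 * parsed form (for example, a protobuf-generated FileTail).
 */
final class SerializedTailCache<T> {
    // Only compact byte arrays are retained, not parsed object graphs.
    private final Map<String, byte[]> tails = new ConcurrentHashMap<>();
    // Assumed to be supplied by the caller, e.g. a generated parser.
    private final Function<byte[], T> parser;

    SerializedTailCache(Function<byte[], T> parser) {
        this.parser = parser;
    }

    /** Stores a defensive copy of the serialized tail for the given file. */
    void put(String fileKey, byte[] serializedTail) {
        tails.put(fileKey, serializedTail.clone());
    }

    /** Parses the cached bytes on each call; returns null if nothing is cached. */
    T get(String fileKey) {
        byte[] bytes = tails.get(fileKey);
        return bytes == null ? null : parser.apply(bytes);
    }

    /** Total size of the cached payloads in bytes (excludes map overhead). */
    long cachedBytes() {
        long total = 0;
        for (byte[] b : tails.values()) {
            total += b.length;
        }
        return total;
    }
}
```

The trade-off in this sketch is CPU for memory: every get() pays the parsing cost again, but the cache holds only the bytes rather than a tree of Java objects.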



> LLAP: metadata cache takes too much space, esp. with bloom filters, due to 
> Java/protobuf overhead
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-13345
>                 URL: https://issues.apache.org/jira/browse/HIVE-13345
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> We currently cache Java objects, which have high overhead. Average stripe 
> metadata takes 200-500 KB on real files. With bloom filters, which blow up 
> more than 5x because they are stored as lists of Long objects, it takes up 
> to 5 MB per stripe. That is undesirable.
> We should either create better objects for ORC (might be good in general) or 
> store serialized metadata and deserialize when needed.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
