[
https://issues.apache.org/jira/browse/HIVE-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637678#comment-14637678
]
Sushanth Sowmyan commented on HIVE-11344:
-----------------------------------------
There are three routes I see available here:
a) There is decompress logic in PartInfo.setTableInfo, and compress logic in
PartInfo.writeObject. We could make it so that PartInfo.writeObject does the
"compression", writes itself, and then undoes the compression, so that the
in-memory object is left unchanged.
b) We could decompress on demand - wherein if a user calls
getInputFormatClassName(), we then fetch that info if it's not available, and
always return values consistently.
c) We could add a new conf parameter that controls whether or not we do
compression - users with 100k splits would prefer compression, and be okay with
the fact that PartInfo objects are not usable, and users that want to use the
PartInfo objects will be okay with the fact that they are going to hog a little
bit more serialized space.
(c) is a bad solution all-round. [~ashutoshc] would be mad at me for adding
another conf parameter, and it is entirely possible that those who are trying
to implement other streaming interfaces/etc and are mimicking M/R will run into
a large number of partitions as well.
(b) is nifty, and I rather like the idea, but I'm not entirely certain whether
it will run afoul of other serialization methods in the future that call
getters to get fields (some json serializers), which might result in a bloated
serialized PartInfo object anyway. Also, it spreads the decompression logic
across multiple getters, and pushes the assert statement into multiple places
as well.
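To make the concern with (b) concrete, here is a minimal sketch of on-demand
decompression. LazyTableInfo, LazyPartInfo, compress() and toGetterView() are
hypothetical stand-ins invented for illustration, not the real HCatalog
classes or methods, and only one field is modelled.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for TableInfo (assumption, not the HCatalog class).
class LazyTableInfo {
    final String inputFormatClassName;

    LazyTableInfo(String inputFormatClassName) {
        this.inputFormatClassName = inputFormatClassName;
    }
}

// Hypothetical stand-in for PartInfo with a getter that decompresses on demand.
class LazyPartInfo {
    private final LazyTableInfo tableInfo;
    // null means "same as the table-level value", i.e. the compressed form.
    private String inputFormatClassName;

    LazyPartInfo(LazyTableInfo tableInfo, String inputFormatClassName) {
        this.tableInfo = tableInfo;
        this.inputFormatClassName = inputFormatClassName;
    }

    // Null out the field when it duplicates the table-level value.
    void compress() {
        if (tableInfo != null
                && Objects.equals(inputFormatClassName, tableInfo.inputFormatClassName)) {
            inputFormatClassName = null;
        }
    }

    boolean isCompressed() {
        return inputFormatClassName == null;
    }

    // Always returns a consistent value, whether or not the field is compressed.
    String getInputFormatClassName() {
        if (inputFormatClassName != null) {
            return inputFormatClassName;
        }
        if (tableInfo == null) {
            throw new IllegalStateException("compressed field but no table info set");
        }
        return tableInfo.inputFormatClassName;
    }

    // Mimics a serializer that reads values through getters: it captures the
    // decompressed value, so the compression saves nothing for such serializers.
    Map<String, String> toGetterView() {
        Map<String, String> view = new LinkedHashMap<>();
        view.put("inputFormatClassName", getInputFormatClassName());
        return view;
    }
}
```

Every compressible field would need the same fallback (and the same check) in
its own getter, which is the spreading of logic mentioned above.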
(a) is probably the cleanest solution, although it makes a code reader wonder
why we're going through the gymnastics we are. Some code comments might help
with that.
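As a rough illustration of (a), the sketch below compresses inside
writeObject, writes, and then restores the field in a finally block.
SketchTableInfo, SketchPartInfo and SketchIo are hypothetical stand-ins
invented for this example, not the actual HCatalog classes, and only one
field is modelled.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Objects;

// Hypothetical stand-in for TableInfo (assumption, not the HCatalog class).
class SketchTableInfo implements Serializable {
    private static final long serialVersionUID = 1L;
    final String inputFormatClassName;

    SketchTableInfo(String inputFormatClassName) {
        this.inputFormatClassName = inputFormatClassName;
    }
}

// Hypothetical stand-in for PartInfo, reduced to one compressible field.
class SketchPartInfo implements Serializable {
    private static final long serialVersionUID = 1L;

    // Assumed to be serialized once by the owning split, so transient here.
    private transient SketchTableInfo tableInfo;
    private String inputFormatClassName;

    SketchPartInfo(SketchTableInfo tableInfo, String inputFormatClassName) {
        this.tableInfo = tableInfo;
        this.inputFormatClassName = inputFormatClassName;
    }

    String getInputFormatClassName() {
        return inputFormatClassName;
    }

    // "Decompress": fill in a field that was nulled out before serialization.
    void setTableInfo(SketchTableInfo tableInfo) {
        this.tableInfo = tableInfo;
        if (inputFormatClassName == null) {
            inputFormatClassName = tableInfo.inputFormatClassName;
        }
    }

    // Compress, write, then restore, so the caller's object stays usable.
    private void writeObject(ObjectOutputStream out) throws IOException {
        String saved = inputFormatClassName;
        try {
            if (tableInfo != null
                    && Objects.equals(saved, tableInfo.inputFormatClassName)) {
                inputFormatClassName = null; // duplicate of table-level value
            }
            out.defaultWriteObject();
        } finally {
            inputFormatClassName = saved; // undo the compression
        }
    }
}

// Small helpers for round-tripping objects through Java serialization.
class SketchIo {
    static byte[] serialize(Object o) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        SketchTableInfo table = new SketchTableInfo("org.example.FooInputFormat");
        SketchPartInfo part = new SketchPartInfo(table, "org.example.FooInputFormat");
        byte[] data = serialize(part);
        // The original object is still usable after being written.
        System.out.println(part.getInputFormatClassName());
        SketchPartInfo copy = (SketchPartInfo) deserialize(data);
        copy.setTableInfo(table); // restores the nulled-out field on the reader side
        System.out.println(copy.getInputFormatClassName());
    }
}
```

The try/finally is the part that would deserve a code comment in the real
change, since without context it looks like needless work.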
> HIVE-9845 makes HCatSplit.write modify the split so that PartitionInfo
> objects are unusable after it
> ----------------------------------------------------------------------------------------------------
>
> Key: HIVE-11344
> URL: https://issues.apache.org/jira/browse/HIVE-11344
> Project: Hive
> Issue Type: Bug
> Affects Versions: 1.2.0
> Reporter: Sushanth Sowmyan
> Assignee: Sushanth Sowmyan
>
> HIVE-9845 introduced a notion of compression for HCatSplits so that when
> serializing, it finds commonalities between PartInfo and TableInfo objects,
> and if the two are identical, it nulls out that field in PartInfo, thus
> making sure that when PartInfo is then serialized, info is not repeated.
> This, however, has the side effect of making the PartInfo object unusable if
> HCatSplit.write has been called.
> While this does not affect M/R directly, since they do not know about the
> PartInfo objects and once serialized, the HCatSplit object is recreated by
> deserializing on the backend, which does restore the split and its PartInfo
> objects, this does, however, affect framework users of HCat that try to mimic
> M/R and then use the PartInfo objects to instantiate distinct readers.
> Thus, we need to make it so that PartInfo is still usable after
> HCatSplit.write is called.