[ 
https://issues.apache.org/jira/browse/HIVE-11344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637678#comment-14637678
 ] 

Sushanth Sowmyan commented on HIVE-11344:
-----------------------------------------

There are three routes I see available here:

a) There is decompress logic in PartInfo.setTableInfo, and compress logic in 
PartInfo.writeObject. We could make it so that PartInfo.writeObject does the 
"compression", writes itself, and then undoes the compression afterwards.
b) We could decompress on demand - wherein if a user calls 
getInputFormatClassName(), we then fetch that info if it's not available, and 
always return values consistently.
c) We could add a new conf parameter that controls whether or not we do 
compression - users with 100k splits would prefer compression, and be okay with 
the fact that PartInfo objects are not usable, and users that want to use the 
PartInfo objects will be okay with the fact that they are going to hog a little 
bit more serialized space.
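
To make route (b) concrete, here is a minimal sketch of decompress-on-demand. 
It is not the actual HCatalog code: LazyTableInfoSketch, LazyPartInfoSketch, 
compress() and isCompressed() are hypothetical names, and the real PartInfo 
has several fields that would each need the same treatment.

```java
// Hypothetical stand-in for the table-level info; not the real HCatTableInfo.
class LazyTableInfoSketch {
    final String inputFormatClassName;

    LazyTableInfoSketch(String inputFormatClassName) {
        this.inputFormatClassName = inputFormatClassName;
    }
}

// Hypothetical stand-in for PartInfo, reduced to a single compressible field.
class LazyPartInfoSketch {
    private final LazyTableInfoSketch tableInfo;
    // null means "same as the table-level value" once compress() has run.
    private String inputFormatClassName;

    LazyPartInfoSketch(LazyTableInfoSketch tableInfo, String inputFormatClassName) {
        this.tableInfo = tableInfo;
        this.inputFormatClassName = inputFormatClassName;
    }

    // Assumed compression step: drop the field if it duplicates the table's value.
    void compress() {
        if (inputFormatClassName != null
                && inputFormatClassName.equals(tableInfo.inputFormatClassName)) {
            inputFormatClassName = null;
        }
    }

    // True while the field is nulled out, i.e. stored only on the table info.
    boolean isCompressed() {
        return inputFormatClassName == null;
    }

    // Route (b): the getter falls back to the table info, so callers see a
    // consistent value whether or not compress() has been called.
    String getInputFormatClassName() {
        return inputFormatClassName != null
                ? inputFormatClassName
                : tableInfo.inputFormatClassName;
    }
}
```

In this sketch the fallback has to be repeated in every getter of a 
compressible field, and any serializer that reads fields through getters 
would see the full value again.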

(c) is a bad solution all-round. [~ashutoshc] would be mad at me for adding 
another conf parameter, and it is entirely possible that those who are trying 
to implement other streaming interfaces, etc., and are mimicking M/R will run 
into a large number of partitions as well.
(b) is nifty, and I probably like the idea, but I'm not entirely certain 
whether it will run afoul of other serialization methods in the future that 
call getters to get fields (some JSON serializers do), which might result in 
a bloated serialized PartInfo object anyway. Also, it spreads the 
decompression logic across multiple getters, and pushes the assert statement 
into multiple places as well.
(a) is probably the cleanest solution, although it makes a code reader wonder 
why we're going through the gymnastics we are. Some code comments might help 
with that.
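
A rough sketch of what (a) could look like follows. It is not the actual 
PartInfo code: TableInfoSketch, PartInfoSketch, serialize() and deserialize() 
are hypothetical names, only one field is shown, and it assumes the table 
info is transient on the partition side and re-attached through setTableInfo 
after deserialization.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for the table-level info; not the real HCatTableInfo.
class TableInfoSketch implements Serializable {
    private static final long serialVersionUID = 1L;
    final String inputFormatClassName;

    TableInfoSketch(String inputFormatClassName) {
        this.inputFormatClassName = inputFormatClassName;
    }
}

// Hypothetical stand-in for PartInfo, reduced to a single compressible field.
class PartInfoSketch implements Serializable {
    private static final long serialVersionUID = 1L;

    // Assumption: the table info travels separately and is re-attached on read.
    private transient TableInfoSketch tableInfo;
    private String inputFormatClassName;

    PartInfoSketch(TableInfoSketch tableInfo, String inputFormatClassName) {
        this.tableInfo = tableInfo;
        this.inputFormatClassName = inputFormatClassName;
    }

    String getInputFormatClassName() {
        return inputFormatClassName;
    }

    // "Decompression": fill a nulled field back in from the table info.
    void setTableInfo(TableInfoSketch tableInfo) {
        this.tableInfo = tableInfo;
        if (inputFormatClassName == null) {
            inputFormatClassName = tableInfo.inputFormatClassName;
        }
    }

    // Route (a): compress, write, then restore, so that this object is
    // still usable after it has been serialized.
    private void writeObject(ObjectOutputStream out) throws IOException {
        String saved = inputFormatClassName;
        if (tableInfo != null && saved != null
                && saved.equals(tableInfo.inputFormatClassName)) {
            inputFormatClassName = null; // duplicate of the table-level value
        }
        try {
            out.defaultWriteObject();
        } finally {
            inputFormatClassName = saved; // undo the compression
        }
    }

    // Helper for the sketch: serialize to a byte array.
    static byte[] serialize(PartInfoSketch part) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(part);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bytes.toByteArray();
    }

    // Helper for the sketch: deserialize and re-attach the table info.
    static PartInfoSketch deserialize(byte[] data, TableInfoSketch tableInfo) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(data))) {
            PartInfoSketch part = (PartInfoSketch) in.readObject();
            part.setTableInfo(tableInfo);
            return part;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

One caveat with this sketch: the field is briefly null while writeObject 
runs, so a reader on another thread could still observe the compressed state 
unless access is synchronized.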


> HIVE-9845 makes HCatSplit.write modify the split so that PartitionInfo 
> objects are unusable after it
> ----------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-11344
>                 URL: https://issues.apache.org/jira/browse/HIVE-11344
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.2.0
>            Reporter: Sushanth Sowmyan
>            Assignee: Sushanth Sowmyan
>
> HIVE-9845 introduced a notion of compression for HCatSplits so that when 
> serializing, it finds commonalities between PartInfo and TableInfo objects, 
> and if the two are identical, it nulls out that field in PartInfo, thus 
> making sure that when PartInfo is then serialized, info is not repeated.
> This, however, has the side effect of making the PartInfo object unusable if 
> HCatSplit.write has been called.
> While this does not affect M/R directly, since they do not know about the 
> PartInfo objects and once serialized, the HCatSplit object is recreated by 
> deserializing on the backend, which does restore the split and its PartInfo 
> objects, this does, however, affect framework users of HCat that try to mimic 
> M/R and then use the PartInfo objects to instantiate distinct readers.
> Thus, we need to make it so that PartInfo is still usable after 
> HCatSplit.write is called.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
