Reducing JobConf size used by HCatInputFormat
---------------------------------------------

                 Key: HCATALOG-142
                 URL: https://issues.apache.org/jira/browse/HCATALOG-142
             Project: HCatalog
          Issue Type: Improvement
    Affects Versions: 0.3
            Reporter: Sushanth Sowmyan


Currently, the setInput() call in HCat fetches information about all the 
partitions we want to read from and stores it in the JobConf. It is stored 
there because setInput() is called statically, while the information is needed 
later, when the MR framework calls getSplits(). Since the first call is static 
and the second is made on an object instantiated by the MR framework (so 
nothing can be passed through member variables), we pass the information along 
through the JobConf.
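
To make the hand-off concrete, here is a minimal sketch, not HCatalog's actual 
code. A plain Map stands in for the JobConf, and the class name, key name, and 
"split:" strings are all hypothetical. It only illustrates why the serialized 
information grows with the number of partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a Map<String, String> stands in for the JobConf.
class JobConfHandoffSketch {
    // Hypothetical key name, not the one HCatalog uses.
    static final String PARTITIONS_KEY = "sketch.input.partitions";

    // Static entry point, like setInput(): there is no instance to hold state,
    // so everything getSplits() needs is serialized into the configuration.
    static void setInput(Map<String, String> conf, List<String> partitionLocations) {
        conf.put(PARTITIONS_KEY, String.join(",", partitionLocations));
    }

    // Called on an instance created by the framework; it sees only the configuration.
    List<String> getSplits(Map<String, String> conf) {
        List<String> splits = new ArrayList<>();
        String stored = conf.get(PARTITIONS_KEY);
        if (stored == null || stored.isEmpty()) {
            return splits;
        }
        for (String location : stored.split(",")) {
            splits.add("split:" + location);
        }
        return splits;
    }

    // Length of the serialized partition info; it grows with the partition count.
    static int storedSize(Map<String, String> conf) {
        String stored = conf.get(PARTITIONS_KEY);
        return stored == null ? 0 : stored.length();
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        setInput(conf, Arrays.asList("/t/p=1", "/t/p=2"));
        System.out.println(new JobConfHandoffSketch().getSplits(conf));
        System.out.println("stored chars: " + storedSize(conf));
    }
}
```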

We could instead contact the metastore at getSplits() time, but contacting it 
that late breaks other things, such as being able to check up front whether the 
input can succeed, or to inspect the schema. A hybrid approach would address 
that: contact the metastore during setInput() to get the schema and check 
whether input is possible, without fetching the partition objects to store in 
the JobConf, and then contact the metastore again during getSplits() to 
populate the splits with information fetched from the partition objects.
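
The hybrid approach could look roughly like the sketch below. Everything in it 
is an assumption made for illustration: FakeMetastore is an in-memory stand-in 
with a call counter, a Map stands in for the JobConf, and the key names are 
invented. setInput() stores only the table name and schema, and getSplits() 
makes the second metastore call for the partitions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class HybridSetInputSketch {
    // Hypothetical in-memory stand-in for the metastore; counts how often it is contacted.
    static class FakeMetastore {
        int calls = 0;
        final Map<String, String> schemas = new HashMap<>();
        final Map<String, List<String>> partitions = new HashMap<>();

        String getSchema(String table) {
            calls++;
            String schema = schemas.get(table);
            if (schema == null) {
                throw new IllegalArgumentException("no such table: " + table);
            }
            return schema;
        }

        List<String> listPartitions(String table) {
            calls++;
            return partitions.getOrDefault(table, Collections.<String>emptyList());
        }
    }

    // Hypothetical key names.
    static final String TABLE_KEY = "sketch.input.table";
    static final String SCHEMA_KEY = "sketch.input.schema";

    // First contact: fetch the schema and fail early if the table is missing.
    // No partition objects are stored, so the configuration stays small.
    static void setInput(Map<String, String> conf, FakeMetastore metastore, String table) {
        conf.put(SCHEMA_KEY, metastore.getSchema(table));
        conf.put(TABLE_KEY, table);
    }

    // Second contact: fetch the partitions only when the splits are computed.
    static List<String> getSplits(Map<String, String> conf, FakeMetastore metastore) {
        List<String> splits = new ArrayList<>();
        for (String location : metastore.listPartitions(conf.get(TABLE_KEY))) {
            splits.add("split:" + location);
        }
        return splits;
    }
}
```

In this sketch a missing table is still reported by setInput(), before the job 
is submitted, while anything wrong with an individual partition would only 
surface in getSplits().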

Some issues with this approach remain:
a) Contacting the metastore multiple times increases the number of times it is 
hit. (Technically this only moves accesses around, so it should be okay; they 
are just spread out a bit more.)
b) Checks such as whether the partition objects are valid, or whether the 
storage drivers specified exist and can be instantiated, now happen at 
getSplits() time. This makes error handling harder for programs, since the 
failure occurs after they submit a job rather than during a pre-run check. 
(This should also be okay for most programs.)

Further discussion and thoughts on this issue are welcome. :)
