Reducing JobConf size used by HCatInputFormat
---------------------------------------------
Key: HCATALOG-142
URL: https://issues.apache.org/jira/browse/HCATALOG-142
Project: HCatalog
Issue Type: Improvement
Affects Versions: 0.3
Reporter: Sushanth Sowmyan
Currently, the setInput() call in HCat fetches information about all the
partitions we want to read from and stores it in the JobConf, so the JobConf
grows with the number of partitions read. The information is stored there
because setInput() is called statically, while the information is needed
later, when the MR framework calls getSplits(). Since the first call is static
and the second is made on an object instantiated by the MR framework (so
nothing can be passed through member variables), we pass the information along
through the JobConf.
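To make the handoff concrete, here is a minimal sketch of the pattern described
above. It is not HCatalog code: a plain Map stands in for the JobConf, and the
class names, the key "sketch.hcat.partitions" and the comma-joined encoding are
all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JobConfHandoffSketch {
    // Hypothetical key, not the actual HCatalog property name.
    static final String PARTITIONS_KEY = "sketch.hcat.partitions";

    /** Stand-in for the static setInput(): runs on the client before submission. */
    static void setInput(Map<String, String> jobConf, List<String> partitionLocations) {
        // The configuration is the only channel to the instance the framework
        // creates later, so all partition info is flattened into one value.
        // (Assumes locations contain no commas; a real encoding would differ.)
        jobConf.put(PARTITIONS_KEY, String.join(",", partitionLocations));
    }

    /** Stand-in for the InputFormat object that the MR framework instantiates. */
    static class SketchInputFormat {
        List<String> getSplits(Map<String, String> jobConf) {
            List<String> splits = new ArrayList<>();
            String serialized = jobConf.get(PARTITIONS_KEY);
            if (serialized == null || serialized.isEmpty()) {
                return splits; // setInput() was never called
            }
            for (String location : serialized.split(",")) {
                splits.add("split:" + location);
            }
            return splits;
        }
    }

    public static void main(String[] args) {
        Map<String, String> jobConf = new HashMap<>();
        setInput(jobConf, Arrays.asList("/warehouse/t/ds=1", "/warehouse/t/ds=2"));
        List<String> splits = new SketchInputFormat().getSplits(jobConf);
        // The stored value grows with every partition added to the input.
        System.out.println(splits + " from a conf value of "
                + jobConf.get(PARTITIONS_KEY).length() + " chars");
    }
}
```

The point of the sketch is only that the stored value scales with the number of
partitions, which is what this issue is trying to reduce.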
We could instead contact the metastore at getSplits() time. That defers the
metastore access, but it breaks other things, such as checking up front
whether the input can succeed or inspecting the schema. A hybrid approach
would address that: contact the metastore during setInput() to get the schema
and check whether input is possible, without fetching the partition objects to
store in the JobConf, and then contact the metastore again during getSplits()
to populate the splits with information fetched from the partition objects.
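The hybrid approach could look roughly like the sketch below. Everything in it
is hypothetical: FakeMetastore is an in-memory stand-in that only counts round
trips, a Map stands in for the JobConf, and the "sketch.*" keys are invented.
It is meant to show where each metastore contact would happen, not how
HCatalog would implement it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HybridSetInputSketch {
    /** Hypothetical stand-in for a metastore client; counts round trips. */
    static class FakeMetastore {
        int calls = 0;

        String getSchema(String table) {
            calls++;
            return "id:int,name:string";
        }

        List<String> listPartitions(String table) {
            calls++;
            return Arrays.asList(table + "/ds=1", table + "/ds=2");
        }
    }

    /** Client side: validate early, store only small table-level facts. */
    static void setInput(Map<String, String> jobConf, FakeMetastore ms, String table) {
        String schema = ms.getSchema(table); // first contact: schema and sanity check
        if (schema == null) {
            throw new IllegalArgumentException("cannot read from table: " + table);
        }
        jobConf.put("sketch.table", table);
        jobConf.put("sketch.schema", schema);
        // Deliberately no partition objects are stored in the conf.
    }

    /** Framework side: second contact, partitions are fetched only now. */
    static List<String> getSplits(Map<String, String> jobConf, FakeMetastore ms) {
        List<String> splits = new ArrayList<>();
        for (String partition : ms.listPartitions(jobConf.get("sketch.table"))) {
            // Invalid partitions or storage drivers would only be detected here.
            splits.add("split:" + partition);
        }
        return splits;
    }

    public static void main(String[] args) {
        FakeMetastore ms = new FakeMetastore();
        Map<String, String> jobConf = new HashMap<>();
        setInput(jobConf, ms, "t");
        List<String> splits = getSplits(jobConf, ms);
        System.out.println(splits + ", conf entries: " + jobConf.size()
                + ", metastore calls: " + ms.calls);
    }
}
```

In this sketch the conf size no longer depends on the partition count, at the
cost of a second metastore round trip and later error detection.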
Issues with this approach still exist:
a) Contacting the metastore multiple times increases metastore load.
(Technically it only moves accesses around, so it should be okay; they are
just spread out a bit more.)
b) Checks such as whether the partition objects are valid, or whether the
specified storage drivers exist and can be instantiated, now happen at
getSplits() time. Programs then have a harder time handling errors, since
these surface after the job is submitted rather than in a pre-run check.
(This should also be okay for most programs.)
Further discussion and thoughts on this issue are welcome. :)