[
https://issues.apache.org/jira/browse/HCATALOG-142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13136438#comment-13136438
]
Sushanth Sowmyan edited comment on HCATALOG-142 at 10/26/11 9:42 PM:
---------------------------------------------------------------------
Adding some more information regarding impact of this on JobConf size. On
running the unit tests and measuring the cost of adding this information to
JobConf, I get the following:
|| #partitions || serialized String length ||
| 6 | 12756 |
| 1 | 8646 |
| 2 | 9476 |
| 1 | 8646 |
| 1003 | 824042 |
| 1 | 7710 |
| 1 | 8194 |
| 1 | 8134 |
| 3 | 10334 |
| 1 | 8650 |
| 1 | 8650 |
| 2 | 9520 |
| 3 | 10334 |
| 4 | 11416 |
| 1 | 8906 |
| 5 | 12322 |
| 6 | 13148 |
| 1 | 8084 |
| 2 | 9288 |
| 1 | 8536 |
| 1 | 8536 |
| 1 | 8084 |
| 1 | 8084 |
| 1 | 8262 |
| 1 | 8262 |
| 1 | 8174 |
| 1 | 8356 |
That's the length of the String equivalent of the serialized InputJobInfo and
how it changes as number of partitions added increase. This is put in to
JobConf and can chage a bit as a result of serialization of the JobConf itself.
The individual PartInfo objects themselves are consistently about 3k characters
in length, but as the above experiments show, the actual cost of multiple
serialized objects in the conf reduces as the number of partitions increase.
The "cost" per partition goes down from a 8000 chars per partition cost for a
single partition to 821 chars per partition for about 1000 partitions.
was (Author: sushanth):
Adding some more information regarding impact of this on JobConf size. On
running the unit tests and measuring the cost of adding this information to
JobConf, I get the following:
|| #partitions || serialized String length ||
| 6 | 12756 |
| 1 | 8646 |
| 2 | 9476 |
| 1 | 8646 |
| 1003 | 824042 |
| 1 | 7710 |
| 1 | 8194 |
| 1 | 8134 |
| 3 | 10334 |
| 1 | 8650 |
| 1 | 8650 |
| 2 | 9520 |
| 3 | 10334 |
| 4 | 11416 |
| 1 | 8906 |
| 5 | 12322 |
| 6 | 13148 |
| 1 | 8084 |
| 2 | 9288 |
| 1 | 8536 |
| 1 | 8536 |
| 1 | 8084 |
| 1 | 8084 |
| 1 | 8262 |
| 1 | 8262 |
| 1 | 8174 |
| 1 | 8356 |
That's the length of the String equivalent of the serialized InputJobInfo and
how it changes as number of partitions added increase. This is put in to
JobConf and can chage a bit as a result of serialization of the JobConf itself.
The individual PartInfo objects themselves are consistently about 3k characters
in length, but as the above experiments show, the actual cost of multiple
serialized objects in the conf reduces as the number of partitions increase.
The "cost" per partition goes down from a 8000 chars per partition cost for a
single partition to 823 chars per partition for 1000 partitions.
> Reducing JobConf size used by HCatInputFormat
> ---------------------------------------------
>
> Key: HCATALOG-142
> URL: https://issues.apache.org/jira/browse/HCATALOG-142
> Project: HCatalog
> Issue Type: Improvement
> Affects Versions: 0.2, 0.3
> Reporter: Sushanth Sowmyan
> Labels: inputformat, metastore
> Fix For: 0.3
>
>
> Currently, the .setInput() call in HCat fetches information regarding all the
> partitions we want to read from, and stores it in the JobConf. The reason it
> stores it there is because it is statically called, and that information is
> required at the time the MR framework calls getSplits(). Since the first call
> is a static call and the second is a call on an object instantiated by the MR
> framework (implying no member variable based info passing), we pass that
> information along through the JobConf.
> Now, we could move the place where we contact the metastore to the
> getSplits() time, which means we contact the metastore late, but that breaks
> other things like being able to check whether the input can/will succeed, or
> checking the schema/etc. Now, we could follow a hybrid approach to address
> that too, and contact the metastore during the setInput() to get the schema,
> check whether input is possible, and not get the partition objects at that
> time to set in the jobconf, and then contact the metastore again during the
> getSplits() to populate the splits with information fetched from the
> partition objects.
> Issues with this approach still exist :
> a) Multiple contacts to the metastore increase number of times metastore load
> (technically, it's still only moving accesses around, so it should be okay,
> just that it's separated a bit more)
> b) Things like testing whether the partition objects are valid, whether the
> storage drivers specified exist/can be instantiated, etc are now at
> getSplits() time, which means the programs have a harder time of
> error-handling, since this happens after they submit a job rather than as a
> pre-run check-time. (this should also be okay for most programs)
> Further discussion/thoughts on this issue is welcome. :)
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira