Thought I wrote back to this on the train but it didn't send :/

Agreed the distributed cache would be a good way to distribute this
info to worker tasks, since it reuses an existing MR feature for
sharing files with worker tasks.
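
To make the "pointer in the job conf" idea from Alan's mail below concrete, here's a rough sketch. It only uses the JDK: java.util.Properties stands in for the real job conf, a temp file stands in for a file shipped through the distributed cache, and the key name hcat.partition.info.file is made up for illustration. None of this is existing HCatalog code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

class PartitionInfoPointer {
    // Hypothetical key name, not an existing HCatalog or Hadoop setting.
    static final String POINTER_KEY = "hcat.partition.info.file";

    // Frontend side: write the partition info to a file and keep only
    // the file's path in the conf (Properties is a stand-in for JobConf).
    static void publish(Properties conf, List<String> partitionInfo) {
        try {
            Path file = Files.createTempFile("partition-info", ".txt");
            Files.write(file, partitionInfo, StandardCharsets.UTF_8);
            conf.setProperty(POINTER_KEY, file.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Task side: follow the pointer and read the partition info back.
    static List<String> load(Properties conf) {
        try {
            Path file = Paths.get(conf.getProperty(POINTER_KEY));
            return Files.readAllLines(file, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        publish(conf, Arrays.asList("dt=2012-09-01", "dt=2012-09-02"));
        System.out.println("conf holds only: " + conf.getProperty(POINTER_KEY));
        System.out.println("tasks read back: " + load(conf));
    }
}
```

The point is just that the conf stays the same size no matter how many partitions there are; in a real job the file would have to be placed somewhere the tasks can read it.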

Something I'm curious about is why the partition info we store in the
jobconf is so large. Naively, it feels like we may be serializing too
much, and some of it could be trimmed down.

As a starting point, I'd take a look at this test which creates a
table, adds a partition, and performs a query:

http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatHiveThriftCompatibility.java?view=markup

We could do something like this and just make a huge number of
partitions and see when the jobconf becomes "too large", then profile
exactly what that bloat is. From there we could investigate trimming
it down, compressing it, or using the distributed cache.
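
As a rough illustration of that kind of measurement, here's a JDK-only sketch that serializes a growing list of partition-like objects and prints the raw and gzipped sizes. PartitionStub and its fields are invented stand-ins, not HCatalog's real partition classes, and Java serialization is only an assumption about how the info might be encoded, so treat the numbers as a shape to look for rather than real data.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

class JobConfSizeProbe {
    // Hypothetical stand-in for per-partition info; not an HCatalog class.
    static class PartitionStub implements Serializable {
        private static final long serialVersionUID = 1L;
        final String location;
        final String schema;

        PartitionStub(int i) {
            this.location = "/warehouse/db/table/dt=" + i;
            // Distinct copy per partition, so it is written out every time
            // instead of being collapsed into a back-reference.
            this.schema = new String("id:int,name:string,ts:bigint");
        }
    }

    static List<PartitionStub> makePartitions(int n) {
        List<PartitionStub> parts = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            parts.add(new PartitionStub(i));
        }
        return parts;
    }

    // Serialize the whole list with plain Java serialization.
    static byte[] serialize(List<PartitionStub> parts) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(new ArrayList<>(parts));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Gzip a byte array so the compressed size can be compared.
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 1000, 100000}) {
            byte[] raw = serialize(makePartitions(n));
            System.out.println(n + " partitions: raw=" + raw.length
                    + " bytes, gzip=" + gzip(raw).length + " bytes");
        }
    }
}
```

A real test would swap the stub for whatever HCat actually puts in the jobconf, then look at which fields dominate as the partition count grows.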

Going with the test-first approach would be useful to pinpoint the
actual issue, then we can work out the right way to solve it.

Does this sound like a good starting point? Holler if you run into any
issues along the way!

--travis



On Tue, Sep 4, 2012 at 5:39 PM, Alan Gates <[email protected]> wrote:
>
> On Sep 1, 2012, at 4:38 PM, Renato Marroquín Mogrovejo wrote:
>
>> Hi Travis,
>>
>> Thanks a ton for this issue, I know I will enjoy solving it (: I
>> have some questions about this jira, even though I think I understand
>> what the problem is.
>>
>> - How do you think I should approach this? I mean if HCat can't send
>> the partitions' information through the configuration object, maybe we
>> should think of a different way of communicating this information
>> (thrift, or the database)?
> Thrift or the database aren't options.  You can't count on being able to 
> communicate with the client from the map tasks, not to mention you would 
> overwhelm the client.  One of the rules of hcat is the map and reduce tasks 
> should never talk to the database, as it isn't sized to handle large numbers 
> of tasks talking to it.
>
> My first thought would be to use the distributed cache.  You should only use 
> this option when you have a very large number of files.  But in that case 
> write them to a file, put that file in the distributed cache, and then put a 
> pointer to that in the job conf instead of the file list.
>
> Alan.
>
>> - I was looking at HCatLoader but I am not sure if this would be a good
>> entry point for the modifications. Any suggestions?
>>
>> Thanks again Travis!
>>
>>
>> Renato M.
>>
>>
>> 2012/8/30 Travis Crawford <[email protected]>:
>>> You might be interested in 
>>> https://issues.apache.org/jira/browse/HCATALOG-453
>>>
>>> The issue here is HCatalog queries the HiveMetaStore for info about
>>> the partitions to process, and stores that response in the job conf.
>>> When processing large numbers of partitions this bloats the job conf
>>> beyond what Hadoop will allow and the job fails.
>>>
>>> What's interesting about this issue is you'll learn about the main
>>> feature of HCatalog - translating db+table+partition_spec into a list
>>> of partitions, how HCat handles that internally, and how its
>>> communicated between the frontend & backend (note: "how it's"). The actual issue is
>>> straightforward, but I think spending the time to understand the
>>> problem will give a great overview of how HCat works.
>>>
>>> Thoughts?
>>>
>>> --travis
>>>
>>>
>>>
>>> On Thu, Aug 30, 2012 at 4:25 PM, Renato Marroquín Mogrovejo
>>> <[email protected]> wrote:
>>>> Travis,
>>>>
>>>> Thanks a lot for your response! My master's dissertation was about
>>>> using statistics to smarten up Apache Pig rule optimizer, so I would
>>>> love to help out with something related, but maybe you can suggest
>>>> some interesting jiras (not complicated ones but maybe "noobies" ones)
>>>> I can start with (:
>>>> And yeah the labels thing is much better than creating a JIRA type for
>>>> noobies. Thanks again!
>>>>
>>>>
>>>> Renato M.
>>>>
>>>> 2012/8/30 Travis Crawford <[email protected]>:
>>>>> Hey Renato -
>>>>>
>>>>> Awesome! What in particular are you interested in starting out with?
>>>>> We can definitely find a starter project for you in that area.
>>>>>
>>>>> JIRA issues can have a variety of attributes; the attribute I started
>>>>> this thread about is the "issue type".
>>>>>
>>>>> JIRA also has "labels", which I think are a great place to indicate
>>>>> something would be good for noobies. For example, there could be an
>>>>> "issue type" of bug, with "label" noobie.
>>>>>
>>>>> Let us know what area you're interested in diving into and we can help
>>>>> come up with a starter project for ya.
>>>>>
>>>>> --travis
>>>>>
>>>>>
>>>>> On Thu, Aug 30, 2012 at 9:21 AM, Renato Marroquín Mogrovejo
>>>>> <[email protected]> wrote:
>>>>>> Hi all,
>>>>>>
>>>>>> I am new to HCatalog but I would like to get involved with the
>>>>>> project, and one thing that would totally help is to create an issue
>>>>>> type that indicates it is for "newbies". I saw that in Apache Pig they
>>>>>> have a special type of issue for this and with this they try to engage
>>>>>> more with the community. This would be awesome guys!
>>>>>> Thanks in advance!
>>>>>>
>>>>>>
>>>>>> Renato M.
>>>>>>
>>>>>> 2012/8/30 Travis Crawford <[email protected]>:
>>>>>>> Hey hcat gurus -
>>>>>>>
>>>>>>> Filing an issue just now I noticed the list of possible issue types
>>>>>>> is pretty crazy long - any objection to requesting a simplification
>>>>>>> to:
>>>>>>>
>>>>>>> PROPOSED ISSUE TYPES:
>>>>>>>
>>>>>>> Bug - fixing unintended behavior
>>>>>>> New Feature - addition of brand-new functionality
>>>>>>> Improvement - making existing functionality better
>>>>>>>
>>>>>>> CURRENT ISSUE TYPES:
>>>>>>>
>>>>>>> Bug
>>>>>>> New Feature
>>>>>>> Improvement
>>>>>>> Test
>>>>>>> Wish
>>>>>>> Task
>>>>>>> New JIRA Project
>>>>>>> RTC
>>>>>>> TCK Challenge
>>>>>>> Question
>>>>>>> Temp
>>>>>>> Brainstorming
>>>>>>> Umbrella
>>>>>>> Epic
>>>>>>> Dependency upgrade
>>>>>>> Suitable Name Search
>>>>>>>
>>>>>>> If this sounds good I'll ping the infra folks and try to make this 
>>>>>>> happen.
>>>>>>>
>>>>>>> --travis
>
