Interesting that Hive solves this with a separate file in the
distributed cache; I was curious how Hive dealt with it.

Given that Hadoop has Jackson as a dependency, is it safe to assume
every HCatalog user will have Jackson available? If we took the same
approach as Hive and serialized to XML we would not require another
dependency.
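
To make the idea concrete, here is a rough sketch (not actual HCatalog
or Hive code) of writing partition info to an XML file with the JDK's
built-in java.beans.XMLEncoder and keeping only a pointer in the conf.
The class name PartitionInfoFile, the hcat.partition.info.file key, and
the map-of-strings partition representation are all made up for
illustration, and a plain Map stands in for the job conf. Shipping the
file to tasks through the distributed cache is left out so the sketch
only needs the JDK.

```java
import java.beans.XMLDecoder;
import java.beans.XMLEncoder;
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not an HCatalog class.
class PartitionInfoFile {

    // Hypothetical conf key that holds the path of the side file.
    static final String POINTER_KEY = "hcat.partition.info.file";

    // Serialize the partition list to XML using only JDK classes.
    static void write(List<Map<String, String>> partitions, File file) throws IOException {
        try (XMLEncoder enc = new XMLEncoder(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            enc.writeObject(new ArrayList<Map<String, String>>(partitions));
        }
    }

    // Read the partition list back, as a task would after localizing the file.
    @SuppressWarnings("unchecked")
    static List<Map<String, String>> read(File file) throws IOException {
        try (XMLDecoder dec = new XMLDecoder(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (List<Map<String, String>>) dec.readObject();
        }
    }

    public static void main(String[] args) throws IOException {
        // Made-up partition descriptors standing in for the metastore response.
        List<Map<String, String>> partitions = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            Map<String, String> p = new HashMap<>();
            p.put("location", "/warehouse/t/ds=" + i);
            p.put("inputFormat", "example.InputFormat");
            partitions.add(p);
        }

        File file = File.createTempFile("partition-info", ".xml");
        file.deleteOnExit();
        write(partitions, file);

        // Stand-in for the job conf: it carries the pointer, not the data.
        Map<String, String> conf = new HashMap<>();
        conf.put(POINTER_KEY, file.getAbsolutePath());

        List<Map<String, String>> loaded = read(new File(conf.get(POINTER_KEY)));
        System.out.println("partitions read back: " + loaded.size());
        System.out.println("characters kept in conf: " + conf.get(POINTER_KEY).length());
        System.out.println("bytes in side file: " + file.length());
    }
}
```

XMLEncoder output is verbose, so this would only get us past the
jobconf size limit; it does nothing about the size of the serialized
data itself.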

Renato - what are your thoughts? We don't want to do the whole patch for you ;)

--travis


On Wed, Sep 5, 2012 at 1:35 PM, Rohini Palaniswamy
<[email protected]> wrote:
> Yes, it would be a good idea to serialize that to a separate file and use
> the distributed cache for it. Hive does it that way by serializing the plan
> and partition information (MapredWork) to an XML file. We should also
> investigate the serialized data and the way we serialize it to avoid
> bloat or inefficiency. For example, in Hive the serialization is done
> badly (HIVE-2988), and that makes the client require at least 1G of memory
> when querying a table with a large number of partitions.
>
>   I think the easier approach would be to move it to a separate file first
> and avoid the max jobconf issue, and then work on optimizing the serialized
> data for size. However much we optimize it, keeping the partition
> information in the jobconf will not scale for very big tables.
>
> -Rohini
>
> On Wed, Sep 5, 2012 at 7:32 AM, Travis Crawford
> <[email protected]> wrote:
>
>> Thought I wrote back to this on the train but it didn't send :/
>>
>> Agreed the distributed cache would be a good way to distribute this
>> info to worker tasks, since it reuses an existing MR feature for
>> sharing files with worker tasks.
>>
>> Something I'm curious about is why the partition info we store in the
>> jobconf is so large. Naively, it feels like we may be serializing too
>> much stuff that could be trimmed down.
>>
>> As a starting point, I'd take a look at this test which creates a
>> table, adds a partition, and performs a query:
>>
>>
>> http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatHiveThriftCompatibility.java?view=markup
>>
>> We could do something like this and just make a huge number of
>> partitions and see when the jobconf becomes "too large", then profile
>> exactly what that bloat is. We could investigate trimming it down,
>> compressing, or using distributed cache.
>>
>> Going with the test-first approach would be useful to pinpoint the
>> actual issue; then we can investigate the right approach to solve it.
>>
>> Does this sound like a good starting point? Holler if you run into any
>> issues along the way!
>>
>> --travis
>>
>>
>>
>> On Tue, Sep 4, 2012 at 5:39 PM, Alan Gates <[email protected]> wrote:
>> >
>> > On Sep 1, 2012, at 4:38 PM, Renato Marroquín Mogrovejo wrote:
>> >
>> >> Hi Travis,
>> >>
>> >> Thanks a ton for this issue I know I will enjoy solving this (: So I
>> >> have some questions about this jira even though I think I understand
>> >> what the problem is.
>> >>
>> >> - How do you think I should approach this? I mean if HCat can't send
>> >> the partitions' information through the configuration object, maybe we
>> >> should think of a different way of communicating this information
>> >> (thrift, or the database)?
>> > Thrift or the database aren't options.  You can't count on being able to
>> > communicate with the client from the map tasks, not to mention you would
>> > overwhelm the client.  One of the rules of hcat is the map and reduce tasks
>> > should never talk to the database, as it isn't sized to handle large
>> > numbers of tasks talking to it.
>> >
>> > My first thought would be to use the distributed cache.  You should only
>> > use this option when you have a very large number of files.  But in that
>> > case write them to a file, put that file in the distributed cache, and then
>> > put a pointer to that in the job conf instead of the file list.
>> >
>> > Alan.
>> >
>> >> - I was looking at HCatLoader but I am not sure if this would be a good
>> >> entry point for the modifications. Any suggestions?
>> >>
>> >> Thanks again Travis!
>> >>
>> >>
>> >> Renato M.
>> >>
>> >>
>> >> 2012/8/30 Travis Crawford <[email protected]>:
>> >>> You might be interested in
>> >>> https://issues.apache.org/jira/browse/HCATALOG-453
>> >>>
>> >>> The issue here is HCatalog queries the HiveMetaStore for info about
>> >>> the partitions to process, and stores that response in the job conf.
>> >>> When processing large numbers of partitions this bloats the job conf
>> >>> beyond what Hadoop will allow and the job fails.
>> >>>
>> >>> What's interesting about this issue is you'll learn about the main
>> >>> feature of HCatalog - translating db+table+partition_spec into a list
>> >>> of partitions, how HCat handles that internally, and how it's
>> >>> communicated between the frontend & backend. The actual issue is
>> >>> straightforward, but I think spending the time to understand the
>> >>> problem will give a great overview of how HCat works.
>> >>>
>> >>> Thoughts?
>> >>>
>> >>> --travis
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Aug 30, 2012 at 4:25 PM, Renato Marroquín Mogrovejo
>> >>> <[email protected]> wrote:
>> >>>> Travis,
>> >>>>
>> >>>> Thanks a lot for your response! My master's dissertation was about
>> >>>> using statistics to smarten up the Apache Pig rule optimizer, so I would
>> >>>> love to help out with something related, but maybe you can suggest
>> >>>> some interesting jiras (not complicated ones but maybe "noobies" ones)
>> >>>> I can start with (:
>> >>>> And yeah the labels thing is much better than creating a jira type for
>> >>>> noobies. Thanks again!
>> >>>>
>> >>>>
>> >>>> Renato M.
>> >>>>
>> >>>> 2012/8/30 Travis Crawford <[email protected]>:
>> >>>>> Hey Renato -
>> >>>>>
>> >>>>> Awesome! What in particular are you interested in starting out with?
>> >>>>> We can definitely find a starter project for you in that area.
>> >>>>>
>> >>>>> JIRA issues can have a variety of attributes; the attribute I started
>> >>>>> this thread about is the "issue type".
>> >>>>>
>> >>>>> JIRA also has "labels", which I think are a great place to indicate
>> >>>>> something would be good for noobies. For example, there could be an
>> >>>>> "issue type" of bug, with "label" noobie.
>> >>>>>
>> >>>>> Let us know what area you're interested in diving into and we can
>> >>>>> help come up with a starter project for ya.
>> >>>>>
>> >>>>> --travis
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Aug 30, 2012 at 9:21 AM, Renato Marroquín Mogrovejo
>> >>>>> <[email protected]> wrote:
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> I am new to HCatalog but I would like to get involved with the
>> >>>>>> project, and one thing that would totally help is to create an issue
>> >>>>>> type that indicates it is for "newbies". I saw that in Apache Pig
>> >>>>>> they have a special type of issue for this and with this they try to
>> >>>>>> engage more with the community. This would be awesome guys!
>> >>>>>> Thanks in advance!
>> >>>>>>
>> >>>>>>
>> >>>>>> Renato M.
>> >>>>>>
>> >>>>>> 2012/8/30 Travis Crawford <[email protected]>:
>> >>>>>>> Hey hcat gurus -
>> >>>>>>>
>> >>>>>>> Filing an issue just now I noticed the list of possible issue
>> >>>>>>> types is pretty crazy long - any objection to requesting a
>> >>>>>>> simplification to:
>> >>>>>>>
>> >>>>>>> PROPOSED ISSUE TYPES:
>> >>>>>>>
>> >>>>>>> Bug - fixing unintended behavior
>> >>>>>>> New Feature - addition of brand-new functionality
>> >>>>>>> Improvement - making existing functionality better
>> >>>>>>>
>> >>>>>>> CURRENT ISSUE TYPES:
>> >>>>>>>
>> >>>>>>> Bug
>> >>>>>>> New Feature
>> >>>>>>> Improvement
>> >>>>>>> Test
>> >>>>>>> Wish
>> >>>>>>> Task
>> >>>>>>> New JIRA Project
>> >>>>>>> RTC
>> >>>>>>> TCK Challenge
>> >>>>>>> Question
>> >>>>>>> Temp
>> >>>>>>> Brainstorming
>> >>>>>>> Umbrella
>> >>>>>>> Epic
>> >>>>>>> Dependency upgrade
>> >>>>>>> Suitable Name Search
>> >>>>>>>
>> >>>>>>> If this sounds good I'll ping the infra folks and try to make this
>> >>>>>>> happen.
>> >>>>>>>
>> >>>>>>> --travis
>> >
>>
