Interesting that Hive solves this with a separate file in the distributed cache, I was curious how Hive dealt with it.
Given that Hadoop has Jackson as a dependency, is it safe to assume every HCatalog user will have Jackson available? If we took the same approach and serialized to XML we would not require another dependency.

Renato - what are your thoughts? We don't want to do the whole patch for you ;)

--travis

On Wed, Sep 5, 2012 at 1:35 PM, Rohini Palaniswamy <[email protected]> wrote:

> Yes, it would be a good idea to serialize that to a separate file and use
> distributed cache for it. Hive does it that way by serializing the plan and
> partition information (MapredWork) to an XML file. And we should also
> investigate the serialized data and the way we serialize it to avoid
> bloating or inefficiency. For example, in Hive the serialization is done
> badly (HIVE-2988) and that makes the client require at least 1G of memory
> when querying a table involving a large number of partitions.
>
> I think the easier approach would be to move it to a separate file first
> and avoid the max jobconf issue and then work on optimizing the serialized
> data for size. Because however much we optimize it, it will not prove
> scalable to keep the partition information in jobconf for very big tables.
>
> -Rohini
>
> On Wed, Sep 5, 2012 at 7:32 AM, Travis Crawford
> <[email protected]> wrote:
>
>> Thought I wrote back to this on the train but it didn't send :/
>>
>> Agreed the distributed cache would be a good way to distribute this
>> info to worker tasks, since it reuses an existing MR feature for
>> sharing files with worker tasks.
>>
>> Something I'm curious about is why the partition info we store in the
>> jobconf is so large, it naively feels like we may be serializing too
>> much stuff and that could be trimmed down.
>>
>> As a starting point, I'd take a look at this test which creates a
>> table, adds a partition, and performs a query:
>>
>>
>> http://svn.apache.org/viewvc/incubator/hcatalog/trunk/src/test/org/apache/hcatalog/mapreduce/TestHCatHiveThriftCompatibility.java?view=markup
>>
>> We could do something like this and just make a huge number of
>> partitions and see when the jobconf becomes "too large", then profile
>> exactly what that bloat is. We could investigate trimming it down,
>> compressing, or using distributed cache.
>>
>> Going with the test-first approach would be useful to pinpoint the
>> actual issue, then we can investigate the right approach to solve.
>>
>> Does this sound like a good starting point? Holler if you run into any
>> issues along the way!
>>
>> --travis
>>
>>
>>
>> On Tue, Sep 4, 2012 at 5:39 PM, Alan Gates <[email protected]> wrote:
>> >
>> > On Sep 1, 2012, at 4:38 PM, Renato Marroquín Mogrovejo wrote:
>> >
>> >> Hi Travis,
>> >>
>> >> Thanks a ton for this issue I know I will enjoy solving this (: So I
>> >> have some questions about this jira even though I think I understand
>> >> what the problem is.
>> >>
>> >> - How do you think I should approach this? I mean if HCat can't send
>> >> the partitions' information through the configuration object, maybe we
>> >> should think of a different way of communicating this information
>> >> (thrift, or the database)?
>> > Thrift or the database aren't options. You can't count on being able to
>> > communicate with the client from the map tasks, not to mention you would
>> > overwhelm the client. One of the rules of hcat is the map and reduce tasks
>> > should never talk to the database, as it isn't sized to handle large
>> > numbers of tasks talking to it.
>> >
>> > My first thought would be to use the distributed cache. You should only
>> > use this option when you have a very large number of files. But in that
>> > case write them to a file, put that file in the distributed cache, and then
>> > put a pointer to that in the job conf instead of the file list.
>> >
>> > Alan.
>> >
>> >> - I was looking at HCatLoader but I am not sure if this would be a good
>> >> entry point for the modifications. Any suggestions?
>> >>
>> >> Thanks again Travis!
>> >>
>> >>
>> >> Renato M.
>> >>
>> >>
>> >> 2012/8/30 Travis Crawford <[email protected]>:
>> >>> You might be interested in https://issues.apache.org/jira/browse/HCATALOG-453
>> >>>
>> >>> The issue here is HCatalog queries the HiveMetaStore for info about
>> >>> the partitions to process, and stores that response in the job conf.
>> >>> When processing large numbers of partitions this bloats the job conf
>> >>> beyond what Hadoop will allow and the job fails.
>> >>>
>> >>> What's interesting about this issue is you'll learn about the main
>> >>> feature of HCatalog - translating db+table+partition_spec into a list
>> >>> of partitions, how HCat handles that internally, and how it's
>> >>> communicated between the frontend & backend. The actual issue is
>> >>> straightforward, but I think spending the time to understand the
>> >>> problem will give a great overview of how HCat works.
>> >>>
>> >>> Thoughts?
>> >>>
>> >>> --travis
>> >>>
>> >>>
>> >>>
>> >>> On Thu, Aug 30, 2012 at 4:25 PM, Renato Marroquín Mogrovejo
>> >>> <[email protected]> wrote:
>> >>>> Travis,
>> >>>>
>> >>>> Thanks a lot for your response! My master's dissertation was about
>> >>>> using statistics to smarten up Apache Pig rule optimizer, so I would
>> >>>> love to help out with something related, but maybe you can suggest me
>> >>>> some interesting jiras (not complicated ones but maybe "noobies" ones)
>> >>>> I can start with (:
>> >>>> And yeah the labels thing is much better than creating a jira type for
>> >>>> noobies. Thanks again!
>> >>>>
>> >>>>
>> >>>> Renato M.
>> >>>>
>> >>>> 2012/8/30 Travis Crawford <[email protected]>:
>> >>>>> Hey Renato -
>> >>>>>
>> >>>>> Awesome! What in particular are you interested in starting out with?
>> >>>>> We can definitely find a starter project for you in that area.
>> >>>>>
>> >>>>> JIRA issues can have a variety of attributes; the attribute I started
>> >>>>> this thread about is the "issue type".
>> >>>>>
>> >>>>> JIRA also has "labels", which I think are a great place to indicate
>> >>>>> something would be good for noobies. For example, there could be an
>> >>>>> "issue type" of bug, with "label" noobie.
>> >>>>>
>> >>>>> Let us know what area you're interested in diving into and we can help
>> >>>>> come up with a starter project for ya.
>> >>>>>
>> >>>>> --travis
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Aug 30, 2012 at 9:21 AM, Renato Marroquín Mogrovejo
>> >>>>> <[email protected]> wrote:
>> >>>>>> Hi all,
>> >>>>>>
>> >>>>>> I am new to HCatalog but I would like to get involved with the
>> >>>>>> project, and one thing that would totally help is to create an issue
>> >>>>>> type that indicates it is for "newbies". I saw that in Apache Pig they
>> >>>>>> have a special type of issue for this and with this they try to engage
>> >>>>>> more with the community. This would be awesome guys!
>> >>>>>> Thanks in advance!
>> >>>>>>
>> >>>>>>
>> >>>>>> Renato M.
>> >>>>>>
>> >>>>>> 2012/8/30 Travis Crawford <[email protected]>:
>> >>>>>>> Hey hcat gurus -
>> >>>>>>>
>> >>>>>>> Filing an issue just now I noticed the list of possible option types
>> >>>>>>> is pretty crazy long - any objection to requesting a simplification
>> >>>>>>> to:
>> >>>>>>>
>> >>>>>>> PROPOSED ISSUE TYPES:
>> >>>>>>>
>> >>>>>>> Bug - fixing unintended behavior
>> >>>>>>> New Feature - addition of brand-new functionality
>> >>>>>>> Improvement - making existing functionality better
>> >>>>>>>
>> >>>>>>> CURRENT ISSUE TYPES:
>> >>>>>>>
>> >>>>>>> Bug
>> >>>>>>> New Feature
>> >>>>>>> Improvement
>> >>>>>>> Test
>> >>>>>>> Wish
>> >>>>>>> Task
>> >>>>>>> New JIRA Project
>> >>>>>>> RTC
>> >>>>>>> TCK Challenge
>> >>>>>>> Question
>> >>>>>>> Temp
>> >>>>>>> Brainstorming
>> >>>>>>> Umbrella
>> >>>>>>> Epic
>> >>>>>>> Dependency upgrade
>> >>>>>>> Suitable Name Search
>> >>>>>>>
>> >>>>>>> If this sounds good I'll ping the infra folks and try to make this happen.
>> >>>>>>>
>> >>>>>>> --travis
>> >
>>
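
For anyone reading this thread later, here is a rough sketch of the two ideas discussed above: growing the partition count until the job conf gets "too large", and replacing the inline partition list with a pointer to a side file. It is only an illustration in plain Python, not HCatalog or Hadoop code. A dict stands in for the job conf, a temp directory stands in for the distributed cache, and every name in it (the `hcat.partitions.file` key, the helper functions, the 64 KB limit) is made up for the example.

```python
import json
import os
import tempfile

# Hypothetical names throughout; none of this is HCatalog or Hadoop API.
PARTITIONS_FILE_KEY = "hcat.partitions.file"  # assumed conf key
JOBCONF_LIMIT_BYTES = 64 * 1024               # illustrative limit only


def conf_size(conf):
    """Size in bytes of the conf when serialized as JSON."""
    return len(json.dumps(conf).encode("utf-8"))


def make_partitions(n):
    """Fake partition descriptors, standing in for metastore responses."""
    return [
        {"location": f"/warehouse/t/ds={i:06d}", "values": [f"{i:06d}"]}
        for i in range(n)
    ]


def conf_inline(partitions):
    """Frontend, behaviour described in the thread: partitions inside the conf."""
    return {"job.name": "demo", "hcat.partitions": json.dumps(partitions)}


def conf_with_pointer(partitions, cache_dir):
    """Frontend, proposed: write partitions to a side file, keep only its path."""
    fd, path = tempfile.mkstemp(suffix=".json", dir=cache_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(partitions, f)
    return {"job.name": "demo", PARTITIONS_FILE_KEY: path}


def load_partitions(conf):
    """Backend (task side): follow the pointer and read the side file."""
    with open(conf[PARTITIONS_FILE_KEY], encoding="utf-8") as f:
        return json.load(f)


def first_count_over_limit(limit=JOBCONF_LIMIT_BYTES, step=100):
    """Grow the partition count until the inline conf exceeds the limit."""
    n = step
    while conf_size(conf_inline(make_partitions(n))) <= limit:
        n += step
    return n


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as cache_dir:
        parts = make_partitions(5000)
        pointer_conf = conf_with_pointer(parts, cache_dir)
        # The task side recovers exactly what the frontend wrote.
        assert load_partitions(pointer_conf) == parts
        print("inline conf bytes: ", conf_size(conf_inline(parts)))
        print("pointer conf bytes:", conf_size(pointer_conf))
    print("first partition count over limit:", first_count_over_limit())
```

The pointer conf stays about the same size no matter how many partitions there are, which is the point Rohini makes about scalability. A real patch would still need to pick a serialization format and register the file with the distributed cache, neither of which this sketch attempts.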
