Hi Min, Hyunsik,

I am throwing my name in to help run Tajo on YARN since this is, I believe, one of the most pressing issues for making Tajo part of the Hadoop ecosystem.
I would love to work with Min, Hyunsik, and anyone else interested to make this happen. I heard Hyoung Jun is already looking at Slider (known as Hoya before), so I am looking forward to hearing more about it. I was thinking about Slider (AKA Hoya, potentially incubating), Twill, or Apache Helix (with support for provisioning in YARN or Mesos).

- Henry

On Fri, Apr 4, 2014 at 6:19 AM, Hyunsik Choi <[email protected]> wrote:
> Hi Min,
>
>> I'd like to see Tajo run on a YARN cluster. This is quite useful for
>> sharing data with other distributed systems, like MapReduce and Spark.
>
> Yes, I missed YARN! Thank you for suggesting it. We cannot postpone
> YARN support. In my view, Llama or Slider would be a nice candidate at
> this time for deploying a Tajo instance in a YARN cluster. We need to
> schedule it in our short-term roadmap. What do you think about it?
>
>> Besides that, I think basic user authentication like Hadoop's
>> UserGroupInformation is useful for multiple users sharing a Tajo
>> cluster's computing capacity.
>
> I agree with this idea. I'll file YARN and UserGroupInformation under
> the multi-tenancy category in our roadmap.
>
>> It seems I have added more work to do. Can we release some sprints
>> internally? After the sprint, can we make an official release?
>
> We can make an official release after the sprint. That is what I
> intended.
>
>> Regarding shuffle, do you have any proposal to improve it? Could you
>> just drop a few lines to show your opinion here?
>
> The main issue with shuffle is that, as in earlier MR and Spark, too
> many small files are created during the shuffle phase. This approach
> results in a lot of random I/O and puts a nontrivial burden on the
> operating system. Consequently, it also limits scalability and is not
> efficient. As you know, the typical solution is to make one
> consolidated file per task (sorted and grouped by shuffle key) with a
> simple index. As far as I know, MR and Spark work this way.
> In addition, OS cache utilization of intermediate data and smart
> scheduling between writing and fetching would help to improve the
> current shuffle approach.
>
> Thanks,
> Hyunsik
>
> On Fri, Apr 4, 2014 at 2:56 PM, Min Zhou <[email protected]> wrote:
>> Hi Hyunsik,
>>
>> I'd like to see Tajo run on a YARN cluster. This is quite useful for
>> sharing data with other distributed systems, like MapReduce and Spark.
>>
>> Besides that, I think basic user authentication like Hadoop's
>> UserGroupInformation is useful for multiple users sharing a Tajo
>> cluster's computing capacity.
>>
>> Both of the above are part of multi-tenancy support.
>>
>> It seems I have added more work to do. Can we release some sprints
>> internally? After the sprint, can we make an official release?
>>
>> Regarding shuffle, do you have any proposal to improve it? Could you
>> just drop a few lines to show your opinion here?
>>
>> Min
>>
>>
>> On Thu, Apr 3, 2014 at 10:24 PM, Hyunsik Choi <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I'm very happy to see that our community is growing! Also, it's a
>>> pleasure to discuss the Tajo 0.8.0 release. Recently, I've tested
>>> various features in various contexts and tried to figure out whether
>>> there are any critical problems. I think that there are only a few
>>> issues and we can release 0.8.0 next week. If there are further
>>> issues to be solved before the 0.8.0 release, feel free to suggest
>>> ideas.
>>>
>>> Also, I'd like to discuss our next roadmap. We are open to any
>>> suggestion from users, contributors, and committers. Please fire away!
>>>
>>> I'm thinking that our next stage should focus on improving the way
>>> Tajo runs on large clusters with thousands of nodes and for a large
>>> number of concurrent users. The key issues associated with this
>>> include the following:
>>>
>>> * High availability
>>> * Multi-tenancy scheduling
>>> * More stability
>>> * Improved shuffle
>>>
>>> The current work status is as follows.
>>> Min is working on Tajo's new scheduler (TAJO-540), based on Sparrow.
>>> I'll support him. As far as I know, Alvin is working on TajoMaster
>>> HA (TAJO-704). Also, several people, including myself, are
>>> investigating and solving the issues which occur in large clusters.
>>> These issues should be solved in order to make Tajo a complete,
>>> enterprise-ready product.
>>>
>>> In addition, there are some SQL feature support issues. Many analytic
>>> problems require window functions. Also, IN subqueries and scalar
>>> subqueries should be supported. So, I'd like to schedule them with
>>> high priority. In my view, there will be very few SQL support issues
>>> once Tajo provides these features.
>>>
>>> Besides those areas, David is working on a nested schema and its
>>> related work (TAJO-710). I guess this will take quite a while because
>>> it requires a lot of hard work. So, it would be great to schedule the
>>> nested schema loosely. Those are just my thoughts, anyhow.
>>>
>>> Aside from the discussion of our roadmap, I'd like to suggest that we
>>> release more frequently after the 0.8.0 release. So far, there has
>>> been a long period between releases because Tajo is undergoing heavy
>>> development. By 'releasing early, releasing often', we will create a
>>> tighter feedback loop between users and developers.
>>>
>>> I think that there are many additional interesting issues to be
>>> included in our roadmap. Feel free to suggest your ideas. We will
>>> arrange our short-term and long-term roadmaps based on your
>>> suggestions.
>>>
>>> Thank you all so much for your contribution!
>>>
>>> Warm Regards,
>>> Hyunsik
>>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
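
A note on the shuffle proposal quoted above: the idea of writing one consolidated file per task, sorted and grouped by shuffle key, plus a simple index, can be illustrated with a minimal sketch. This is not Tajo's, MapReduce's, or Spark's actual code. The function names, the tab-separated record format, and the CRC32 partitioning are all assumptions made up for illustration, and keys are assumed to contain no tabs or newlines.

```python
import io
import zlib
from collections import defaultdict


def partition_of(key, num_partitions):
    # Hypothetical partitioner: a stable CRC32 hash of the shuffle key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_consolidated(records, num_partitions, out):
    """Write every partition of one task into a single stream.

    Returns the index: one (offset, length) pair per partition, so a
    fetcher can read its partition with a single seek instead of
    opening one small file per (task, partition) pair.
    """
    buckets = defaultdict(list)
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append((key, value))

    index = []
    for p in range(num_partitions):
        start = out.tell()
        # Rows are grouped by partition and sorted by key within it.
        for key, value in sorted(buckets[p]):
            out.write(f"{key}\t{value}\n".encode("utf-8"))
        index.append((start, out.tell() - start))
    return index


def read_partition(src, index, p):
    # One seek plus one sequential read per requested partition.
    offset, length = index[p]
    src.seek(offset)
    data = src.read(length).decode("utf-8")
    return [tuple(line.split("\t", 1)) for line in data.split("\n")[:-1]]


if __name__ == "__main__":
    records = [("b", "2"), ("a", "1"), ("c", "3"), ("a", "0")]
    out = io.BytesIO()  # stands in for the task's single output file
    index = write_consolidated(records, 3, out)
    for p in range(3):
        print(p, index[p], read_partition(out, index, p))
```

With M tasks and R partitions, this layout produces M data files (plus their indexes) rather than M x R small files, which is the reduction in random I/O the proposal is after.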
