Re: [DISCUSS] 0.8.0 release and next roadmap

Henry Saputra Sat, 05 Apr 2014 13:59:23 -0700

Hi Min, Hyunsik,

I am throwing my name in to help run Tajo on YARN since this is, I
believe, one of the most pressing issues for making Tajo part of the
Hadoop ecosystem.


I would love to work with Min, Hyunsik, and anyone else interested to
make this happen.
I heard Hyoung Jun is already looking at Slider (formerly known as Hoya),
so I am looking forward to hearing more about it.

I was thinking about Slider (AKA Hoya, potentially incubating), Twill,
or Apache Helix (with support for provisioning in YARN or Mesos).

- Henry

On Fri, Apr 4, 2014 at 6:19 AM, Hyunsik Choi <[email protected]> wrote:
> Hi Min,
>
>> I'd like to see Tajo run on a Yarn cluster. This is quite useful for
>> sharing data with other distributed systems, like MapReduce and Spark.
>
> Yes, I missed Yarn! Thank you for suggesting it. We cannot postpone
> Yarn support. In my view, Llama or Slider would be a good candidate at
> this time for deploying a Tajo instance in a Yarn cluster. We
> need to schedule it in our short-term roadmap. What do you think about
> it?
>
>> Besides that, I think basic user authentication like Hadoop's
>> UserGroupInformation is useful for multiple users sharing a Tajo cluster's
>> computing capacity.
>
> I agree with this idea. I'll file Yarn and UserGroupInformation under the
> multi-tenancy category in our roadmap.
>
>> It seems I have added more work to do. Can we make internal releases for
>> some sprints? After the sprint, can we make an official release?
>
> We can make an official release after the sprint. That is what I intended.
>
>> Regarding shuffle, do you have any proposal to improve it? Could you just
>> drop a few lines to share your opinion here?
>
> The main issue with shuffle is that, as in earlier MR and Spark, too
> many small files are created during the shuffle phase. This approach
> results in a lot of random I/O and places a nontrivial burden on the
> operating system. Consequently, it limits scalability and is not
> efficient. As you know, the typical solution is to write one consolidated
> file (sorted and grouped by shuffle key) per task, with a simple
> index. As far as I know, MR and Spark work this way. In addition,
> better OS cache utilization for intermediate data and smart scheduling
> between writing and fetching would help improve the current shuffle
> approach.
>
> Thanks,
> Hyunsik
>
> On Fri, Apr 4, 2014 at 2:56 PM, Min Zhou <[email protected]> wrote:
>> Hi Hyunsik,
>>
>> I'd like to see Tajo run on a Yarn cluster. This is quite useful for
>> sharing data with other distributed systems, like MapReduce and Spark.
>>
>> Besides that, I think basic user authentication like Hadoop's
>> UserGroupInformation is useful for multiple users sharing a Tajo cluster's
>> computing capacity.
>>
>> The above two are both part of multi-tenancy support.
>>
>> It seems I have added more work to do. Can we make internal releases for
>> some sprints? After the sprint, can we make an official release?
>>
>> Regarding shuffle, do you have any proposal to improve it? Could you
>> just drop a few lines to share your opinion here?
>>
>>
>>
>>
>> Min
>>
>>
>> On Thu, Apr 3, 2014 at 10:24 PM, Hyunsik Choi <[email protected]> wrote:
>>
>>> Hi folks,
>>>
>>> I'm very happy to see that our community is growing! Also, it's a pleasure
>>> to discuss the Tajo 0.8.0 release. Recently, I've tested various features
>>> in various contexts, and tried to figure out if there are any critical
>>> problems. I think that there are only a few issues and we can release 0.8.0
>>> next week. If there are further issues to be solved before the 0.8.0
>>> release, feel free to suggest ideas.
>>>
>>> Also, I'd like to discuss our next roadmap. We are open to any suggestion
>>> from users, contributors, and committers. Please fire away!
>>>
>>> I'm thinking that our next stage should focus on improving the way Tajo
>>> runs on large clusters of thousands of nodes and for a number of concurrent
>>> users. The key issues associated with this include the following:
>>>
>>> * High availability
>>> * Multi-tenancy scheduling
>>> * More stability
>>> * Improved shuffle
>>>
>>> The current work status is as follows. Min is working on Tajo's new
>>> scheduler (TAJO-540) based on Sparrow. I'll support him. As far as I know,
>>> Alvin is working on TajoMaster HA (TAJO-704). Also, some guys including
>>> myself are investigating and solving the issues that occur in large
>>> clusters. These issues should be solved in order to make Tajo a complete
>>> enterprise-ready product.
>>>
>>> In addition, there are some SQL feature support issues. Many analytic
>>> problems require window functions. Also, IN subqueries and scalar subqueries
>>> should be supported. So, I'd like to schedule them with high priority. In
>>> my view, there will be very few SQL support issues left once Tajo provides
>>> these features.
>>>
>>> Besides those areas, David is working on a nested schema and its related
>>> work (TAJO-710). I guess this will take quite a while because it requires a
>>> lot of hard work. So, it would be great to schedule the nested schema
>>> loosely. Those are just my thoughts, anyhow.
>>>
>>> Aside from the discussion of our roadmap, I'd like to suggest that we need
>>> to release more frequently after the 0.8.0 release. So far, there has been
>>> a long period between each release because Tajo is undergoing heavy
>>> development. By 'releasing early, releasing often', we will create a
>>> tighter feedback loop between users and developers.
>>>
>>> I think that there are many additional interesting issues to be
>>> included in our roadmap. Feel free to suggest your ideas. We will arrange
>>> our short-term roadmap and long-term roadmap based on your suggestions.
>>>
>>> Thank you all so much for your contribution!
>>>
>>> Warm Regards,
>>> Hyunsik
>>>
>>
>>
>>
>> --
>> My research interests are distributed systems, parallel computing and
>> bytecode based virtual machine.
>>
>> My profile:
>> http://www.linkedin.com/in/coderplay
>> My blog:
>> http://coderplay.javaeye.com
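
For illustration, here is a minimal sketch of the consolidated shuffle file that Hyunsik describes in his reply above: one file per task, sorted and grouped by shuffle key, plus a simple index. This is not Tajo's, MR's, or Spark's implementation. The function names, the length-prefixed record layout, and the crc32-based partitioning are all assumptions made for the example.

```python
import os
import struct
import tempfile
import zlib


def partition_of(key: bytes, num_partitions: int) -> int:
    # Hypothetical partitioner: crc32 is stable across processes,
    # unlike Python's built-in hash() on bytes.
    return zlib.crc32(key) % num_partitions


def write_consolidated(records, num_partitions, data_path):
    """Write all of one task's shuffle output into a single file.

    Returns the index: one (offset, length) pair per partition.
    """
    buckets = [[] for _ in range(num_partitions)]
    for key, value in records:
        buckets[partition_of(key, num_partitions)].append((key, value))

    index = []
    with open(data_path, "wb") as out:
        for bucket in buckets:
            start = out.tell()
            # Records are sorted by key within each partition.
            for key, value in sorted(bucket):
                # Assumed layout: 4-byte key length, 4-byte value length,
                # then the key bytes and the value bytes.
                out.write(struct.pack(">II", len(key), len(value)))
                out.write(key)
                out.write(value)
            index.append((start, out.tell() - start))
    return index


def read_partition(data_path, index, partition):
    """Fetch one partition with a single seek and one sequential read."""
    offset, length = index[partition]
    with open(data_path, "rb") as f:
        f.seek(offset)
        blob = f.read(length)
    records, pos = [], 0
    while pos < len(blob):
        klen, vlen = struct.unpack_from(">II", blob, pos)
        pos += 8
        key = blob[pos:pos + klen]
        pos += klen
        value = blob[pos:pos + vlen]
        pos += vlen
        records.append((key, value))
    return records


if __name__ == "__main__":
    sample = [(b"b", b"2"), (b"a", b"1"), (b"c", b"3"), (b"a", b"0")]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "task-0.shuffle")
        idx = write_consolidated(sample, 3, path)
        for p in range(3):
            print(p, read_partition(path, idx, p))
```

With this layout a task produces one data file regardless of the number of partitions, and a fetcher needs only the small index to read its partition sequentially, which is the point of the consolidation.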
