Re: [DISCUSS] 0.8.0 release and next roadmap

Hyunsik Choi Sat, 05 Apr 2014 06:57:11 -0700

Hi David,

Thank you for your nice suggestions. I totally agree with them, and we
are looking forward to the work of getting Tajo to run on a YARN
cluster. Actually, I have thought that there may be two approaches to
supporting YARN. The first is to deploy a Tajo cluster (i.e., a
long-running application) in a YARN cluster. The second is that each
query becomes an individual application in YARN. I also think that
there will be other approaches that I haven't imagined. I believe that
we will find good approaches, and the effort will be very exciting.


I agree on the significance of nested schemas and non-scalar types.
You seem to be making fast progress on this work. As you mentioned, we
could release Tajo with different levels of schema extensions.

Table partitioning is a very interesting area. As far as I know,
dynamic partitions are still a challenge in many systems. There will be
interesting problems to solve. I think that you can readily figure out
the current status of Tajo's table partitioning if you take a look at
SortBasedColPartitionStoreExec and HashBasedColPartitionStoreExec. I
hope that your exploration will be exciting.

Many thanks,
Hyunsik

On Sat, Apr 5, 2014 at 12:20 AM, David Chen <[email protected]> wrote:
> Hi Hyunsik,
>
> Thank you very much for sharing the roadmap. I am very excited for the 0.8.0 
> release and for the projects on the roadmap for future releases.
>
> I agree with Min that Tajo on YARN will be an important project. I think 
> there will be a good amount of work to not only have Tajo run on YARN but 
> also run well on a YARN cluster co-resident with other YARN applications. I 
> think through this effort, we will likely also find areas for improving 
> multi-tenancy in YARN as well since YARN is still relatively young and has 
> not been battle-tested that much yet.
>
> As you mentioned, one of the projects I would like to focus on is adding 
> support for nested schemas and non-scalar types. This way, we would be able 
> to take full advantage of columnar storage formats like Parquet, which is 
> designed to work well with nested schemas. I understand that this will be a 
> significant project, but I think it may be possible to divide up the work as 
> I have done with the sub-tasks to TAJO-710 and push out support for each type 
> incrementally across different releases.
>
> Another area that I would like to learn some more about is partitioning. I 
> have just begun to look at TAJO-283 and am still ramping up on some of the 
> context and the current status of the effort, but I am interested in 
> exploring the possibility of enabling smart dynamic partitioning based on the 
> way a table is queried but avoiding some of the current problems of dynamic 
> partitioning such as creating too many files. One possible approach that I am 
> thinking about is the possibility of building indices that point to offsets 
> within files. Anyway, this is still more of a research problem, but is one 
> that I would like to explore.
>
> Thanks,
> David
>
> On Apr 3, 2014, at 10:24 PM, Hyunsik Choi <[email protected]> wrote:
>
>> Hi folks,
>>
>> I'm very happy to see that our community is growing! Also, it's a pleasure
>> to discuss the Tajo 0.8.0 release. Recently, I've tested various features
>> in various contexts, and tried to figure out if there are any critical
>> problems. I think that there are only a few issues and we can release 0.8.0
>> next week. If there are further issues to be solved before the 0.8.0
>> release, feel free to suggest ideas.
>>
>> Also, I'd like to discuss our next roadmap. We are open to any suggestion
>> from users, contributors, and committers. Please fire away!
>>
>> I'm thinking that our next stage should focus on improving the way Tajo
>> runs on large clusters with thousands of nodes and with many concurrent
>> users. The key issues associated with this include the following:
>>
>> * High availability
>> * Multi-tenancy scheduling
>> * More stability
>> * Improved shuffle
>>
>> The current work status is as follows. Min is working on Tajo's new
>> scheduler (TAJO-540) based on Sparrow. I'll support him. As far as I know,
>> Alvin is working on TajoMaster HA (TAJO-704). Also, some guys including
>> myself are investigating and solving the issues which occur in large
>> clusters. These issues should be solved in order to make Tajo a complete,
>> enterprise-ready product.
>>
>> In addition, there are some SQL feature support issues. Many analytic
>> problems require window functions. Also, IN subqueries and scalar subqueries
>> should be supported. So, I'd like to schedule them with high priority. In
>> my view, there will be very few SQL support issues if Tajo provides these
>> features.
>>
>> Besides those areas, David is working on a nested schema and its related
>> work (TAJO-710). I guess this will take quite a while because it requires a
>> lot of hard work. So, it would be great to schedule the nested schema
>> loosely. Those are just my thoughts, anyhow.
>>
>> Aside from the discussion of our roadmap, I'd like to suggest that we need
>> to release more frequently after the 0.8.0 release. So far, there has been
>> a long period between each release because Tajo is undergoing heavy
>> development. By 'releasing early, releasing often', we will create a
>> tighter feedback loop between users and developers.
>>
>> I think that there are many additional interesting issues to be
>> included in our roadmap. Feel free to suggest your ideas. We will arrange
>> our short-term roadmap and long-term roadmap based on your suggestions.
>>
>> Thank you all so much for your contribution!
>>
>> Warm Regards,
>> Hyunsik
>
