[ https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862751#comment-13862751 ]

Jihoon Son commented on TAJO-472:
---------------------------------

Min, sorry for my misunderstanding. Thanks to your additional comments, I 
finally understand your proposal.
According to your proposal, a cached table is stored on HDFS with hash 
partitioning for reliability. Once the table is stored on HDFS, the Tajo 
master selects a number of workers to cache the partitioned table in their 
memory. Thus, the data are already pre-shuffled once the workers finish 
downloading the partitioned data from HDFS to their local disks. I think that 
this is a good prototype for the data cache.
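
Just to make sure we are on the same page, here is a minimal sketch of how I 
picture the assignment step. The names (CacheAssignment, assign, etc.) and the 
round-robin policy are only hypothetical, not actual Tajo code:

  // Hypothetical sketch: the master maps each hash partition of the cached
  // table to a worker, so the data are already "shuffled" by the partition key.
  import java.util.ArrayList;
  import java.util.List;

  public class CacheAssignment {

    // Entry i holds the worker chosen to cache partition i.
    private final List<String> partitionToWorker = new ArrayList<>();

    // Assign the hash partitions stored on HDFS to the selected workers
    // in a round-robin fashion (placeholder policy).
    public void assign(int numPartitions, List<String> selectedWorkers) {
      for (int p = 0; p < numPartitions; p++) {
        partitionToWorker.add(selectedWorkers.get(p % selectedWorkers.size()));
      }
    }

    // A query on the cached table can then be routed directly to the worker
    // that already holds the partition in memory.
    public String workerForPartition(int partitionId) {
      return partitionToWorker.get(partitionId);
    }
  }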

It is great that we build indexes on data packs. This is definitely necessary 
for partition pruning. However, we also have to consider the performance of 
the sequential scan. As you said, an index is useful only when the selectivity 
is quite low, so the sequential scan matters for the other cases. When data 
packs contain the same number of rows, their byte lengths differ according to 
their contents (or types). This means that more file opens and closes are 
required during a sequential scan when individual values are very small, e.g. 
a single byte. That definitely makes the sequential scan slower. How about 
changing the data packs to have the same byte length instead?
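
For example, under a very rough model where each data pack costs one file 
open, a 1 GB column of 1-byte values split into packs of 65,536 rows needs 
over 16,000 opens, while fixed 8 MB packs need only 128. A tiny sketch of 
that back-of-the-envelope comparison (the pack sizes are just made-up 
numbers for illustration):

  // Rough comparison of the number of data packs (and thus file opens)
  // for fixed-row-count packing vs. fixed-byte-length packing.
  public class PackCount {
    public static void main(String[] args) {
      long columnBytes = 1L << 30;      // 1 GB of column data (assumption)
      int valueSize = 1;                // 1-byte values, the worst case above
      long rowsPerPack = 65_536;        // fixed-row-count packing (assumption)
      long bytesPerPack = 8L << 20;     // fixed 8 MB packs (assumption)

      long rows = columnBytes / valueSize;
      long packsByRows = (rows + rowsPerPack - 1) / rowsPerPack;
      long packsByBytes = (columnBytes + bytesPerPack - 1) / bytesPerPack;

      System.out.println("packs with fixed row count : " + packsByRows);
      System.out.println("packs with fixed byte size : " + packsByBytes);
    }
  }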

In addition, we need to balance the data distribution to fully utilize the 
parallelism. Thus, when the Tajo master selects workers, it should take the 
data distribution into account as well as the workers' remaining resources.
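
Something like the following scoring rule is what I have in mind; the names 
(WorkerStatus, score) and the weighting are purely illustrative, not a 
concrete proposal for the implementation:

  // Illustrative scoring: prefer workers with more free memory and fewer
  // cached partitions, so the partitions spread evenly across the cluster.
  public class WorkerSelector {

    public static class WorkerStatus {
      final String host;
      final long freeMemoryBytes;   // remaining resources
      final int cachedPartitions;   // current share of the cached data

      public WorkerStatus(String host, long freeMemoryBytes, int cachedPartitions) {
        this.host = host;
        this.freeMemoryBytes = freeMemoryBytes;
        this.cachedPartitions = cachedPartitions;
      }
    }

    // Higher score = better candidate for the next partition.
    public static double score(WorkerStatus w) {
      double memoryTerm = w.freeMemoryBytes / (double) (1L << 30); // in GB
      double balanceTerm = 1.0 / (1 + w.cachedPartitions);
      return memoryTerm * balanceTerm;
    }
  }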

Thanks,
Jihoon

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>         Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database 
> for on-line businesses in the Alibaba group. That's an internal project, 
> which can do group-by aggregation on billions of rows in less than 1 second. 
> I'd like to apply this technology to Tajo and make it much faster than it 
> is. From some benchmarks, we believe that Spark and Shark are currently the 
> fastest solutions among all the open source interactive query systems, such 
> as Impala, Presto, and Tajo. The main reason is that they benefit from 
> in-memory data. 
> I will take the memory cached table as my first step to accelerate the 
> query speed of Tajo. Actually, this is the reason why I was looking at 
> table partitioning during the Xmas and New Year holidays. 
> Will submit a proposal soon.
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
