[ 
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13861144#comment-13861144
 ] 

Min Zhou commented on TAJO-472:
-------------------------------

Hi Jihoon Son,

If you create a table with a "cache" postfix on the table name, Shark will load 
the table into an RDD object in memory. I don't think Spark can cache 
intermediate data. 

I don't know what you mean by intermediate data.  Is it a temporary table 
or the shuffle data?  If we don't support sub-queries, I think intermediate 
data is not very common. Reusing a cached table is straightforward. The problem 
is how to determine that a subquery does the same job as another query or its 
subquery. Actually, I did a job on Hive at my previous company much like cached 
tables :)  We stored the intermediate tables produced by the subqueries of a SQL 
query and computed an md5sum over each subquery's serialized plan; after that, 
the md5 value was stored in the metadata. The subqueries of a subsequent query 
were hashed the same way, and if one of those md5 values matched a value in the 
metadata, we simply loaded the intermediate data without recomputing it.
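
The md5 matching scheme described above can be sketched roughly as follows. 
This is only an illustration, not the Hive code from that job: the plan is 
assumed to be a nested dict, JSON with sorted keys stands in for the real plan 
serialization, and the names plan_fingerprint, IntermediateCache and 
get_or_compute are hypothetical.

```python
import hashlib
import json


def plan_fingerprint(plan):
    """Return the md5 hex digest of a canonical serialization of a plan.

    Assumption: the plan is a JSON-serializable nested dict. Sorting keys
    makes the serialization independent of dict insertion order.
    """
    serialized = json.dumps(plan, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(serialized.encode("utf-8")).hexdigest()


class IntermediateCache:
    """Maps plan fingerprints to the location of stored intermediate data."""

    def __init__(self):
        # Stand-in for the metadata store: md5 -> intermediate table name.
        self._metadata = {}

    def get_or_compute(self, plan, compute):
        """Reuse stored intermediate data if the plan's md5 is known."""
        key = plan_fingerprint(plan)
        if key in self._metadata:
            return self._metadata[key]
        # No match: run the subquery and record where its output was stored.
        location = compute(plan)
        self._metadata[key] = location
        return location


if __name__ == "__main__":
    cache = IntermediateCache()
    subquery = {"op": "groupby", "keys": ["c1"],
                "child": {"op": "scan", "table": "t"}}
    first = cache.get_or_compute(subquery, lambda p: "tmp_table_1")
    second = cache.get_or_compute(subquery, lambda p: "tmp_table_2")
    print(first, second)
```

Note that this only matches subqueries whose serialized plans are identical. 
Two plans that do the same job but are structured differently produce 
different md5 values, so the plans would need to be normalized before hashing.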

Currently, I am thinking about caching tables manually, the way Shark does. 
Holding a histogram of the data is a good approach; however, it needs some 
effort. AFAIK, there is usually a role who can decide which tables should 
be cached. This role would be the cluster administrator or the data warehouse 
architect.  Sometimes this works better in the real world than automatic 
caching, because automation can't guarantee the SLA.  Ideally, I'd like to 
support both ways and make it an option.  

   

> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
>                 Key: TAJO-472
>                 URL: https://issues.apache.org/jira/browse/TAJO-472
>             Project: Tajo
>          Issue Type: New Feature
>          Components: distributed query plan, physical operator
>            Reporter: Min Zhou
>            Assignee: Min Zhou
>
> Previously, I was involved as a technical expert in an in-memory database 
> for online businesses at Alibaba Group. That's an internal project, which 
> can do group-by aggregation on billions of rows in less than 1 second.  
> I'd like to apply this technology to Tajo to make it much faster than it is. 
> From some benchmarks, we believe that Spark & Shark is currently the fastest 
> solution among all the open source interactive query systems, such as Impala, 
> Presto, and Tajo.  The main reason is that it benefits from in-memory data. 
> I will take memory cached tables as my first step to accelerate the query 
> speed of Tajo. Actually, this is the reason why I was concerned with table 
> partitioning during the Xmas and New Year holidays. 
> Will submit a proposal soon.
>   



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
