[
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862447#comment-13862447
]
Min Zhou commented on TAJO-472:
-------------------------------
Hi Jihoon,
Thanks for the comment. I meant a push model, not a pull model; the latter, I
believe, Tajo has already implemented.
Regarding your first and second questions, I'm sorry I didn't describe it
clearly. HDFS is only for fault tolerance in my design, since it keeps 3
replicas of each file by default. After the columnar data is written into HDFS,
the Tajo master will choose workers to download one copy of that data onto
their local disks. Subsequent queries will be executed on those local blocks,
or on their in-memory mappings, rather than fetching the data remotely from
HDFS. If one of those nodes fails, we can choose another node to download the
data from HDFS.
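A minimal sketch in Java of how that assignment and failover could work is
below. It is only an illustration of the behavior described above, not existing
Tajo code, and every name in it (CacheAssignmentSketch, blockToWorker,
onWorkerFailed, and so on) is hypothetical.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of the proposed caching model, not existing Tajo code.
    public class CacheAssignmentSketch {
      // which worker holds the local copy of each HDFS block
      private final Map<String, String> blockToWorker = new HashMap<>();

      // the master assigns every HDFS block to one worker, which downloads it to local disk
      public void assign(List<String> hdfsBlocks, List<String> workers) {
        for (int i = 0; i < hdfsBlocks.size(); i++) {
          blockToWorker.put(hdfsBlocks.get(i), workers.get(i % workers.size()));
        }
      }

      // if a worker fails, its blocks are simply re-assigned; the replacement
      // re-downloads them from HDFS, whose 3 replicas make that safe
      public void onWorkerFailed(String failed, String replacement) {
        blockToWorker.replaceAll((block, worker) -> worker.equals(failed) ? replacement : worker);
      }
    }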
For the third question: this is in preparation for indexes. Suppose we build an
index, say an inverted index, on those data packs. When we run a query like
select col1 from tbl where col2 = 10 and col3 < 6, we can use the index to
evaluate the filter (col2 = 10 and col3 < 6) much faster than a brute-force
scan if the selectivity is low, for example 0.1%. The index returns the rowids
of the rows that satisfy the filter condition, and we can skip a block's I/O
entirely if *rowid / number of rows per block* never equals that block's
number. Keeping the same number of rows in every block, which makes it easy to
avoid unnecessary I/O this way, is commonly used in Infobright beyond the
scenario I described above. Another reason is that an equal number of rows per
block makes it more efficient to read the tuples belonging to the same row
across columns in a columnar database.
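To make the block-skipping arithmetic concrete, here is a small Java example;
rowsPerBlock and matchingRowIds are assumed, illustrative names and values
rather than anything from Tajo.

    import java.util.Set;
    import java.util.TreeSet;

    public class BlockSkippingExample {
      public static void main(String[] args) {
        long rowsPerBlock = 65536L;                              // assumed fixed rows per block
        long[] matchingRowIds = {10L, 70000L, 70001L, 200000L};  // rowids returned by the index

        // a rowid falls into block (rowid / rowsPerBlock); every other block is skipped
        Set<Long> blocksToRead = new TreeSet<>();
        for (long rowId : matchingRowIds) {
          blocksToRead.add(rowId / rowsPerBlock);
        }
        System.out.println("Blocks needing I/O: " + blocksToRead);  // prints [0, 1, 3]
      }
    }

With equal-sized blocks this division alone maps a rowid to its block; with
variable-sized blocks an extra lookup structure would be needed, which is one
reason the fixed row count per block matters.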
As for the fourth one: whether a worker is idler than the others is determined
by resource management.
Thanks,
Min
> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
> Key: TAJO-472
> URL: https://issues.apache.org/jira/browse/TAJO-472
> Project: Tajo
> Issue Type: New Feature
> Components: distributed query plan, physical operator
> Reporter: Min Zhou
> Assignee: Min Zhou
> Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database
> for on-line businesses at Alibaba Group. That is an internal project, which
> can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is.
> From some benchmarks, we believe that Spark/Shark is currently the fastest
> solution among the open source interactive query systems, such as Impala,
> Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take the memory cached table as my first step to accelerate Tajo's
> query speed. Actually, this is the reason why I looked into table partitioning
> during the Christmas and New Year holidays.
> Will submit a proposal soon.
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)