[
https://issues.apache.org/jira/browse/TAJO-472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862447#comment-13862447
]
Min Zhou commented on TAJO-472:
-------------------------------
Hi Jihoon,
Thanks for the comment. I meant a push model, not a pull model; the latter, I
believe, Tajo has already implemented.
Regarding your first and second questions, I'm sorry I didn't describe it
clearly. HDFS is only for fault tolerance in my design, since it keeps 3
replicas of each file by default. After the columnar data is written into HDFS,
the Tajo master will choose workers to download one copy of that data onto
their local disks. Subsequent queries will be executed on those local blocks,
or on their in-memory mappings, rather than fetching the data remotely from
HDFS. If one of those nodes fails, we can choose another node to download the
data from HDFS.
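A minimal sketch in Java of how that assignment and failover could work is
below. It is only an illustration of the behavior described above, not existing
Tajo code, and every name in it (CacheAssignmentSketch, blockToWorker,
onWorkerFailed, and so on) is hypothetical.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of the proposed caching model, not existing Tajo code.
    public class CacheAssignmentSketch {
      // which worker holds the local copy of each HDFS block
      private final Map<String, String> blockToWorker = new HashMap<>();

      // the master assigns every HDFS block to one worker, which downloads it to local disk
      public void assign(List<String> hdfsBlocks, List<String> workers) {
        for (int i = 0; i < hdfsBlocks.size(); i++) {
          blockToWorker.put(hdfsBlocks.get(i), workers.get(i % workers.size()));
        }
      }

      // if a worker fails, its blocks are simply re-assigned; the replacement
      // re-downloads them from HDFS, whose 3 replicas make that safe
      public void onWorkerFailed(String failed, String replacement) {
        blockToWorker.replaceAll((block, worker) -> worker.equals(failed) ? replacement : worker);
      }
    }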
For the third question: this is in preparation for indexes. Suppose we build an
index, say an inverted index, on those data packs. When we run a query like
select col1 from tbl where col2 = 10 and col3 < 6, we can use the index to
evaluate the filter (col2 = 10 and col3 < 6) much faster than a brute-force
scan if the selectivity is low, for example 0.1%. The index returns the rowids
of the rows that satisfy the filter condition, and we can skip a block's I/O
entirely if *rowid / number of rows per block* never equals that block's
number. Keeping the same number of rows in every block, which makes it easy to
avoid unnecessary I/O this way, is commonly used in Infobright beyond the
scenario I described above. Another reason is that an equal number of rows per
block makes it more efficient to read the tuples belonging to the same row
across columns in a columnar database.
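To make the block-skipping arithmetic concrete, here is a small Java example;
rowsPerBlock and matchingRowIds are assumed, illustrative names and values
rather than anything from Tajo.

    import java.util.Set;
    import java.util.TreeSet;

    public class BlockSkippingExample {
      public static void main(String[] args) {
        long rowsPerBlock = 65536L;                              // assumed fixed rows per block
        long[] matchingRowIds = {10L, 70000L, 70001L, 200000L};  // rowids returned by the index

        // a rowid falls into block (rowid / rowsPerBlock); every other block is skipped
        Set<Long> blocksToRead = new TreeSet<>();
        for (long rowId : matchingRowIds) {
          blocksToRead.add(rowId / rowsPerBlock);
        }
        System.out.println("Blocks needing I/O: " + blocksToRead);  // prints [0, 1, 3]
      }
    }

With equal-sized blocks this division alone maps a rowid to its block; with
variable-sized blocks an extra lookup structure would be needed, which is one
reason the fixed row count per block matters.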
As for the fourth one: whether a worker is idler than the others is determined
by resource management.
Thanks,
Min
> Umbrella ticket for accelerating query speed through memory cached table
> ------------------------------------------------------------------------
>
> Key: TAJO-472
> URL: https://issues.apache.org/jira/browse/TAJO-472
> Project: Tajo
> Issue Type: New Feature
> Components: distributed query plan, physical operator
> Reporter: Min Zhou
> Assignee: Min Zhou
> Attachments: TAJO-472 Proposal.pdf
>
>
> Previously, I was involved as a technical expert in an in-memory database
> for on-line businesses at Alibaba Group. That is an internal project, which
> can do group-by aggregation on billions of rows in less than 1 second.
> I'd like to apply this technology to Tajo and make it much faster than it is.
> From some benchmarks, we believe that Spark/Shark is currently the fastest
> solution among the open source interactive query systems, such as Impala,
> Presto, and Tajo. The main reason is that it benefits from in-memory data.
> I will take the memory cached table as my first step to accelerate Tajo's
> query speed. Actually, this is the reason why I looked into table partitioning
> during the Christmas and New Year holidays.
> Will submit a proposal soon.
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)