[
https://issues.apache.org/jira/browse/TAJO-475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13862449#comment-13862449
]
Min Zhou edited comment on TAJO-475 at 1/5/14 12:28 AM:
--------------------------------------------------------
Yes, you correctly understood my meaning. For the first step, we can build a
pure in-memory data structure. That's quite fast and straightforward. However,
as a long-term goal we should think about the following aspects.
1. Memory is expensive. As the proverb goes, use our limited funds where they
can be put to best use. We can cut down the footprint through compression, an
LRU cache, and multi-layer storage: memory -> SSD -> hard disk. (A minimal LRU
sketch appears after this list.)
2. I am not familiar with the YARN mode of Tajo. To my knowledge, the worker
is spawned by the NodeManager on demand. Since the workers don't stay up
permanently, they can't keep data in memory for sharing with subsequent
queries. A solution is to run this cache manager as an auxiliary service in
the NodeManager, like the shuffle service in Hadoop MapReduce. (A skeleton of
such a service is sketched after this list.)
3. If we embed a cache service in the NodeManager, we should think about how
the workers can share the cached data from that service. That is inter-process
sharing. The most efficient way to achieve it on a Unix system is mmap; see
http://en.wikipedia.org/wiki/Mmap . In Java, FileChannel can map a file into
memory, and its backend is mmap. Better still, if we use that kind of direct
memory rather than allocating from the heap, we get zero GC overhead. See
https://gist.github.com/coderplay/8262536 for an example, and the mmap sketch
after this list.
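For point 1, here is a minimal sketch of a size-bounded LRU cache built on
java.util.LinkedHashMap in access order. The class and key names
(PartitionBlockCache, block id strings) are illustrative assumptions, not
existing Tajo code; a real multi-layer design would spill evicted entries to
SSD/disk instead of dropping them.
{code:java}
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical LRU cache for hot partition blocks, keyed by block id.
public class PartitionBlockCache extends LinkedHashMap<String, ByteBuffer> {
  private final int maxEntries;

  public PartitionBlockCache(int maxEntries) {
    super(16, 0.75f, true);          // accessOrder = true gives LRU ordering
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<String, ByteBuffer> eldest) {
    // When the cache is over capacity, the eldest (least recently used) entry
    // is evicted; a multi-layer design could push it down to SSD/disk here.
    return size() > maxEntries;
  }
}
{code}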
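For point 2, this is a rough skeleton of what a cache manager could look like
as a NodeManager auxiliary service, the same mechanism ShuffleHandler uses for
MapReduce. It is only a sketch under that assumption; the class name
TajoCacheService, the service name "tajo_cache", and the body comments are
hypothetical, not an existing Tajo or YARN component.
{code:java}
import java.nio.ByteBuffer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.server.api.ApplicationInitializationContext;
import org.apache.hadoop.yarn.server.api.ApplicationTerminationContext;
import org.apache.hadoop.yarn.server.api.AuxiliaryService;

// Hypothetical cache manager running inside the NodeManager as an aux service.
public class TajoCacheService extends AuxiliaryService {

  public TajoCacheService() {
    super("tajo_cache");  // service name the workers would look up
  }

  @Override
  protected void serviceInit(Configuration conf) throws Exception {
    // allocate or open the shared cache storage (files to be mmap-ed) here
    super.serviceInit(conf);
  }

  @Override
  public void initializeApplication(ApplicationInitializationContext context) {
    // register the query/application with the cache manager
  }

  @Override
  public void stopApplication(ApplicationTerminationContext context) {
    // release cache entries pinned by the finished application, if any
  }

  @Override
  public ByteBuffer getMetaData() {
    // metadata handed to containers, e.g. the path of the mmap-able cache file
    return ByteBuffer.allocate(0);
  }
}
{code}
Such a service would be wired in via yarn.nodemanager.aux-services in
yarn-site.xml, the same way mapreduce_shuffle is registered.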
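For point 3, a minimal sketch of a worker reading a cache file shared via
mmap. FileChannel.map() uses mmap underneath, and the resulting
MappedByteBuffer is direct (off-heap) memory, so reading it adds no GC
pressure. The file path below is a hypothetical placeholder.
{code:java}
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

// Hypothetical worker-side read of a cache file published by the NodeManager
// cache service. Processes mapping the same file share the same physical
// pages, which gives zero-copy inter-process sharing.
public class MmapReadExample {
  public static void main(String[] args) throws IOException {
    try (FileChannel channel = FileChannel.open(
        Paths.get("/tmp/tajo-cache/partition-0.bin"),   // illustrative path
        StandardOpenOption.READ)) {
      // Map the whole file read-only into off-heap memory.
      MappedByteBuffer buf =
          channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
      if (buf.remaining() >= 8) {
        // Read directly from the mapped region; no heap copy is made.
        System.out.println("first long in mapped region: " + buf.getLong(0));
      }
    }
  }
}
{code}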
Thanks for the feedback, really encouraging.
Min
> Table partition catalog recap
> -----------------------------
>
> Key: TAJO-475
> URL: https://issues.apache.org/jira/browse/TAJO-475
> Project: Tajo
> Issue Type: Sub-task
> Components: catalog
> Reporter: Min Zhou
> Assignee: Min Zhou
>
> Query master needs to know where the partitions of a memory-cached table are located.
> At least we need a meta table containing such information:
> |partition_id|
> |partition_value|
> |ordinal_position|
> |locations|
> Any suggestion?
>
--
This message was sent by Atlassian JIRA
(v6.1.5#6160)