On Mon, Feb 23, 2009 at 3:56 PM, pi song <pi.so...@gmail.com> wrote:

> I think the point that you can access more system cache is right, but that
> doesn't mean it will be more efficient than reading from your local disk.
> Take Hadoop, for example: a request for file content first has to go to the
> NameNode (the file-chunk indexing service), and only then do you ask the
> DataNode, which actually serves the data. Assuming you're working on a
> large dataset, the probability that the chunk you need is already sitting
> in some system cache is very low, so most of the time you end up reading
> from a remote disk anyway.
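>
> To make that round trip concrete, here is a minimal sketch of the client
> side using the libhdfs C API (hdfs.h); the path, buffer size, and missing
> error handling are all just for illustration:
>
>   #include <fcntl.h>
>   #include <stdio.h>
>   #include <hdfs.h>   /* libhdfs */
>
>   int main(void)
>   {
>       char buf[8192];
>
>       /* "default" means: use the NameNode named in the Hadoop config.
>        * Every open first asks the NameNode where the blocks live; the
>        * bytes are then streamed from whichever DataNode holds them,
>        * which on a big cluster is usually a remote machine. */
>       hdfsFS fs = hdfsConnect("default", 0);
>       hdfsFile f = hdfsOpenFile(fs, "/data/some_segment", O_RDONLY, 0, 0, 0);
>
>       tSize n = hdfsRead(fs, f, buf, (tSize) sizeof(buf));
>       printf("read %d bytes\n", (int) n);
>
>       hdfsCloseFile(fs, f);
>       hdfsDisconnect(fs);
>       return 0;
>   }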
>
> I've got a better idea: how about making the buffer pool multilevel? The
> first level is the current shared-buffer pool; the second level is memory
> borrowed from remote machines. Blocks that are used less often get demoted
> to the second level. Has anyone thought about something like this before?
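>
> In pseudo-C, the lookup order I have in mind is roughly the following
> (every name below is made up; it only shows where the second level sits,
> not how the remote transport would work):
>
>   #include <stdbool.h>
>   #include <string.h>
>
>   #define BLCKSZ 8192
>
>   typedef struct { unsigned rel; unsigned blocknum; } BufferTag;
>
>   /* Level 1: the existing local shared-buffer pool. */
>   static char *level1_lookup(BufferTag tag) { (void) tag; return NULL; }
>   static void  level1_insert(BufferTag tag, const char *src)
>   { (void) tag; (void) src; }
>
>   /* Level 2: blocks parked in another machine's spare memory. */
>   static bool  level2_fetch(BufferTag tag, char *dst)
>   { (void) tag; (void) dst; return false; }
>
>   /* Fallback: the usual read from local disk. */
>   static void  disk_read(BufferTag tag, char *dst)
>   { (void) tag; memset(dst, 0, BLCKSZ); }
>
>   /* ReadBuffer-like entry point: local memory, then remote memory,
>    * and only then the disk.  Eviction from level 1 would demote the
>    * victim page to level 2 instead of just dropping it. */
>   static void read_block(BufferTag tag, char *dst)
>   {
>       char *hit = level1_lookup(tag);
>       if (hit != NULL)
>       {
>           memcpy(dst, hit, BLCKSZ);
>           return;
>       }
>       if (level2_fetch(tag, dst))      /* network hop, no disk seek */
>       {
>           level1_insert(tag, dst);     /* promote on use */
>           return;
>       }
>       disk_read(tag, dst);             /* slowest path */
>       level1_insert(tag, dst);
>   }
>
>   int main(void)
>   {
>       char page[BLCKSZ];
>       BufferTag tag = { 1234, 0 };
>       read_block(tag, page);
>       return 0;
>   }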
>
> Pi Song
>
> On Mon, Feb 23, 2009 at 1:09 PM, Robert Haas <robertmh...@gmail.com>wrote:
>
>> On Sun, Feb 22, 2009 at 5:18 PM, pi song <pi.so...@gmail.com> wrote:
>> > One more problem is that data placement on HDFS is decided internally,
>> > meaning you have no explicit control over it. Thus you cannot place two
>> > sets of data that are likely to be joined together on the same node,
>> > which means uncontrollable latency during query processing.
>> > Pi Song
>>
>> It would only be possible to have the actual PostgreSQL backends
>> running on a single node anyway, because they use shared memory to
>> hold lock tables and things.  The advantage of a distributed file
>> system would be that you could access more storage (and more system
>> buffer cache) than would be possible on a single system (or perhaps
>> the same amount but at less cost).  Assuming some sort of
>> per-tablespace control over the storage manager, you could put your
>> most frequently accessed data locally and the less frequently accessed
>> data into the DFS.
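>>
>> Purely as a sketch of what "per-tablespace control over the storage
>> manager" could look like (none of these names match the real smgr API,
>> and the DFS tablespace OID and paths are invented):
>>
>>   #include <stdio.h>
>>
>>   typedef enum { STORAGE_LOCAL, STORAGE_DFS } StorageKind;
>>
>>   typedef struct
>>   {
>>       const char *name;
>>       void      (*read_block)(const char *relpath, unsigned blkno,
>>                               char *dst);
>>   } StorageManager;
>>
>>   static void local_read(const char *relpath, unsigned blkno, char *dst)
>>   {
>>       (void) dst;
>>       printf("local read: %s block %u\n", relpath, blkno);
>>   }
>>
>>   static void dfs_read(const char *relpath, unsigned blkno, char *dst)
>>   {
>>       (void) dst;
>>       printf("DFS read:   %s block %u (e.g. via libhdfs)\n", relpath, blkno);
>>   }
>>
>>   static const StorageManager managers[] = {
>>       [STORAGE_LOCAL] = { "local", local_read },
>>       [STORAGE_DFS]   = { "dfs",   dfs_read  },
>>   };
>>
>>   /* In the real thing this flag would live in the tablespace catalog. */
>>   static StorageKind kind_for_tablespace(unsigned tablespace_oid)
>>   {
>>       return (tablespace_oid == 17000) ? STORAGE_DFS : STORAGE_LOCAL;
>>   }
>>
>>   int main(void)
>>   {
>>       char page[8192];
>>
>>       /* hot table in a local tablespace, archival table in a DFS one */
>>       managers[kind_for_tablespace(1663)]
>>           .read_block("base/1663/16384", 0, page);
>>       managers[kind_for_tablespace(17000)]
>>           .read_block("dfs/17000/16999", 0, page);
>>       return 0;
>>   }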
>>
>> But you'd still have to pull all the data back to the master node to
>> do anything with it.  Being able to actually distribute the
>> computation would be a much harder problem.  Currently, we don't even
>> have the ability to bring multiple CPUs to bear on (for example) a
>> large sequential scan (even though all the data is on a single node).
>>
>> ...Robert
>>
>
>
