[ https://issues.apache.org/jira/browse/PIG-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590483#action_12590483 ]
Pi Song commented on PIG-209:
-----------------------------
The first question is how we maintain indexes when the data changes. Pig is not
a DBMS but more of a processing engine. Our data is stored as files on HDFS
(so logically just a file system), and currently there is no mechanism to
notify Pig when new data comes in. So maintaining indexes for ever-changing
data looks difficult.
One possible approach for data that only grows (append-only) is to keep track
of what we've already indexed and index only the new data separately (of
course, we have to maintain the semantics that indexed data cannot be changed).
That brings up another problem: following pointers from index entries to the
actual records means random access on the file system, which seems inefficient
because we cannot exploit locality. Still, if we consider only the case of
frequent joins of a small table against a huge table, an index might reduce the
overall workload by letting us skip fragments of the huge table that are
unlikely to contain matches.
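To make the "index only new data" idea concrete, here is a rough sketch in plain Python rather than anything Pig-specific. Everything in it is an assumption for illustration: the class name AppendOnlyIndex, the tab-separated record layout, and the use of a local file standing in for an HDFS file. It remembers how many bytes it has already indexed, scans only what was appended since then, and shows that each lookup costs one seek per matching record.

```python
from collections import defaultdict


class AppendOnlyIndex:
    """Hypothetical key -> byte-offset index over an append-only TSV file."""

    def __init__(self, path, key_field=0):
        self.path = path
        self.key_field = key_field
        self.offsets = defaultdict(list)  # key -> offsets of matching records
        self.indexed_up_to = 0            # bytes already covered by the index

    def refresh(self):
        """Index only the records appended since the previous refresh."""
        added = 0
        with open(self.path, "rb") as f:
            f.seek(self.indexed_up_to)
            while True:
                start = f.tell()
                line = f.readline()
                if not line.endswith(b"\n"):
                    break  # end of file, or a record still being written
                key = line.rstrip(b"\n").split(b"\t")[self.key_field]
                self.offsets[key].append(start)
                self.indexed_up_to = f.tell()
                added += 1
        return added

    def lookup(self, key):
        """Fetch matching records; each one costs a seek (random access)."""
        records = []
        with open(self.path, "rb") as f:
            for offset in self.offsets.get(key, []):
                f.seek(offset)
                records.append(f.readline().rstrip(b"\n").split(b"\t"))
        return records
```

Note that this only works if already-indexed bytes never change, which is exactly the restriction mentioned above, and the seek per record in lookup() is where the locality concern shows up.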
Another way: if the small table is really small, we could just ship it to the
Map nodes and do the join locally. This way we don't need to maintain indexes.
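As a rough sketch of that local join, again in plain Python and not Pig or Hadoop code: the function names build_lookup and map_side_join and the tuple-based rows are made up for illustration. The small table is hashed by join key once, and each Map task then streams its own split of the huge table against that in-memory table.

```python
from collections import defaultdict


def build_lookup(small_table, key_field=0):
    """Hash the small table by join key; each Map node would hold a copy."""
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key_field]].append(row)
    return lookup


def map_side_join(huge_split, lookup, key_field=0):
    """Inner-join one split of the huge table against the in-memory table."""
    for row in huge_split:
        # Rows with no match in the small table are simply dropped.
        for match in lookup.get(row[key_field], ()):
            yield tuple(row) + tuple(match)
```

The huge table is still scanned in full, but sequentially and with no shuffle, and it only works as long as the small table fits in memory on every Map node.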
This is an interesting topic. Let's think and discuss more! Also, welcome to
our Pig ommunitycay :)
> Indexes for accelerating joins
> ------------------------------
>
> Key: PIG-209
> URL: https://issues.apache.org/jira/browse/PIG-209
> Project: Pig
> Issue Type: New Feature
> Components: data
> Reporter: John DeTreville
>
> Computing the inner join of a very large table (i.e., bag or mapping) with a
> smaller table can take time proportional to the size of the very large table.
> This time required can be greatly reduced if the very large table is indexed,
> taking time proportional to the size of the smaller table. It should be
> possible for clients to index tables for use by future joins.