Wouldn't support for HBase in Pig solve this, IMO, without users needing to
write UDFs?
- Mridul
Pi Song (JIRA) wrote:
[ https://issues.apache.org/jira/browse/PIG-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590710#action_12590710 ]
Pi Song commented on PIG-210:
-----------------------------
I started looking at this a while ago, before I switched to something else, so
I want to share my thoughts with you.
As I said in the other post, Pig is not a DBMS, so it doesn't handle changes in
data well. For data mining purposes a column store is still useful, because in
some use cases you just want to take a snapshot of the data and then explore it
along different dimensions. Having a column-based file store would really help
reduce the amount of data that has to be read in the process.
One way to go is to implement this somewhere around
LOAD/STORE/PigInput/PigOutput. Your data will usually start out in some form
that is not column based, so the first step is to take the source data files
and use Pig to convert them into column-based files (using a special
PigOutput). Later, we can selectively read only the columns we need through a
special PigInput and feed them to a data mining operator (in the future we
should at least have CUBE; none of these operators exists at the moment).
LOLoad has to be changed a bit to let you select only the columns you need.
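To make the two steps above concrete, here is a minimal sketch in plain Python.
It is not Pig code and does not use PigInput/PigOutput. The function names
(write_column_files, read_columns), the one-file-per-column layout and the
".col" suffix are all assumptions made up for illustration. It also assumes
values contain no newlines, and it reads every value back as a string.

```python
import os
import tempfile


def write_column_files(rows, schema, out_dir):
    """Split row-ordered records into one text file per column (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    handles = [open(os.path.join(out_dir, name + ".col"), "w") for name in schema]
    try:
        for row in rows:
            # The i-th value of every row goes to the i-th column file.
            for handle, value in zip(handles, row):
                handle.write(str(value) + "\n")
    finally:
        for handle in handles:
            handle.close()


def read_columns(out_dir, wanted):
    """Open only the requested column files and zip them back into tuples."""
    columns = []
    for name in wanted:
        with open(os.path.join(out_dir, name + ".col")) as f:
            columns.append([line.rstrip("\n") for line in f])
    return list(zip(*columns))


# Step 1: convert row-based source data into column-based files.
rows = [("alice", 34, "US"), ("bob", 27, "DE"), ("carol", 45, "US")]
schema = ["name", "age", "country"]
out_dir = tempfile.mkdtemp()
write_column_files(rows, schema, out_dir)

# Step 2: read back only the columns we need; age.col is never opened.
projected = read_columns(out_dir, ["name", "country"])
print(projected)
```

The point of the layout is that a query touching two of the three columns
never reads the third file, which is where the reduction in data comes from.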
I got stuck earlier because of the way Hadoop generates output filenames. I
didn't spend much time exploring it at the time, but I believe there is a way
around it. If anyone is interested, please join the discussion. I should have
more free time next month and will get back to this then.
Column store
------------
Key: PIG-210
URL: https://issues.apache.org/jira/browse/PIG-210
Project: Pig
Issue Type: New Feature
Components: data
Reporter: John DeTreville
I believe that Pig stores its tables in row order, which is less efficient in
space and time than column order in a data-mining system. Column stores can be
more highly compressed, and can be read and written faster. It should be
possible for clients to store their tables in column order.