Wouldn't HBase support in Pig solve this, IMO, without users needing to write UDFs?

- Mridul

Pi Song (JIRA) wrote:
[ https://issues.apache.org/jira/browse/PIG-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590710#action_12590710 ]
Pi Song commented on PIG-210:
-----------------------------

I started looking into this a while ago before switching to something else, so 
I want to share my thoughts with you.

As I said in the other post, Pig is not a DBMS, so it doesn't handle changes 
in data well. For data-mining purposes it is still useful, because in some use 
cases you just want to take a snapshot of the data and then explore it along 
different dimensions. A column-based file store would really help reduce the 
amount of data read in the process.

One way to go is to implement this somewhere around 
LOAD/STORE/PigInput/PigOutput. Your source data will mostly be in some form 
that is not column-based, so the first step is to take the source data files 
and use Pig to convert them to column-based files (using a special PigOutput). 
Later, we can selectively read only the columns we need through a special 
PigInput and feed them to a data-mining operator (well, at least in the future 
we must have CUBE; none of these exist at the moment). LOLoad has to be changed 
a bit to allow you to select only the columns you need.
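
To make the two steps above concrete, here is a rough sketch in plain Python 
rather than Pig. It is not Pig's API and not a proposed implementation: the 
function names, the one-file-per-column layout, and the ".col" suffix are all 
assumptions made up for illustration. It also assumes values contain no 
newlines and reads everything back as strings.

```python
import os
from contextlib import ExitStack


def store_as_columns(rows, column_names, out_dir):
    """Write row-oriented tuples as one file per column (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    with ExitStack() as stack:
        # One open file per column; ExitStack closes them all on exit.
        handles = [
            stack.enter_context(open(os.path.join(out_dir, name + ".col"), "w"))
            for name in column_names
        ]
        for row in rows:
            # The i-th value of every row goes to the i-th column file.
            for handle, value in zip(handles, row):
                handle.write(str(value) + "\n")


def load_columns(in_dir, wanted):
    """Read only the requested column files and zip them back into tuples."""
    columns = []
    for name in wanted:
        with open(os.path.join(in_dir, name + ".col")) as f:
            columns.append([line.rstrip("\n") for line in f])
    # Unrequested column files are never opened, which is the point.
    return list(zip(*columns))
```

The loader touches only the files for the columns it was asked for, which is 
roughly what a column-selecting LOAD would need to do.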

I got stuck before because of the way Hadoop generates output filenames. At the 
time, I didn't spend much time exploring, but I believe there is a way around 
it. If anyone is interested, please join the discussion. I should have more 
free time next month and will get back to this then.


Column store
------------

                Key: PIG-210
                URL: https://issues.apache.org/jira/browse/PIG-210
            Project: Pig
         Issue Type: New Feature
         Components: data
           Reporter: John DeTreville

I believe that Pig stores its tables in row order, which is less efficient in 
space and time than column order in a data-mining system. Column stores can be 
more highly compressed, and can be read and written faster. It should be 
possible for clients to store their tables in column order.

