[ 
https://issues.apache.org/jira/browse/PIG-210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12590710#action_12590710
 ] 

Pi Song commented on PIG-210:
-----------------------------

I started looking at this a while ago before switching to something else, so I 
want to share my thoughts.

As I said in the other post, Pig is not a DBMS, so it doesn't handle changes in 
data well. For data mining, though, there are still use cases where you just 
want to take a snapshot of the data and explore it along different dimensions. 
A column-based file store would help there by reducing the amount of data 
that has to be read during processing.

One way to go is to implement this somewhere around 
LOAD/STORE/PigInput/PigOutput. Your data will usually start out in some form 
that is not column based, so the first step is to take the source data files 
and use Pig to convert them to column-based files (using a special PigOutput). 
Later, we can selectively read only the columns we need through a special 
PigInput and feed them to a data mining operator (in the future we should at 
least have CUBE; no such operator exists at the moment). LOLoad has to be 
changed a bit to let you select only the columns you need.
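
To make the two steps concrete, here is a rough sketch in plain Python. It is 
not Pig code and does not use any Pig or Hadoop API. The function names 
write_columns and read_columns, the one-file-per-column layout, and the ".col" 
extension are all made up for illustration; a real version would live in the 
special PigOutput/PigInput mentioned above.

```python
import csv
import os
import tempfile


def write_columns(rows, column_names, out_dir):
    """Split row-oriented records into one file per column (hypothetical layout)."""
    os.makedirs(out_dir, exist_ok=True)
    files = {
        name: open(os.path.join(out_dir, name + ".col"), "w", newline="")
        for name in column_names
    }
    try:
        writers = {name: csv.writer(f) for name, f in files.items()}
        for row in rows:
            # Each value goes to its own column file, one value per line,
            # so line i of every file belongs to row i.
            for name, value in zip(column_names, row):
                writers[name].writerow([value])
    finally:
        for f in files.values():
            f.close()


def read_columns(out_dir, wanted):
    """Yield tuples containing only the requested columns, in row order."""
    files = [
        open(os.path.join(out_dir, name + ".col"), newline="") for name in wanted
    ]
    try:
        readers = [csv.reader(f) for f in files]
        # Only the files for the wanted columns are opened and scanned.
        for parts in zip(*readers):
            yield tuple(p[0] for p in parts)
    finally:
        for f in files:
            f.close()


if __name__ == "__main__":
    rows = [("alice", "30", "nyc"), ("bob", "25", "sf")]
    with tempfile.TemporaryDirectory() as d:
        write_columns(rows, ["name", "age", "city"], d)
        for record in read_columns(d, ["name", "city"]):
            print(record)
```

The point of the sketch is that the "age" file is never opened when only 
"name" and "city" are requested, which is where the reduction in data read 
would come from. Values come back as strings, and the sketch ignores 
compression and splitting across multiple files.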

I got stuck earlier because of the way Hadoop generates output filenames. At 
the time I didn't spend much time exploring it, but I believe there is a way 
around it. If anyone is interested, please start a discussion. I should have 
more free time next month and will get back to this then.


> Column store
> ------------
>
>                 Key: PIG-210
>                 URL: https://issues.apache.org/jira/browse/PIG-210
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>            Reporter: John DeTreville
>
> I believe that Pig stores its tables in row order, which is less efficient in 
> space and time than column order in a data-mining system. Column stores can 
> be more highly compressed, and can be read and written faster. It should be 
> possible for clients to store their tables in column order.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
