[ https://issues.apache.org/jira/browse/PIG-55?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12574311#action_12574311 ]

groves edited comment on PIG-55 at 3/2/08 3:50 PM:
-----------------------------------------------------------

Updates my previous patch to use the generic DataStorage classes instead of 
Hadoop-specific code.  This fixes issues a and b that you raised.  Individual 
backends will have to implement something to create a Chunker and hook Chunks 
into their processing setup, as PigInputFormat and ChunkWrapper in the patch 
do for Hadoop, but any implementations of Chunk and Chunker should be 
backend-agnostic.  As a bonus, PigChunker in the patch implements the default 
file selection and LoadFunc processing that used to be in PigInputFormat, and 
any backend should be able to instantiate it to pick up normal Pig processing 
for free.
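
For anyone following along without the patch applied, the rough shape of the 
backend-agnostic pieces is something like the sketch below.  The method names 
and signatures are illustrative only; the real ones are in 
pig_chunker_split.patch.

    // Illustrative only -- not the exact signatures from the patch.
    // Imports (java.util, java.io, org.apache.pig.backend.datastorage) omitted.
    // A Chunk names a unit of input plus the code that should read it; a
    // Chunker decides how the requested input gets carved into Chunks.
    public interface Chunk {
        List<ElementDescriptor> getElements();   // files (or pieces of files) to read
        String getLoadFuncSpec();                // LoadFunc to apply to those bytes
    }

    public interface Chunker {
        List<Chunk> makeChunks(DataStorage store, String input) throws IOException;
    }

On Hadoop that hookup is (roughly) ChunkWrapper presenting each Chunk as an 
InputSplit for PigInputFormat to hand out; another backend would do the 
equivalent with its own unit of scheduling.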

I think c and d are outside the scope of this patch.  Both of those problems 
relate to sharing the code that processes the actual bytes of a Chunk, and 
that can be built on top of this change.  This patch is only concerned with 
exposing to user code the determination of which files to read and what code 
should read them.
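
To make the layering concrete: once Chunks exist, a shared helper along the 
lines below (names invented for illustration, building on the Chunk sketch 
above) could do the byte-level reading for any backend, and that is where c 
and d would naturally live.

    // Hypothetical shared layer on top of the Chunk sketch -- not in the patch.
    import java.io.IOException;
    import org.apache.pig.LoadFunc;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.impl.io.BufferedPositionedInputStream;
    import org.apache.pig.backend.datastorage.ElementDescriptor;

    public class ChunkReader {
        public static void feed(Chunk chunk, LoadFunc loader) throws IOException {
            for (ElementDescriptor elem : chunk.getElements()) {
                // open() comes from the backend-neutral DataStorage API
                loader.bindTo(elem.toString(),
                              new BufferedPositionedInputStream(elem.open()),
                              0, Long.MAX_VALUE);
                Tuple t;
                while ((t = loader.getNext()) != null) {
                    // hand each tuple to whatever pipeline the backend runs
                }
            }
        }
    }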

I made some minor modifications to the DataStorage code to allow easier access 
to the properties of an individual element as their actual types.  It seemed 
ridiculous to turn longs into strings only to immediately turn them back into 
longs all over the place.
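
Concretely, this is the sort of round-trip the change removes (the key and 
accessor names here are just for illustration, not the actual DataStorage API):

    // Before: values come back as strings, so every caller re-parses them.
    Properties stats = elem.getConfiguration();
    long blockSizeBefore = Long.parseLong(stats.getProperty("block.size"));

    // After: the element can answer with the value in its real type
    // (hypothetical typed accessor shown).
    long blockSizeAfter = elem.getLong("block.size");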

The patch passes all of the tests for me.  It's awesome to go away for a month, 
svn update, and have the tests run in a fifth of the time they used to.

      was (Author: groves):
    Updates my previous patch to use the generic DataStorage classes instead of 
Hadoop specific code.
  
> Allow user control over split creation
> --------------------------------------
>
>                 Key: PIG-55
>                 URL: https://issues.apache.org/jira/browse/PIG-55
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Charlie Groves
>         Attachments: pig_chunker_split.patch, replaceable_PigSplit.diff, 
> replaceable_PigSplit_v2.diff
>
>
> I have a dataset in HDFS that's stored in a file per column that I'd like to 
> access from pig.  This means I can't use LoadFunc to get at the data as it 
> only allows the loader access to a single input stream at a time.  To handle 
> this usage, I've broken the existing split creation code out into a few 
> classes and interfaces, and allowed user-specified load functions to be used 
> in place of the existing code.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.