Dear Wiki user, You have subscribed to a wiki page or wiki category on "Pig Wiki" for change notification.
The "Pig070LoadStoreHowTo" page has been changed by PradeepKamath.
http://wiki.apache.org/pig/Pig070LoadStoreHowTo?action=diff&rev1=2&rev2=3

--------------------------------------------------
  return only required fields, it should implement LoadPushDown to improve query performance.
   * !LoadCaster has methods to convert byte arrays to specific types. A loader implementation should implement this if casts (implicit or explicit) from !DataByteArray fields to other types need to be supported.

- The !LoadFunc abstract class
+ The !LoadFunc abstract class is the main class to extend to implement a loader. The methods which need to be overridden are explained below:
+  * getInputFormat(): This method is called by Pig to get the !InputFormat used by the loader. Pig calls the methods of the !InputFormat (and of the underlying !RecordReader) in the same manner, and in the same context, as Hadoop does in a map-reduce Java program. If the !InputFormat is one packaged with Hadoop, the implementation should use the new-API version under org.apache.hadoop.mapreduce. A custom !InputFormat should likewise be implemented using the new API in org.apache.hadoop.mapreduce.
+  * setLocation(): This method is called by Pig to communicate the load location to the loader. The loader should use this method to communicate the same information to the underlying !InputFormat. Pig calls this method multiple times, so implementations should ensure that the repeated calls cause no inconsistent side effects.
+  * prepareToRead(): Through this method, the !RecordReader associated with the !InputFormat provided by the !LoadFunc is passed to the !LoadFunc. The implementation can then use the !RecordReader in getNext() to return a tuple representing a record of data back to Pig.
+  * getNext(): The meaning of getNext() has not changed: it is called by the Pig runtime to get the next tuple in the data. In the new API, this is the method in which the implementation uses the underlying !RecordReader to construct a tuple.
+
+ The following methods have default implementations in !LoadFunc and should be overridden only if needed:
+  * setUdfContextSignature(): This method is called by Pig, both in the front end and in the back end, to pass a unique signature to the loader. The signature can be used to store into the !UDFContext any information which the loader needs to keep between method invocations in the front end and back end. One use case is to store the !RequiredFieldList passed in !LoadPushDown.pushProjection(RequiredFieldList) so that the back end can use it before returning tuples in getNext(). The default implementation in !LoadFunc has an empty body. This method is called before the other methods.
+  * relativeToAbsolutePath(): The Pig runtime calls this method to allow the loader to convert a relative load location to an absolute location. The default implementation in !LoadFunc handles this for !FileSystem locations. If the load source is something else, the loader implementation may choose to override this method.
+
+ === Example Implementation ===
+ The loader implementation in the example is a loader for text data with '\n' as the line delimiter and '\t' as the default field delimiter (which can be overridden by passing a different field delimiter to the constructor). This is similar to the current !PigStorage loader in Pig. The new implementation uses an existing Hadoop-supported !InputFormat, !TextInputFormat, as the underlying !InputFormat.
+
+ {{{
+ import java.io.IOException;
+ import java.util.ArrayList;
+
+ import org.apache.hadoop.io.Text;
+ import org.apache.hadoop.mapreduce.InputFormat;
+ import org.apache.hadoop.mapreduce.Job;
+ import org.apache.hadoop.mapreduce.RecordReader;
+ import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
+ import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
+ import org.apache.pig.LoadFunc;
+ import org.apache.pig.PigException;
+ import org.apache.pig.backend.executionengine.ExecException;
+ import org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigSplit;
+ import org.apache.pig.data.DataByteArray;
+ import org.apache.pig.data.Tuple;
+ import org.apache.pig.data.TupleFactory;
+
+ public class SimpleTextLoader extends LoadFunc {
+     protected RecordReader in = null;
+     private byte fieldDel = '\t';
+     private ArrayList<Object> mProtoTuple = null;
+     private TupleFactory mTupleFactory = TupleFactory.getInstance();
+     private static final int BUFFER_SIZE = 1024;
+
+     public SimpleTextLoader() {
+     }
+
+     /**
+      * Constructs a Pig loader that uses specified character as a field delimiter.
+      *
+      * @param delimiter
+      *            the single byte character that is used to separate fields.
+      *            ("\t" is the default.)
+      */
+     public SimpleTextLoader(String delimiter) {
+         this();
+         if (delimiter.length() == 1) {
+             this.fieldDel = (byte)delimiter.charAt(0);
+         } else if (delimiter.length() > 1 && delimiter.charAt(0) == '\\') {
+             switch (delimiter.charAt(1)) {
+             case 't':
+                 this.fieldDel = (byte)'\t';
+                 break;
+
+             case 'x':
+                 // digits after "\x" are parsed as hexadecimal
+                 this.fieldDel =
+                     Integer.valueOf(delimiter.substring(2), 16).byteValue();
+                 break;
+
+             case 'u':
+                 // digits after "\u" are parsed as decimal here
+                 this.fieldDel =
+                     Integer.valueOf(delimiter.substring(2)).byteValue();
+                 break;
+
+             default:
+                 throw new RuntimeException("Unknown delimiter " + delimiter);
+             }
+         } else {
+             throw new RuntimeException("PigStorage delimiter must be a single character");
+         }
+     }
+
+     @Override
+     public Tuple getNext() throws IOException {
+         try {
+             boolean notDone = in.nextKeyValue();
+             if (!notDone) {
+                 // no more records in this split
+                 return null;
+             }
+             Text value = (Text) in.getCurrentValue();
+             byte[] buf = value.getBytes();
+             // only the first getLength() bytes of the buffer are valid
+             int len = value.getLength();
+             int start = 0;
+
+             for (int i = 0; i < len; i++) {
+                 if (buf[i] == fieldDel) {
+                     readField(buf, start, i);
+                     start = i + 1;
+                 }
+             }
+             // pick up the last field
+             readField(buf, start, len);
+
+             Tuple t = mTupleFactory.newTupleNoCopy(mProtoTuple);
+             mProtoTuple = null;
+             return t;
+         } catch (InterruptedException e) {
+             int errCode = 6018;
+             String errMsg = "Error while reading input";
+             throw new ExecException(errMsg, errCode,
+                     PigException.REMOTE_ENVIRONMENT, e);
+         }
+
+     }
+
+     private void readField(byte[] buf, int start, int end) {
+         if (mProtoTuple == null) {
+             mProtoTuple = new ArrayList<Object>();
+         }
+
+         if (start == end) {
+             // NULL value
+             mProtoTuple.add(null);
+         } else {
+             mProtoTuple.add(new DataByteArray(buf, start, end));
+         }
+     }
+
+     @Override
+     public InputFormat getInputFormat() {
+         return new TextInputFormat();
+     }
+
+     @Override
+     public void prepareToRead(RecordReader reader, PigSplit split) {
+         in = reader;
+     }
+
+     @Override
+     public void setLocation(String location, Job job)
+             throws IOException {
+         FileInputFormat.setInputPaths(job, location);
+     }
+ }
+ }}}
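
The field-splitting loop in getNext() can be tried outside Pig. The following is a minimal sketch that uses only the Java standard library. The class name FieldSplitSketch and the methods splitFields and copyField are hypothetical names chosen for illustration; they are not part of Pig or Hadoop, and plain byte arrays stand in for !DataByteArray.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch; not part of Pig or Hadoop.
public class FieldSplitSketch {

    /**
     * Splits the first len bytes of buf on the delimiter, mirroring the
     * loop in getNext(). Empty fields become null, as in readField().
     */
    public static List<byte[]> splitFields(byte[] buf, int len, byte delimiter) {
        List<byte[]> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < len; i++) {
            if (buf[i] == delimiter) {
                fields.add(copyField(buf, start, i));
                start = i + 1;
            }
        }
        // Pick up the last field, which has no trailing delimiter.
        fields.add(copyField(buf, start, len));
        return fields;
    }

    // Copies bytes [start, end) or returns null for an empty field.
    private static byte[] copyField(byte[] buf, int start, int end) {
        return start == end ? null : Arrays.copyOfRange(buf, start, end);
    }

    public static void main(String[] args) {
        byte[] line = "alice\t\t42".getBytes(StandardCharsets.UTF_8);
        for (byte[] field : splitFields(line, line.length, (byte) '\t')) {
            System.out.println(field == null
                    ? "<null>"
                    : new String(field, StandardCharsets.UTF_8));
        }
    }
}
```

The loop is bounded by the reported length rather than by buf.length because Hadoop's Text.getBytes() returns the backing array, whose contents are only valid up to getLength(). Two adjacent delimiters yield a null field, which matches the NULL handling in readField().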