Hey Mike,

Sorry about that, it's mainly b/c they're tedious to write and I've been lazy about it. Here's the skinny.

For the SeqFileSource, we assume that you're only interested in the "value" portion of the key-value pair for each record in the SequenceFile. The PType<T> should be for whatever data type you expect to read from that value, which is probably a class that implements Writable. The easiest way to do it is:

    import static org.apache.crunch.types.writable.Writables.writables;
    import org.apache.crunch.io.From;

    // This reads the value and ignores the key in each record
    PCollection<MyWritable> in = pipeline.read(
        From.sequenceFile(<path>, writables(MyWritable.class)));

If you want both the key and the value, you need to read the SequenceFile as a PTable<K, V>, like this:

    PTable<MyKey, MyValue> in = pipeline.read(
        From.sequenceFile(<path>, writables(MyKey.class), writables(MyValue.class)));

After you read in the values, you're free to convert them to whatever types you like using parallelDo and friends. I especially recommend using the Avro-based PTypeFamily, since it will significantly outperform the Writable family on jobs that involve complex joins or aggregations.

Hope that helps, feel free to send follow-ups.

Josh

On Mon, Dec 3, 2012 at 2:25 PM, Mike Barretta <[email protected]> wrote:

> As there are no examples on using non-text files as input, I'm trying to
> piece together the steps involved in reading in sequence data.
>
> The main piece looks to be the SeqFileSource (as of 0.5 snapshot) which
> takes a path and a PType. The PType is where my confusion begins.
>
> How does PType relate to InputFormat and OutputFormat? Do I need to
> implement my own PTypes and the associated in/out MapFns?
>
> Thanks,
> Mike

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
