+1 In general I think you would just need to parse the interesting fields via a Java pcap format reader (or do the byte reading yourself; the format is pretty simple and well documented: http://wiki.wireshark.org/Development/LibpcapFileFormat), put them into a Writable object, and write them to HDFS as a SequenceFile.
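As a rough, untested sketch of that option: the converter below does the byte reading itself per the libpcap layout on the page linked above, keying each record by its capture timestamp. The class name PcapToSequenceFile and the command-line argument handling are made up for illustration; it assumes a classic microsecond-resolution trace.

import java.io.DataInputStream;
import java.io.EOFException;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;

public class PcapToSequenceFile {

  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    DataInputStream in = new DataInputStream(new FileInputStream(args[0]));

    // 24-byte global header; the magic number tells us the byte order.
    byte[] global = new byte[24];
    in.readFully(global);
    boolean littleEndian = (global[0] & 0xff) == 0xd4;  // 0xd4c3b2a1 on disk

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), LongWritable.class, BytesWritable.class);

    byte[] recHdr = new byte[16];
    try {
      while (true) {
        in.readFully(recHdr);  // ts_sec, ts_usec, incl_len, orig_len
        long tsSec   = readUInt32(recHdr, 0, littleEndian);
        long tsUsec  = readUInt32(recHdr, 4, littleEndian);
        int  inclLen = (int) readUInt32(recHdr, 8, littleEndian);

        byte[] packet = new byte[inclLen];
        in.readFully(packet);

        // Key: capture timestamp in microseconds; value: raw packet bytes.
        writer.append(new LongWritable(tsSec * 1000000L + tsUsec),
                      new BytesWritable(packet));
      }
    } catch (EOFException done) {
      // normal end of trace
    } finally {
      writer.close();
      in.close();
    }
  }

  private static long readUInt32(byte[] b, int off, boolean le) {
    if (le) {
      return (b[off] & 0xffL) | ((b[off + 1] & 0xffL) << 8)
           | ((b[off + 2] & 0xffL) << 16) | ((b[off + 3] & 0xffL) << 24);
    }
    return ((b[off] & 0xffL) << 24) | ((b[off + 1] & 0xffL) << 16)
         | ((b[off + 2] & 0xffL) << 8) | (b[off + 3] & 0xffL);
  }
}

A nice side effect: SequenceFiles carry sync markers, so the converted trace splits cleanly across map tasks, which raw pcap does not (see Ari's point below).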
Another option is using a binary serialization package such as Avro, Thrift, or protobuf and writing the serialized form to HDFS. You would then need to write your own InputFormat/RecordReader for it, or wait for http://issues.apache.org/jira/browse/MAPREDUCE-377 or some other native support; a rough sketch of the InputFormat/RecordReader machinery follows below the quoted thread.

Will

On Wed, Jul 29, 2009 at 7:21 PM, Ariel Rabkin <[email protected]> wrote:
> I remember looking at this some months back.
>
> My recollection is that PCAP is a somewhat awkward format to
> MapReduce over, since it isn't splittable: you can't find record
> boundaries if you start at a random offset.
>
> You may want to do some sort of preprocessing before you upload your
> logs to HDFS to fix this. Irritatingly, the existing code I've seen
> for processing PCAP files doesn't seem very friendly to parsing
> arbitrary packet-trace data in memory.
>
> --Ari
>
> On Tue, Jul 28, 2009 at 8:31 AM, Wasim Bari <[email protected]> wrote:
>> Hi,
>>
>> I have data in PCAP file format (packet capture for network traffic).
>> Is it possible to process this file in Hadoop in the same format? Or
>> is there any supporting tool over Hadoop to analyze data from PCAP
>> files?
>>
>> Bye
>>
>> Wasim
>
> --
> Ari Rabkin [email protected]
> UC Berkeley Computer Science Department
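To make the InputFormat/RecordReader route concrete, here is a minimal sketch against the old org.apache.hadoop.mapred API. For illustration it reads raw pcap records directly rather than an Avro/Thrift/protobuf stream, and it assumes a little-endian trace; the class name PcapInputFormat is made up. Note that isSplitable() returns false precisely because of the record-boundary problem Ari describes, so each map task gets a whole file.

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class PcapInputFormat extends FileInputFormat<LongWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // can't find pcap record boundaries at a random offset
  }

  @Override
  public RecordReader<LongWritable, BytesWritable> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    return new PcapRecordReader((FileSplit) split, job);
  }

  static class PcapRecordReader implements RecordReader<LongWritable, BytesWritable> {
    private final FSDataInputStream in;
    private final long end;

    PcapRecordReader(FileSplit split, JobConf job) throws IOException {
      Path path = split.getPath();
      FileSystem fs = path.getFileSystem(job);
      in = fs.open(path);
      end = split.getStart() + split.getLength();
      in.seek(24);  // skip the global header; a real reader would check the magic
    }

    public boolean next(LongWritable key, BytesWritable value) throws IOException {
      if (in.getPos() >= end) return false;
      byte[] hdr = new byte[16];    // ts_sec, ts_usec, incl_len, orig_len
      in.readFully(hdr);
      long tsSec   = le32(hdr, 0);
      long tsUsec  = le32(hdr, 4);
      int  inclLen = (int) le32(hdr, 8);
      byte[] packet = new byte[inclLen];
      in.readFully(packet);
      key.set(tsSec * 1000000L + tsUsec);
      value.set(packet, 0, inclLen);
      return true;
    }

    public LongWritable createKey() { return new LongWritable(); }
    public BytesWritable createValue() { return new BytesWritable(); }
    public long getPos() throws IOException { return in.getPos(); }
    public float getProgress() throws IOException {
      return end == 0 ? 1.0f : Math.min(1.0f, in.getPos() / (float) end);
    }
    public void close() throws IOException { in.close(); }

    private static long le32(byte[] b, int off) {
      return (b[off] & 0xffL) | ((b[off + 1] & 0xffL) << 8)
           | ((b[off + 2] & 0xffL) << 16) | ((b[off + 3] & 0xffL) << 24);
    }
  }
}

This inherits the non-splittability Ari mentions (one map per file), so for large traces the preprocessing he suggests is still worthwhile; converting to SequenceFiles up front, as in the first sketch, is one way to do it.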
