We ended up implementing custom Hadoop InputFormats and RecordReaders by
extending FileInputFormat / RecordReader, and using sc.newAPIHadoopFile to
read it as an RDD.
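
Roughly, the pieces look like this (a minimal sketch assuming each record starts
with a 4-byte length header; the class names CustomInputFormat / CustomRecordReader
and the header layout are placeholders for illustration, not our actual format):

import org.apache.hadoop.fs.{FSDataInputStream, Path}
import org.apache.hadoop.io.{BytesWritable, LongWritable}
import org.apache.hadoop.mapreduce.{InputSplit, JobContext, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, FileSplit}

// Emits (byte offset, record bytes) pairs. Assumes the layout
// "4-byte length header, then that many payload bytes", repeated to EOF.
class CustomRecordReader extends RecordReader[LongWritable, BytesWritable] {
  private var in: FSDataInputStream = _
  private var pos: Long = 0L
  private var end: Long = 0L
  private val key = new LongWritable()
  private val value = new BytesWritable()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    val file = split.asInstanceOf[FileSplit].getPath
    val fs = file.getFileSystem(context.getConfiguration)
    in = fs.open(file)
    end = fs.getFileStatus(file).getLen
  }

  override def nextKeyValue(): Boolean = {
    if (pos >= end) return false
    val length = in.readInt()           // header: length of this record
    val bytes = new Array[Byte](length)
    in.readFully(bytes)                 // payload: the variable-length record body
    key.set(pos)
    value.set(bytes, 0, length)
    pos += 4 + length
    true
  }

  override def getCurrentKey: LongWritable = key
  override def getCurrentValue: BytesWritable = value
  override def getProgress: Float = if (end == 0L) 1.0f else pos.toFloat / end
  override def close(): Unit = if (in != null) in.close()
}

class CustomInputFormat extends FileInputFormat[LongWritable, BytesWritable] {
  // Records are variable-length, so splits can't be aligned safely;
  // let one reader scan each file end to end.
  override def isSplitable(context: JobContext, file: Path): Boolean = false

  override def createRecordReader(split: InputSplit, context: TaskAttemptContext)
      : RecordReader[LongWritable, BytesWritable] = new CustomRecordReader
}

The driver side is then just:

val raw = sc.newAPIHadoopFile(
  "/data/legacy.bin",                  // placeholder path
  classOf[CustomInputFormat],
  classOf[LongWritable],
  classOf[BytesWritable])

// The reader reuses its BytesWritable, so copy the bytes out
// before caching or collecting the values.
val records = raw.map { case (_, bw) => bw.copyBytes() }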
On Wed, Mar 9, 2016 at 9:15 AM Ruslan Dautkhanov wrote:
> We have a huge binary file in a custom serialization format (e.g. header
> tells the length of the record, then there is a varying number of items for
> that record).
bq. there is a varying number of items for that record
If the number of possible item combinations is very large, using a case class
would be tedious.
On Wed, Mar 9, 2016 at 9:57 AM, Saurabh Bajaj wrote:
> You can load that binary up as a String RDD, then map over that RDD and
> convert each row to your case class representing the data.
You can load that binary up as a String RDD, then map over that RDD and
convert each row to your case class representing the data. In the map stage
you could also map the input strings into an RDD of JSON values and use the
following function to convert it into a DataFrame:
http://spark.apache.org/docs/late
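
For example (the Record fields, the pipe-delimited line layout, and the path
below are assumptions for the sketch, not the actual format):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical record shape, for illustration only
case class Record(id: Long, items: Seq[Double])

object StringRddToDataFrame {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("string-rdd-to-df"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Assume the records have been dumped to text, one per line:
    // "<id>|<item1>,<item2>,..."
    val lines = sc.textFile("/path/to/records.txt")

    val records = lines.map { line =>
      val Array(id, items) = line.split('|')
      Record(id.toLong, items.split(',').map(_.toDouble).toSeq)
    }

    val df = records.toDF()            // schema comes from the case class via reflection
    df.registerTempTable("records")    // now queryable through Spark SQL
  }
}

If you instead map each record to a JSON string, sqlContext.read.json on that
RDD[String] will infer the schema and give you the DataFrame directly, which
avoids hand-writing the case class.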
We have a huge binary file in a custom serialization format (e.g. the header
tells the length of the record, then there is a varying number of items for
that record). This is produced by an old C++ application.
What would be the best approach to deserialize it into a Hive table or a Spark
RDD?
Format is known.