[ https://issues.apache.org/jira/browse/SPARK-25405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sean Owen resolved SPARK-25405.
-------------------------------
    Resolution: Not A Problem

Looks like you are using the old MapReduce OutputFormat classes with the 'new' API methods in Spark. This isn't a bug. Use the newer OutputFormat implementations under .mapreduce (a sketch of that change follows the quoted issue below).

> Saving RDD with new Hadoop API file as a Sequence File too restrictive
> ----------------------------------------------------------------------
>
>                 Key: SPARK-25405
>                 URL: https://issues.apache.org/jira/browse/SPARK-25405
>             Project: Spark
>          Issue Type: Bug
>          Components: Input/Output
>    Affects Versions: 2.2.0
>            Reporter: Marcin Gasior
>            Priority: Major
>
> I tried to transform an HBase export (a sequence file) with a Spark job and ran into a compilation issue:
>
> {code:java}
> val hc = sc.hadoopConfiguration
> val serializers = List(
>   classOf[WritableSerialization].getName,
>   classOf[ResultSerialization].getName
> ).mkString(",")
> hc.set("io.serializations", serializers)
>
> val c = new Configuration(sc.hadoopConfiguration)
> c.set("mapred.input.dir", sourcePath)
>
> val subsetRDD = sc.newAPIHadoopRDD(
>   c,
>   classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
>   classOf[ImmutableBytesWritable],
>   classOf[Result])
>
> subsetRDD.saveAsNewAPIHadoopFile(
>   "output/sequence",
>   classOf[ImmutableBytesWritable],
>   classOf[Result],
>   classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
>   hc
> )
> {code}
>
> During compilation I received:
>
> {code:java}
> Error: type mismatch
> Class[org.apache.hadoop.mapred.SequenceFileOutputFormat[org.apache.hadoop.hbase.io.ImmutableBytesWritable,org.apache.hadoop.hbase.client.Result]](classOf[org.apache.hadoop.mapred.SequenceFileOutputFormat])
> required: Class[_ <: org.apache.hadoop.mapreduce.OutputFormat[_, _]]
> classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
> {code}
>
> By using the low-level Hadoop API I could work around the issue as follows:
>
> {code:java}
> val writer = SequenceFile.createWriter(hc, Writer.file(new Path("sample")),
>   Writer.keyClass(classOf[ImmutableBytesWritable]),
>   Writer.valueClass(classOf[Result]),
>   Writer.bufferSize(fs.getConf().getInt("io.file.buffer.size", 4096)),
>   Writer.replication(fs.getDefaultReplication()),
>   Writer.blockSize(1073741824),
>   Writer.compression(SequenceFile.CompressionType.BLOCK, new DefaultCodec()),
>   Writer.progressable(null),
>   Writer.metadata(new Metadata()))
> subset.foreach(p => writer.append(p._1, p._2))
> IOUtils.closeStream(writer)
> {code}
>
> I think the interface is too restrictive and does not allow passing in external (de)serializers.
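
For reference, a minimal sketch of the change the resolution suggests, assuming the same sc, sourcePath, and HBase classes as in the reporter's snippet: the only substantive difference is that both SequenceFile formats are imported from the new org.apache.hadoop.mapreduce packages rather than org.apache.hadoop.mapred.

{code:java}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.ResultSerialization
import org.apache.hadoop.io.serializer.WritableSerialization
// The key change: import from .mapreduce.lib.*, not .mapred.*
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat

// Register the HBase Result (de)serializer alongside the Writable one,
// exactly as in the original report.
val hc = sc.hadoopConfiguration
hc.set("io.serializations", List(
  classOf[WritableSerialization].getName,
  classOf[ResultSerialization].getName
).mkString(","))

val c = new Configuration(hc)
c.set("mapred.input.dir", sourcePath) // sourcePath as in the report

val subsetRDD = sc.newAPIHadoopRDD(
  c,
  classOf[SequenceFileInputFormat[ImmutableBytesWritable, Result]],
  classOf[ImmutableBytesWritable],
  classOf[Result])

// This now compiles: the mapreduce SequenceFileOutputFormat extends
// org.apache.hadoop.mapreduce.OutputFormat, which is the bound that
// saveAsNewAPIHadoopFile requires.
subsetRDD.saveAsNewAPIHadoopFile(
  "output/sequence",
  classOf[ImmutableBytesWritable],
  classOf[Result],
  classOf[SequenceFileOutputFormat[ImmutableBytesWritable, Result]],
  hc)
{code}

Since hc (with io.serializations set) is passed through to the output job, the new-API SequenceFile writer should still pick up the external ResultSerialization, so the serializer setup itself needs no change.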