[ https://issues.apache.org/jira/browse/FLINK-1396?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14305914#comment-14305914 ]
ASF GitHub Bot commented on FLINK-1396:
---------------------------------------

Github user fhueske commented on a diff in the pull request:

    https://github.com/apache/flink/pull/363#discussion_r24117856

    --- Diff: docs/hadoop_compatibility.md ---
    @@ -52,56 +63,70 @@ Add the following dependency to your `pom.xml` to use the Hadoop Compatibility L
     
     ### Using Hadoop Data Types
     
    -Flink supports all Hadoop `Writable` and `WritableComparable` data types out-of-the-box. You do not need to include the Hadoop Compatibility dependency, if you only want to use your Hadoop data types. See the [Programming Guide](programming_guide.html#data-types) for more details.
    +Flink supports all Hadoop `Writable` and `WritableComparable` data types
    +out-of-the-box. You do not need to include the Hadoop Compatibility dependency,
    +if you only want to use your Hadoop data types. See the
    +[Programming Guide](programming_guide.html#data-types) for more details.
     
     ### Using Hadoop InputFormats
     
    -Flink provides a compatibility wrapper for Hadoop `InputFormats`. Any class that implements `org.apache.hadoop.mapred.InputFormat` or extends `org.apache.hadoop.mapreduce.InputFormat` is supported. Thus, Flink can handle Hadoop built-in formats such as `TextInputFormat` as well as external formats such as Hive's `HCatInputFormat`. Data read from Hadoop InputFormats is converted into a `DataSet<Tuple2<KEY,VALUE>>` where `KEY` is the key and `VALUE` is the value of the original Hadoop key-value pair.
    -
    -Flink's InputFormat wrappers are
    -
    -- `org.apache.flink.hadoopcompatibility.mapred.HadoopInputFormat` and
    -- `org.apache.flink.hadoopcompatibility.mapreduce.HadoopInputFormat`
    +Hadoop input formats can be used to create a data source by using
    +on of the methods `readHadoopFile` or `createHadoopInput` of the
    +`ExecutionEnvironment`. The former is used for input formats derived
    +from `FileInputFormat` while the latter has to be used for general purpose
    +input formats.
     
    -and can be used as regular Flink [InputFormats](programming_guide.html#data-sources).
    +The resulting `DataSet` contains 2-tuples where the first field
    +is the key and the second field is the value retrieved from the Hadoop
    +InputFormat. The following example shows how to use Hadoop's `TextInputFormat`.
     
    +<div class="codetabs" markdown="1">
    +<div data-lang="java" markdown="1">
    +
     ~~~java
     ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
    -
    -// Set up the Hadoop TextInputFormat.
    -Job job = Job.getInstance();
    -HadoopInputFormat<LongWritable, Text> hadoopIF =
    -  // create the Flink wrapper.
    -  new HadoopInputFormat<LongWritable, Text>(
    -    // create the Hadoop InputFormat, specify key and value type, and job.
    -    new TextInputFormat(), LongWritable.class, Text.class, job
    -  );
    -TextInputFormat.addInputPath(job, new Path(inputPath));
    -
    -// Read data using the Hadoop TextInputFormat.
    -DataSet<Tuple2<LongWritable, Text>> text = env.createInput(hadoopIF);
    +
    +DataSet<Tuple2<LongWritable, Text>> input =
    +  env.readHadoopFile(new TextInputFormat(), LongWritable.class, Text.class, textPath);
     
     // Do something with the data.
     [...]
     ~~~
     
    -### Using Hadoop OutputFormats
    +</div>
    +<div data-lang="scala" markdown="1">
     
    -Flink provides a compatibility wrapper for Hadoop `OutputFormats`. Any class that implements `org.apache.hadoop.mapred.OutputFormat` or extends `org.apache.hadoop.mapreduce.OutputFormat` is supported. The OutputFormat wrapper expects its input data to be a `DataSet<Tuple2<KEY,VALUE>>` where `KEY` is the key and `VALUE` is the value of the Hadoop key-value pair that is processed by the Hadoop OutputFormat.
    +~~~scala
    +val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +val input: DataSet[(LongWritable, Text)] =
    +  env.readHadoopFile(new TextInputFormat, classOf[LongWritable], classOf[Text], textPath)
     
    -Flink's OUtputFormat wrappers are
    +// Do something with the data.
    +[...]
    +~~~
    +
    +</div>
     
    -- `org.apache.flink.hadoopcompatibility.mapred.HadoopOutputFormat` and
    -- `org.apache.flink.hadoopcompatibility.mapreduce.HadoopOutputFormat`
    +</div>
    +
    +### Using Hadoop OutputFormats
     
    -and can be used as regular Flink [OutputFormats](programming_guide.html#data-sinks).
    +Flink provides a compatibility wrapper for Hadoop `OutputFormats`. Any class
    +that implements `org.apache.hadoop.mapred.OutputFormat` or extend
    --- End diff --
    
    extend -> extends

> Add hadoop input formats directly to the user API.
> --------------------------------------------------
>
>                 Key: FLINK-1396
>                 URL: https://issues.apache.org/jira/browse/FLINK-1396
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Robert Metzger
>            Assignee: Aljoscha Krettek
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
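
Editor's note: the new text in the diff above names both `readHadoopFile` and `createHadoopInput`, but the hunk only demonstrates the former. A hedged sketch of the general-purpose `createHadoopInput` path, in the same Java style as the documented example, might look as follows. This is not part of the reviewed diff; the `Job`-based setup and the use of `TextInputFormat` as the stand-in format are illustrative assumptions.

```java
// Sketch only, assuming the ExecutionEnvironment API this PR documents.
// createHadoopInput takes an already-configured Hadoop InputFormat plus its
// key/value classes, instead of a file path.
ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

Job job = Job.getInstance();
// Configure the InputFormat through the Job as usual. Any general-purpose
// org.apache.hadoop.mapreduce.InputFormat (e.g. Hive's HCatInputFormat,
// mentioned in the removed text) could stand in for TextInputFormat here.
TextInputFormat.addInputPath(job, new Path(inputPath));

DataSet<Tuple2<LongWritable, Text>> input =
    env.createHadoopInput(new TextInputFormat(), LongWritable.class, Text.class, job);

// Do something with the data.
```

A Scala variant would presumably mirror the `readHadoopFile` Scala example from the diff, swapping in `createHadoopInput` with the same `classOf[...]` arguments plus the `Job`.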