Hi Niels,

Jörn is right. Although it offers different methods, Flink's InputFormat
interface is very similar to Hadoop's InputFormat interface.
The InputFormat.createInputSplits() method generates splits that can be
read in parallel.
The FileInputFormat splits files at fixed boundaries (usually the HDFS
block size) and expects the concrete InputFormat to find the right place to
start and end reading.
For files that are read line by line (TextInputFormat) or that have a record
delimiter (DelimitedInputFormat), the format skips ahead to the first
delimiter in its split, starts reading with the record after it, and stops
at the first delimiter after the split boundary.
The BinaryInputFormat extends FileInputFormat but overrides the
createInputSplits() method.

So, how exactly a file is read in parallel depends on the
createInputSplits() method of the InputFormat.
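
To make the delimiter rule above more concrete, here is a minimal sketch in
plain Java that does not depend on Flink. The names SplitReadSketch, Split,
createSplits and readSplit are made up for illustration and are not Flink's
API; it also assumes a single-byte delimiter and a file that fits in memory.
Flink's own FileInputFormat and DelimitedInputFormat handle many more cases.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustration only: hypothetical names, not Flink classes.
class SplitReadSketch {

    /** A byte range [start, start + length) of the file. */
    static final class Split {
        final int start;
        final int length;

        Split(int start, int length) {
            this.start = start;
            this.length = length;
        }
    }

    /** Cuts the file into fixed-size ranges without looking at record boundaries. */
    static List<Split> createSplits(int fileLength, int splitSize) {
        List<Split> splits = new ArrayList<>();
        for (int pos = 0; pos < fileLength; pos += splitSize) {
            splits.add(new Split(pos, Math.min(splitSize, fileLength - pos)));
        }
        return splits;
    }

    /** Returns the records that belong to one split under the delimiter rule. */
    static List<String> readSplit(byte[] file, Split split, byte delimiter) {
        List<String> records = new ArrayList<>();
        int end = split.start + split.length;
        int pos = split.start;

        if (split.start > 0) {
            // Not the first split: skip to the first delimiter in this split.
            // The bytes before it belong to a record owned by an earlier split.
            while (pos < file.length && file[pos] != delimiter) {
                pos++;
            }
            pos++; // step over the delimiter
        }

        // Read every record that starts at or before the split boundary. The
        // last one may run past the boundary up to its closing delimiter.
        while (pos <= end && pos < file.length) {
            int recordStart = pos;
            while (pos < file.length && file[pos] != delimiter) {
                pos++;
            }
            records.add(new String(file, recordStart, pos - recordStart,
                    StandardCharsets.UTF_8));
            pos++; // step over the delimiter
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] file = "alpha\nbeta\ngamma\ndelta\n".getBytes(StandardCharsets.UTF_8);
        // A split size of 7 cuts through the middle of records on purpose.
        for (Split split : createSplits(file.length, 7)) {
            System.out.println("split at " + split.start + ": "
                    + readSplit(file, split, (byte) '\n'));
        }
    }
}
```

Because each split skips the partial record at its start and reads past its
end to finish its last record, every record is read by exactly one split, no
matter where the fixed boundaries fall.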

Hope this helps,
Fabian


2018-02-18 13:36 GMT+01:00 Jörn Franke <jornfra...@gmail.com>:

> AFAIK Flink has a similar notion of splittable as Hadoop. Furthermore, for
> custom FileInputFormats you can set the attribute unsplittable = true if
> your file format cannot be split.
>
> > On 18. Feb 2018, at 13:28, Niels Basjes <ni...@basjes.nl> wrote:
> >
> > Hi,
> >
> > In Hadoop MapReduce there is the notion of "splittable" in the
> > FileInputFormat. This has the effect that a single input file can be fed
> > into multiple separate instances of the mapper that read the data.
> > A lot has been documented (e.g. text is splittable per line, gzipped text
> > is not splittable) and designed into the various file formats (like Avro
> > and Parquet) to allow splittability.
> >
> > The goal is that reading and parsing files can be done by multiple
> > cpus/systems in parallel.
> >
> > How is this handled in Flink?
> > Can Flink read a single file in parallel?
> > How does Flink administrate/handle the possibilities regarding the
> > various file formats?
> >
> >
> > The reason I ask is because I want to see if I can port this (now Hadoop
> > specific) hobby project of mine to work with Flink:
> > https://github.com/nielsbasjes/splittablegzip
> >
> > Thanks.
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
>
