Re: [DISCUSS] Unification of Hadoop related IO modules

dharmendra pratap singh Tue, 11 Sep 2018 02:11:49 -0700

Hello Team,
Does this mean that, as of today, we can read from Hadoop FS but can't write
to Hadoop FS using the Beam HDFS API?


Regards
Dharmendra

On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko wrote:

> Hello everyone,
>
> I’d like to discuss the following topic (see below) with the community, since
> the optimal solution is not clear to me.
>
> There is a Java IO module called “*hadoop-input-format*”, which allows
> using MapReduce InputFormat implementations to read data from different
> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
> As its name suggests, it has only a “Read” part and is missing a “Write”
> part, so I'm working on “*hadoop-output-format*” to support MapReduce
> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
> this I created another module with this name. So, in the end, we will have
> two different modules, “*hadoop-input-format*” and “*hadoop-output-format*”,
> which looks quite strange to me since, afaik, every existing Java IO that
> we have encapsulates its Read and Write parts in one module. Additionally,
> we have “*hadoop-common*” and “*hadoop-file-system*” as other
> hadoop-related modules.
>
> Now I’m thinking about how best to organise all these Hadoop
> modules. There are several options in my mind:
>
> 1) Add a new module “*hadoop-output-format*” and leave all Hadoop modules
> “as is”.
> Pros: no breaking changes, no additional work
> Cons: not logical for users to have the same IO in two different modules
> and with different names.
>
> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”, and
> keep the other Hadoop modules “as is”.
> Pros: InputFormat/OutputFormat live in one IO module, which is logical
> for users
> Cons: breaking changes for user code because of module/IO renaming
>
> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
> which will include the new “write” functionality and be a proxy for the old
> “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
> become deprecated and finally be moved into the common “*hadoop-format*”
> module in future releases. Keep the other Hadoop modules “as is”.
> Pros: eventually there will be only one module for the Hadoop MR format;
> changes are less painful for users
> Cons: hidden difficulties in implementing this strategy; a bit confusing
> for users
>
> 4) Add a new module “*hadoop*” and move all existing modules there
> as submodules (like we have for “*io/google-cloud-platform*”); merge
> “*hadoop-input-format*” and “*hadoop-output-format*” into one module.
> Pros: unification of all hadoop-related modules
> Cons: breaking changes for user code, additional complexity with deps and
> testing
>
> 5) Your suggestion?..
>
> My personal preference lies between 2 and 3 (if 3 is possible).
>
> I’m wondering if there were similar situations in Beam before and how they
> were finally resolved. If so, we should probably handle this one in a
> similar way.
> Any suggestions, advice, or comments would be much appreciated.
>
> Thanks,
> Alexey
>
