Hello Team,

Does this mean that, as of today, we can read from Hadoop FS but can't write to Hadoop FS using the Beam HDFS API?

Regards,
Dharmendra

On Thu, Sep 6, 2018 at 8:54 PM Alexey Romanenko <[email protected]> wrote:

> Hello everyone,
>
> I’d like to discuss the following topic (see below) with the community, since
> the optimal solution is not clear to me.
>
> There is a Java IO module called “*hadoop-input-format*”, which allows using
> MapReduce InputFormat implementations to read data from different
> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
> As its name suggests, it has only a “Read” part and is missing a “Write” part,
> so I'm working on “*hadoop-output-format*” to support MapReduce
> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
> this I created another module with this name. So, in the end, we will have
> two different modules, “*hadoop-input-format*” and “*hadoop-output-format*”,
> which looks quite strange to me since, afaik, every existing Java IO that
> we have encapsulates its Read and Write parts in one module. Additionally,
> we have “*hadoop-common*” and “*hadoop-file-system*” as other
> Hadoop-related modules.
>
> Now I’m thinking about how best to organise all these Hadoop
> modules. There are several options in my mind:
>
> 1) Add a new module “*hadoop-output-format*” and leave all Hadoop modules
> “as is”.
> Pros: no breaking changes, no additional work
> Cons: not logical for users to have the same IO in two different modules
> with different names.
>
> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”, and
> keep the other Hadoop modules “as is”.
> Pros: InputFormat/OutputFormat live in one IO module, which is logical
> for users
> Cons: breaking changes for user code because of module/IO renaming
>
> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
> which will include the new “write” functionality and be a proxy for the old
> “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
> become deprecated and finally be moved to the common “*hadoop-format*” module
> in future releases. Keep the other Hadoop modules “as is”.
> Pros: in the end there will be only one module for the Hadoop MR format;
> changes are less painful for users
> Cons: hidden difficulties in implementing this strategy; a bit confusing
> for users
>
> 4) Add a new module “*hadoop*” and move all existing modules there
> as submodules (like we have for “*io/google-cloud-platform*”), and merge
> “*hadoop-input-format*” and “*hadoop-output-format*” into one module.
> Pros: unification of all Hadoop-related modules
> Cons: breaking changes for user code, additional complexity with deps and
> testing
>
> 5) Your suggestion?..
>
> My personal preference lies between 2 and 3 (if 3 is possible).
>
> I’m wondering if there were similar situations in Beam before and how they
> were finally resolved. If so, then we should probably handle this in a
> similar way.
> Any suggestions, advice, or comments would be much appreciated.
>
> Thanks,
> Alexey
