Re: [DISCUSS] Unification of Hadoop related IO modules

Robert Bradshaw Fri, 07 Sep 2018 02:32:18 -0700

I think it makes sense to keep *hadoop-file-system* separate, as it's
common to use HDFS even if one is not using any of the other hadoop
(mapreduce) libraries. On the other hand, it makes a lot of sense to me to
put the hadoop read and write into the same module, probably going with
option (3) where *hadoop-input-format* would just be a (deprecated) alias
for *hadoop-mapreduce-format* until we can simply remove it. I don't know
enough about *hadoop-common* to judge whether it makes sense to merge it in
or just keep it separate.
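
To illustrate the "(deprecated) alias" idea in option (3), here is a minimal
sketch. It is not Beam code: UnifiedFormatIO and LegacyInputFormatIO are
hypothetical stand-ins for the new merged module and the old
hadoop-input-format entry point, and the Read type is reduced to a holder for
the InputFormat class name. The only point is that the old entry point keeps
compiling while forwarding everything to the new one.

```java
public class ProxySketch {

    /** Hypothetical unified module: would hold both the read and the write side. */
    static final class UnifiedFormatIO {
        private UnifiedFormatIO() {}

        static Read read(String inputFormatClassName) {
            return new Read(inputFormatClassName);
        }

        /** Simplified stand-in for a read transform; only records its configuration. */
        static final class Read {
            private final String inputFormatClassName;

            private Read(String inputFormatClassName) {
                this.inputFormatClassName = inputFormatClassName;
            }

            String inputFormatClassName() {
                return inputFormatClassName;
            }
        }
    }

    /** Hypothetical old entry point, kept only so existing user code still compiles. */
    @Deprecated
    static final class LegacyInputFormatIO {
        private LegacyInputFormatIO() {}

        /** @deprecated use {@link UnifiedFormatIO#read(String)} instead. */
        @Deprecated
        static UnifiedFormatIO.Read read(String inputFormatClassName) {
            // No logic of its own: the alias just forwards to the new module.
            return UnifiedFormatIO.read(inputFormatClassName);
        }
    }

    @SuppressWarnings("deprecation")
    public static void main(String[] args) {
        String format = "org.apache.hadoop.mapreduce.lib.db.DBInputFormat";
        UnifiedFormatIO.Read viaOld = LegacyInputFormatIO.read(format);
        UnifiedFormatIO.Read viaNew = UnifiedFormatIO.read(format);
        // Both entry points should yield an equivalently configured read.
        System.out.println(
            viaOld.inputFormatClassName().equals(viaNew.inputFormatClassName()));
    }
}
```

With this shape, existing callers only see a deprecation warning, and removing
the alias later means deleting one class rather than migrating logic.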


On Thu, Sep 6, 2018 at 8:41 PM Lukasz Cwik <[email protected]> wrote:

> I think 4 is best for users since when a user comes from the Hadoop
> ecosystem, it is likely they are using many parts of Hadoop and would
> likely get value from having everything together. My concern with 4 is
> whether a single Hadoop package would be overwhelming from a dependencies
> point of view.
>
> From my experience with the google-cloud-platform IO package, it is not
> easy to handle this problem with so many different package versions and
> libraries. If we can't do that, then the next best thing for me would be
> 2 or 3.
>
> On Thu, Sep 6, 2018 at 10:22 AM Chamikara Jayalath <[email protected]>
> wrote:
>
>> I'd vote for (1).
>>
>> For most of the IO modules, it makes sense to develop and keep read and
>> write parts together given that they usually connect to the same datastore.
>> But hadoop-input-format and hadoop-output-format are simply a level of
>> indirection to connect to various data stores supported by Hadoop. Also,
>> hadoop-format is probably not a common term in the Hadoop ecosystem?
>>
>> hadoop-file-system is a FileSystem, not a source/sink, so it makes sense to
>> keep it separate. It also looks like we already have connectors for other
>> products from the Hadoop ecosystem as separate modules.
>>
>> Regarding breaking changes, I think for IOs it's better to make old
>> classes proxies and keep them around (and deprecated) to not break users if
>> we decide to take that route.  For any non-experimental code we'll have to
>> keep old classes around till Beam 3.0.
>>
>> Thanks,
>> Cham
>>
>> On Thu, Sep 6, 2018 at 8:24 AM Alexey Romanenko <[email protected]>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I’d like to discuss the following topic (see below) with the community, since
>>> the optimal solution is not clear to me.
>>>
>>> There is a Java IO module called “*hadoop-input-format*”, which allows
>>> using MapReduce InputFormat implementations to read data from different
>>> sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>> As its name suggests, it has only a “Read” part and is missing a “Write”
>>> part, so I'm working on “*hadoop-output-format*” to support MapReduce
>>> OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>> this I created another module with this name. So, in the end, we will have
>>> two different modules, “*hadoop-input-format*” and
>>> “*hadoop-output-format*”, which looks quite strange to me since, afaik,
>>> every existing Java IO that we have encapsulates the Read and Write parts
>>> in one module. Additionally, we have “*hadoop-common*” and
>>> “*hadoop-file-system*” as other Hadoop-related modules.
>>>
>>> Now I’m thinking about how best to organise all these Hadoop
>>> modules. There are several options in my mind:
>>>
>>> 1) Add a new module “*hadoop-output-format*” and leave all Hadoop modules
>>> “as is”.
>>> Pros: no breaking changes, no additional work
>>> Cons: not logical for users to have the same IO in two different modules
>>> and with different names.
>>>
>>> 2) Merge “*hadoop-input-format*” and “*hadoop-output-format*” into one
>>> module called, say, “*hadoop-format*” or “*hadoop-mapreduce-format*”, and
>>> keep the other Hadoop modules “as is”.
>>> Pros: InputFormat/OutputFormat live in one IO module, which is logical
>>> for users
>>> Cons: breaking changes for user code because of module/IO renaming
>>>
>>> 3) Add a new module “*hadoop-format*” (or “*hadoop-mapreduce-format*”)
>>> which will include the new “write” functionality and be a proxy for the old
>>> “*hadoop-input-format*”. In turn, “*hadoop-input-format*” should
>>> become deprecated and eventually be moved into the common “*hadoop-format*”
>>> module in future releases. Keep the other Hadoop modules “as is”.
>>> Pros: eventually there will be only one module for the Hadoop MR format;
>>> changes are less painful for users
>>> Cons: hidden difficulties in implementing this strategy; a bit
>>> confusing for users
>>>
>>> 4) Add a new module “*hadoop*” and move all existing modules there
>>> as submodules (like we have for “*io/google-cloud-platform*”), and merge
>>> “*hadoop-input-format*” and “*hadoop-output-format*” into one module.
>>> Pros: unification of all Hadoop-related modules
>>> Cons: breaking changes for user code, additional complexity with deps
>>> and testing
>>>
>>> 5) Your suggestion?..
>>>
>>> My personal preference lies between 2 and 3 (if 3 is possible).
>>>
>>> I’m wondering whether there were similar situations in Beam before and how
>>> they were finally resolved. If so, we should probably handle this one in a
>>> similar way.
>>> Any suggestions, advice, or comments would be much appreciated.
>>>
>>> Thanks,
>>> Alexey
>>>
>>
