Re: [DISCUSS] Unification of Hadoop related IO modules

Andrew Pilloud Fri, 07 Sep 2018 10:13:48 -0700

+1 for option 3. That approach will keep the mapping clean if SQL supports
this IO. It would be good to put the proxy in the old module and move the
implementation now. That way the old module can be easily deleted when the
time comes.
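
A minimal sketch of that arrangement, assuming hypothetical stand-in classes rather than the real Beam API: the implementation lives in the new class, and the old class is only a deprecated proxy that forwards to it. The names HadoopFormatIO and HadoopInputFormatIO follow the names used in this thread; the read(Supplier) signature and the Read type are invented for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for the IO class in the new "hadoop-format" module.
// The implementation is moved here.
final class HadoopFormatIO {
    private HadoopFormatIO() {}

    // Illustrative signature only; the real IO would be configured differently.
    static <T> Read<T> read(Supplier<List<T>> source) {
        return new Read<>(source);
    }

    static final class Read<T> {
        private final Supplier<List<T>> source;

        Read(Supplier<List<T>> source) {
            this.source = source;
        }

        // Returns a copy of whatever the source currently provides.
        List<T> records() {
            return new ArrayList<>(source.get());
        }
    }
}

// Hypothetical stand-in for the class left in the old "hadoop-input-format"
// module. It holds no logic of its own, so the module can be deleted later.
@Deprecated
final class HadoopInputFormatIO {
    private HadoopInputFormatIO() {}

    /** @deprecated use {@link HadoopFormatIO#read(Supplier)} instead. */
    @Deprecated
    static <T> HadoopFormatIO.Read<T> read(Supplier<List<T>> source) {
        // Pure delegation: existing callers keep compiling and get the
        // same behaviour as callers of the new class.
        return HadoopFormatIO.read(source);
    }
}
```

Existing user code that calls the old entry point keeps working and only sees a deprecation warning, while new code can call the new class directly.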


Andrew

On Fri, Sep 7, 2018 at 6:15 AM Robert Bradshaw <[email protected]> wrote:

> OK, good, that's what I thought. So I stand by (3), which:
>
> 1) Cleans up the library for all future uses (hopefully the majority of
> all users :).
> 2) Is fully backwards compatible for existing users, minimizing
> disruption, and giving them time to migrate.
>
> On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko <[email protected]>
> wrote:
>
>> In the next release it will still be compatible because we keep the
>> “hadoop-input-format” module, but we deprecate it and propose using it
>> through the “hadoop-format” module and the proxy class HadoopFormatIO (or
>> HadoopMapReduceFormatIO, whatever we name it), which will provide
>> Write/Read functionality by using MapReduce InputFormat or OutputFormat
>> classes. Then, in releases after the next one, we can drop
>> “hadoop-input-format”, since it was deprecated and we gave users time to
>> move to the new API. I think this is the least painful way for users but
>> the most complicated for us if the final goal is to merge
>> “hadoop-input-format” and “hadoop-output-format” together.
>>
>> On 7 Sep 2018, at 13:45, Robert Bradshaw <[email protected]> wrote:
>>
>> Agree about not impacting users. Perhaps I misread (3), isn't it fully
>> backwards compatible as well?
>>
>> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> in order to limit the impact on existing users of the Beam 2.x series,
>>> I would go for (1).
>>>
>>> Regards
>>> JB
>>>
>>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>>> > Hello everyone,
>>> >
>>> > I’d like to discuss the following topic (see below) with the community,
>>> > since the optimal solution is not clear to me.
>>> >
>>> > There is a Java IO module called “/hadoop-input-format/”, which allows
>>> > using MapReduce InputFormat implementations to read data from
>>> > different sources (for example,
>>> > org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>> > As its name suggests, it has only a “Read” part and is missing the
>>> > “Write” part, so I'm working on “/hadoop-output-format/” to support
>>> > MapReduce OutputFormat (PR 6306
>>> > <https://github.com/apache/beam/pull/6306>). For this I created another
>>> > module with this name. So, in the end, we will have two different
>>> > modules, “/hadoop-input-format/” and “/hadoop-output-format/”, which
>>> > looks quite strange to me since, afaik, every existing Java IO that we
>>> > have encapsulates its Read and Write parts in one module. Additionally,
>>> > we have “/hadoop-common/” and “/hadoop-file-system/” as other
>>> > Hadoop-related modules.
>>> >
>>> > Now I’m thinking about how best to organise all these Hadoop
>>> > modules. I have several options in mind:
>>> >
>>> > 1) Add new module “/hadoop-output-format/” and leave all Hadoop modules
>>> > “as is”.
>>> > Pros: no breaking changes, no additional work
>>> > Cons: not logical for users to have the same IO in two different
>>> > modules with different names.
>>> >
>>> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
>>> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”;
>>> > keep the other Hadoop modules “as is”.
>>> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
>>> > for users
>>> > Cons: breaking changes for user code because of module/IO renaming
>>> >
>>> > 3) Add new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”)
>>> > which will include the new “write” functionality and be a proxy for the
>>> > old “/hadoop-input-format/”. In turn, “/hadoop-input-format/” should
>>> > become deprecated and finally be moved into the common “/hadoop-format/”
>>> > module in future releases. Keep the other Hadoop modules “as is”.
>>> > Pros: in the end there will be only one module for the Hadoop MR format;
>>> > changes are less painful for users
>>> > Cons: hidden difficulties in implementing this strategy; a bit
>>> > confusing for users
>>> >
>>> > 4) Add new module “/hadoop/” and move all existing modules there
>>> > as submodules (like we have for “/io/google-cloud-platform/”), merge
>>> > “/hadoop-input-format/” and “/hadoop-output-format/” into one module.
>>> > Pros: unification of all Hadoop-related modules
>>> > Cons: breaking changes for user code, additional complexity with deps
>>> > and testing
>>> >
>>> > 5) Your suggestion?..
>>> >
>>> > My personal preference lies between 2 and 3 (if 3 is possible).
>>> >
>>> > I’m wondering if there were similar situations in Beam before and how
>>> > they were finally resolved. If so, we should probably handle this one
>>> > in a similar way.
>>> > Any suggestions, advice, or comments would be much appreciated.
>>> >
>>> > Thanks,
>>> > Alexey
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> [email protected]
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>
>>
