Re: [DISCUSS] Unification of Hadoop related IO modules

Tim Fri, 07 Sep 2018 10:50:15 -0700

Another +1 for option 3 (and a preference for the HadoopFormatIO naming).

Thanks Alexey,


Tim


> On 7 Sep 2018, at 19:13, Andrew Pilloud <[email protected]> wrote:
> 
> +1 for option 3. That approach will keep the mapping clean if SQL supports 
> this IO. It would be good to put the proxy in the old module and move the 
> implementation now. That way the old module can be easily deleted when the 
> time comes.
> 
> Andrew
> 
>> On Fri, Sep 7, 2018 at 6:15 AM Robert Bradshaw <[email protected]> wrote:
>> OK, good, that's what I thought. So I stick by (3), which:
>> 
>> 1) Cleans up the library for all future uses (hopefully the majority of all 
>> users :). 
>> 2) Is fully backwards compatible for existing users, minimizing disruption, 
>> and giving them time to migrate. 
>> 
>>> On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko <[email protected]> 
>>> wrote:
>>> In the next release it will still be compatible, because we keep the module 
>>> “hadoop-input-format” but mark it deprecated and propose using it 
>>> through the module “hadoop-format” and the proxy class HadoopFormatIO (or 
>>> HadoopMapReduceFormatIO, whatever we name it), which will provide Write/Read 
>>> functionality by using MapReduce InputFormat or OutputFormat classes. 
>>> Then, in releases after the next one, we can drop “hadoop-input-format”, 
>>> since it will have been deprecated and users will have had time to move to 
>>> the new API. I think this is the least painful way for users but the most 
>>> complicated for us, if the final goal is to merge “hadoop-input-format” 
>>> and “hadoop-output-format” together.
>>> 
>>>> On 7 Sep 2018, at 13:45, Robert Bradshaw <[email protected]> wrote:
>>>> 
>>>> Agree about not impacting users. Perhaps I misread (3), isn't it fully 
>>>> backwards compatible as well? 
>>>> 
>>>> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <[email protected]> 
>>>> wrote:
>>>>> Hi,
>>>>> 
>>>>> in order to limit the impact for the existing users on Beam 2.x series,
>>>>> I would go for (1).
>>>>> 
>>>>> Regards
>>>>> JB
>>>>> 
>>>>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>>>>> > Hello everyone,
>>>>> > 
>>>>> > I’d like to discuss the following topic (see below) with the community,
>>>>> > since the optimal solution is not clear to me.
>>>>> > 
>>>>> > There is a Java IO module called “/hadoop-input-format/”, which allows
>>>>> > using MapReduce InputFormat implementations to read data from different
>>>>> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>>> > As its name suggests, it has only the “Read” part and is missing “Write”,
>>>>> > so I'm working on “/hadoop-output-format/” to support MapReduce
>>>>> > OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>>>> > this I created another module with this name. So, in the end, we will
>>>>> > have two different modules, “/hadoop-input-format/” and
>>>>> > “/hadoop-output-format/”, which looks quite strange to me since, afaik,
>>>>> > every existing Java IO that we have encapsulates the Read and Write
>>>>> > parts in one module. Additionally, we have “/hadoop-common/” and
>>>>> > “/hadoop-file-system/” as other Hadoop-related modules. 
>>>>> > 
>>>>> > Now I’m thinking about how best to organise all these Hadoop
>>>>> > modules. There are several options in my mind: 
>>>>> > 
>>>>> > 1) Add a new module “/hadoop-output-format/” and leave all Hadoop
>>>>> > modules “as is”. 
>>>>> > Pros: no breaking changes, no additional work 
>>>>> > Cons: not logical for users to have the same IO in two different modules
>>>>> > with different names.
>>>>> > 
>>>>> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
>>>>> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”;
>>>>> > keep the other Hadoop modules “as is”.
>>>>> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
>>>>> > for users
>>>>> > Cons: breaking changes for user code because of module/IO renaming 
>>>>> > 
>>>>> > 3) Add a new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”)
>>>>> > which will include the new “write” functionality and be a proxy for the
>>>>> > old “/hadoop-input-format/”. In turn, “/hadoop-input-format/” should
>>>>> > become deprecated and finally be moved into the common “/hadoop-format/”
>>>>> > module in future releases. Keep the other Hadoop modules “as is”.
>>>>> > Pros: eventually there will be only one module for the Hadoop MR format;
>>>>> > changes are less painful for users
>>>>> > Cons: hidden difficulties in implementing this strategy; a bit
>>>>> > confusing for users 
>>>>> > 
>>>>> > 4) Add a new module “/hadoop/” and move all existing modules there
>>>>> > as submodules (like we have for “/io/google-cloud-platform/”); merge
>>>>> > “/hadoop-input-format/” and “/hadoop-output-format/” into one module. 
>>>>> > Pros: unification of all Hadoop-related modules
>>>>> > Cons: breaking changes for user code, additional complexity with deps
>>>>> > and testing
>>>>> > 
>>>>> > 5) Your suggestion?..
>>>>> > 
>>>>> > My personal preference lies between 2 and 3 (if 3 is possible). 
>>>>> > 
>>>>> > I’m wondering if there were similar situations in Beam before and how
>>>>> > they were finally resolved. If so, we should probably handle this one in
>>>>> > a similar way.
>>>>> > Any suggestions/advice/comments would be much appreciated.
>>>>> > 
>>>>> > Thanks,
>>>>> > Alexey
>>>>> 
>>>>> -- 
>>>>> Jean-Baptiste Onofré
>>>>> [email protected]
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
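
For readers following the thread, here is a minimal Java sketch of the
deprecate-and-delegate idea behind option 3, in the variant Andrew describes
(implementation lives in the new module, the old entry point is a thin proxy).
It deliberately uses no Beam or Hadoop types. The names UnifiedFormatIO and
LegacyInputFormatIO, and the Supplier/Consumer stand-ins for a source and a
sink, are hypothetical and only illustrate the pattern; this is not the actual
HadoopFormatIO API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;

public class HadoopFormatProxySketch {

    /** Hypothetical stand-in for the new unified module: holds both read and write. */
    public static final class UnifiedFormatIO {
        private UnifiedFormatIO() {}

        /** Reads all records from a source (a Supplier stands in for an InputFormat). */
        public static <T> List<T> read(Supplier<List<T>> source) {
            return new ArrayList<>(source.get());
        }

        /** Writes records to a sink (a Consumer stands in for an OutputFormat). */
        public static <T> int write(List<T> records, Consumer<T> sink) {
            records.forEach(sink);
            return records.size();
        }
    }

    /**
     * Hypothetical stand-in for the old read-only module. It keeps its public
     * signature so existing callers compile unchanged, but only delegates.
     */
    @Deprecated
    public static final class LegacyInputFormatIO {
        private LegacyInputFormatIO() {}

        public static <T> List<T> read(Supplier<List<T>> source) {
            // No logic of its own: deleting this class later removes no behaviour.
            return UnifiedFormatIO.read(source);
        }
    }

    public static void main(String[] args) {
        Supplier<List<String>> source = () -> List.of("a", "b");

        // Existing user code keeps calling the old entry point...
        List<String> viaOld = LegacyInputFormatIO.read(source);
        // ...while new code uses the unified one and also gets write support.
        List<String> viaNew = UnifiedFormatIO.read(source);

        List<String> sink = new ArrayList<>();
        int written = UnifiedFormatIO.write(viaNew, sink::add);

        System.out.println(viaOld.equals(viaNew) + " " + written);
    }
}
```

Because the deprecated class contains no logic of its own, dropping it in a
later release only removes an entry point, which is the property the +1s for
option 3 rely on.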
