Another +1 for option 3 (with a preference for the HadoopFormatIO name). Thanks Alexey,
Tim

> On 7 Sep 2018, at 19:13, Andrew Pilloud <apill...@google.com> wrote:
>
> +1 for option 3. That approach will keep the mapping clean if SQL supports
> this IO. It would be good to put the proxy in the old module and move the
> implementation now. That way the old module can be easily deleted when the
> time comes.
>
> Andrew
>
>> On Fri, Sep 7, 2018 at 6:15 AM Robert Bradshaw <rober...@google.com> wrote:
>> OK, good, that's what I thought. So I stick by (3), which
>>
>> 1) Cleans up the library for all future uses (hopefully the majority of all
>> users :).
>> 2) Is fully backwards compatible for existing users, minimizing disruption,
>> and giving them time to migrate.
>>
>>> On Fri, Sep 7, 2018 at 2:51 PM Alexey Romanenko <aromanenko....@gmail.com>
>>> wrote:
>>> In the next release it will still be compatible because we keep the module
>>> “hadoop-input-format”, but we deprecate it and propose to use it
>>> through the module “hadoop-format” and the proxy class HadoopFormatIO (or
>>> HadoopMapReduceFormatIO, whatever we name it), which will provide Write/Read
>>> functionality by using MapReduce InputFormat or OutputFormat classes.
>>> Then, in future releases after the next one, we can drop “hadoop-input-format”
>>> since it was deprecated and we provided time to move to the new API. I think
>>> this is the least painful way for users but the most complicated for us if
>>> the final goal is to merge “hadoop-input-format” and “hadoop-output-format”
>>> together.
>>>
>>>> On 7 Sep 2018, at 13:45, Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> Agree about not impacting users. Perhaps I misread (3), isn't it fully
>>>> backwards compatible as well?
>>>>
>>>> On Fri, Sep 7, 2018 at 1:33 PM Jean-Baptiste Onofré <j...@nanthrax.net>
>>>> wrote:
>>>>> Hi,
>>>>>
>>>>> in order to limit the impact for the existing users on Beam 2.x series,
>>>>> I would go for (1).
>>>>>
>>>>> Regards
>>>>> JB
>>>>>
>>>>> On 06/09/2018 17:24, Alexey Romanenko wrote:
>>>>> > Hello everyone,
>>>>> >
>>>>> > I’d like to discuss the following topic (see below) with the community
>>>>> > since the optimal solution is not clear to me.
>>>>> >
>>>>> > There is a Java IO module called “/hadoop-input-format/”, which allows
>>>>> > using MapReduce InputFormat implementations to read data from different
>>>>> > sources (for example, org.apache.hadoop.mapreduce.lib.db.DBInputFormat).
>>>>> > As its name suggests, it has only “Read” and is missing the “Write” part,
>>>>> > so I'm working on “/hadoop-output-format/” to support MapReduce
>>>>> > OutputFormat (PR 6306 <https://github.com/apache/beam/pull/6306>). For
>>>>> > this I created another module with this name. So, in the end, we will
>>>>> > have two different modules, “/hadoop-input-format/” and
>>>>> > “/hadoop-output-format/”, and that looks quite strange to me since, afaik,
>>>>> > every existing Java IO that we have encapsulates the Read and Write parts
>>>>> > in one module. Additionally, we have “/hadoop-common/” and
>>>>> > “/hadoop-file-system/” as other hadoop-related modules.
>>>>> >
>>>>> > Now I’m thinking about how best to organise all these Hadoop
>>>>> > modules. There are several options in my mind:
>>>>> >
>>>>> > 1) Add new module “/hadoop-output-format/” and leave all Hadoop modules
>>>>> > “as is”.
>>>>> > Pros: no breaking changes, no additional work
>>>>> > Cons: not logical for users to have the same IO in two different modules
>>>>> > and with different names.
>>>>> >
>>>>> > 2) Merge “/hadoop-input-format/” and “/hadoop-output-format/” into one
>>>>> > module called, say, “/hadoop-format/” or “/hadoop-mapreduce-format/”,
>>>>> > keep the other Hadoop modules “as is”.
>>>>> > Pros: InputFormat/OutputFormat live in one IO module, which is logical
>>>>> > for users
>>>>> > Cons: breaking changes for user code because of module/IO renaming
>>>>> >
>>>>> > 3) Add new module “/hadoop-format/” (or “/hadoop-mapreduce-format/”)
>>>>> > which will include the new “write” functionality and be a proxy for the
>>>>> > old “/hadoop-input-format/”. In turn, “/hadoop-input-format/” should
>>>>> > become deprecated and eventually be moved into the common
>>>>> > “/hadoop-format/” module in future releases. Keep the other Hadoop
>>>>> > modules “as is”.
>>>>> > Pros: eventually there will be only one module for hadoop MR format;
>>>>> > changes are less painful for users
>>>>> > Cons: hidden difficulties in implementing this strategy; a bit
>>>>> > confusing for users
>>>>> >
>>>>> > 4) Add new module “/hadoop/” and move all existing modules there
>>>>> > as submodules (like we have for “/io/google-cloud-platform/”), merge
>>>>> > “/hadoop-input-format/” and “/hadoop-output-format/” into one module.
>>>>> > Pros: unification of all hadoop-related modules
>>>>> > Cons: breaking changes for user code, additional complexity with deps
>>>>> > and testing
>>>>> >
>>>>> > 5) Your suggestion?
>>>>> >
>>>>> > My personal preference lies between 2 and 3 (if 3 is possible).
>>>>> >
>>>>> > I’m wondering if there were similar situations in Beam before and how
>>>>> > they were finally resolved. If so, we should probably do the same here.
>>>>> > Any suggestions, advice, or comments would be much appreciated.
>>>>> >
>>>>> > Thanks,
>>>>> > Alexey
>>>>>
>>>>> --
>>>>> Jean-Baptiste Onofré
>>>>> jbono...@apache.org
>>>>> http://blog.nanthrax.net
>>>>> Talend - http://www.talend.com
>>>
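
For readers following the thread, here is a minimal sketch of the deprecate-and-delegate pattern that option 3 relies on, in the variant Andrew suggests (implementation moved to the new module, a thin proxy left in the old one). It is plain Java with no Beam dependency. FormatIO, LegacyInputFormatIO, Transform and ProxySketchDemo are hypothetical stand-ins invented for illustration; they are not the actual Beam classes or signatures, and the real IO would return Beam transforms rather than a simple value object.

```java
import java.util.Objects;

/** Hypothetical stand-in for the unified IO proposed in the thread (new module). */
final class FormatIO {
    private FormatIO() {}

    /** Describes a read that would be backed by a MapReduce InputFormat class. */
    static Transform read(String inputFormatClass) {
        return new Transform("read", inputFormatClass);
    }

    /** Describes a write that would be backed by a MapReduce OutputFormat class. */
    static Transform write(String outputFormatClass) {
        return new Transform("write", outputFormatClass);
    }

    /** Minimal value object standing in for a real transform. */
    static final class Transform {
        final String kind;
        final String formatClass;

        Transform(String kind, String formatClass) {
            this.kind = Objects.requireNonNull(kind);
            this.formatClass = Objects.requireNonNull(formatClass);
        }

        @Override
        public String toString() {
            return kind + "(" + formatClass + ")";
        }
    }
}

/** Hypothetical stand-in for the old read-only IO, kept in the deprecated module. */
@Deprecated
final class LegacyInputFormatIO {
    private LegacyInputFormatIO() {}

    /** Same entry point as before; all work is forwarded to the new class. */
    @Deprecated
    static FormatIO.Transform read(String inputFormatClass) {
        return FormatIO.read(inputFormatClass);
    }
}

final class ProxySketchDemo {
    public static void main(String[] args) {
        // Existing user code keeps compiling, now with a deprecation warning.
        System.out.println(
            LegacyInputFormatIO.read("org.apache.hadoop.mapreduce.lib.db.DBInputFormat"));
        // New code uses the unified entry point for both directions.
        System.out.println(
            FormatIO.write("org.apache.hadoop.mapreduce.lib.db.DBOutputFormat"));
    }
}
```

Because the old class only forwards calls, removing the deprecated module in a later release would not touch any implementation code, which is the property the thread is after.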