Have you considered expanding TextIO to support an arbitrary delimiter
instead of defining MultiLineIO?
https://github.com/apache/incubator-beam/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/TextIO.java#L737

TextIO currently splits on '\n', '\r\n', or '\r'. It seems as though having
it split on any arbitrary delimiter would be useful.
Note that even though TextIO implies it's used for strings, this is not
necessarily required since a user can use any coder to decode the bytes
between two delimiters.
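
To make the suggestion concrete, here is a rough sketch in plain Java of what
"split on an arbitrary delimiter, then decode the bytes in between" could look
like. It is only an illustration: the class DelimitedRecords, its read method,
and the Function used as a stand-in for a coder are all hypothetical names, not
part of TextIO or the SDK.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper for illustration only; not part of TextIO or the SDK.
final class DelimitedRecords {

  // Reads the stream, cuts it at every occurrence of `delimiter`, and hands the
  // bytes of each record to `decoder` (an assumed stand-in for a coder).
  static <T> List<T> read(InputStream in, byte[] delimiter, Function<byte[], T> decoder) {
    if (delimiter.length == 0) {
      throw new IllegalArgumentException("delimiter must not be empty");
    }
    byte lastDelimiterByte = delimiter[delimiter.length - 1];
    List<T> records = new ArrayList<>();
    ByteArrayOutputStream current = new ByteArrayOutputStream();
    try {
      int b;
      while ((b = in.read()) != -1) {
        current.write(b);
        // Only do the suffix comparison when the delimiter could end here.
        if ((byte) b == lastDelimiterByte) {
          byte[] soFar = current.toByteArray();
          if (endsWith(soFar, delimiter)) {
            records.add(decoder.apply(Arrays.copyOf(soFar, soFar.length - delimiter.length)));
            current.reset();
          }
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    // Bytes after the final delimiter form a last, unterminated record.
    if (current.size() > 0) {
      records.add(decoder.apply(current.toByteArray()));
    }
    return records;
  }

  private static boolean endsWith(byte[] data, byte[] suffix) {
    if (data.length < suffix.length) {
      return false;
    }
    int offset = data.length - suffix.length;
    for (int i = 0; i < suffix.length; i++) {
      if (data[offset + i] != suffix[i]) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    byte[] input = "first\nrecord||second||third".getBytes(StandardCharsets.UTF_8);
    List<String> records = read(
        new ByteArrayInputStream(input),
        "||".getBytes(StandardCharsets.UTF_8),
        bytes -> new String(bytes, StandardCharsets.UTF_8));
    // Newlines inside a record are preserved; only "||" ends a record.
    for (String record : records) {
      System.out.println("[" + record + "]");
    }
  }
}
```

The sketch favors clarity over speed, and it ignores the harder part of a real
file-based source, namely splitting a file into ranges and finding the first
record boundary inside each range.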

On Thu, Mar 17, 2016 at 12:53 AM, Dan Halperin <[email protected]>
wrote:

> Hi Peter,
>
> Echoing Eugene's and JB's thoughts -- we'd love a PR!
>
> I also wanted to say: we've hit you with a lot of recommendations in this
> email thread. If you have any questions, you can ask us here -- but we'll
> of course be happy to answer them during code review as well. Do not feel
> that meeting all of these criteria is a prerequisite for opening a Pull
> Request -- we may just give you feedback and ask for changes before merging
> :).
>
> Thanks!
> Dan
>
> On Mon, Mar 14, 2016 at 12:27 PM, Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Yes, you already use the "new style" as you use BoundedSource.
> >
> > Regards
> > JB
> >
> >
> > On 03/14/2016 08:08 PM, Giesin, Peter wrote:
> >
> >> The MultiLineIO is a BoundedSource and an extension of FileBasedSource.
> >> Where the FileBasedSource reads a single line at a time, the MultiLineIO
> >> allows the user to define an arbitrary “message” delimiter. It then reads
> >> through the file, removing newlines, until the separator is read, finally
> >> returning the character sequence that is built.
> >>
> >>
> >>
> >> I believe it is already built using the new style but I will compare it
> >> to the BigTableIO to confirm that.
> >>
> >> Peter
> >>
> >> On 3/14/16, 1:50 PM, "Jean-Baptiste Onofré" <[email protected]> wrote:
> >>
> >>> I second Eugene here.
> >>>
> >>> In the past, I developed some IOs using the "old style" (as was done in
> >>> PubSubIO). I'm now refactoring them to use the "new style".
> >>>
> >>> Regards
> >>> JB
> >>>
> >>> On 03/14/2016 06:47 PM, Eugene Kirpichov wrote:
> >>>
> >>>> Hi Peter,
> >>>> Looking forward to your PR. Please note that source classes are
> >>>> relatively tricky to develop, so would you mind briefly explaining what
> >>>> your source will do here over email, so that we hash out some possible
> >>>> issues early rather than in PR comments?
> >>>> Also note that we now recommend packaging IO connectors as PTransforms,
> >>>> making the PTransform class itself be a builder, while the Source/Sink
> >>>> classes should be kept package-private (rather than exposed to the
> >>>> user).
> >>>> For an example of a connector packaged in this style, see BigtableIO (
> >>>> https://github.com/GoogleCloudPlatform/DataflowJavaSDK/blob/master/sdk/src/main/java/com/google/cloud/dataflow/sdk/io/bigtable/BigtableIO.java
> >>>> ).
> >>>> The advantage is that this style allows you to restructure the
> >>>> connector or add additional transforms into its implementation if
> >>>> necessary, without changing the call sites. It might seem less important
> >>>> in the case of a simple connector like reading lines from a file, but it
> >>>> will become much more important with things like SplittableDoFn
> >>>> <https://issues.apache.org/jira/browse/BEAM-65>.
> >>>>
> >>>> On Mon, Mar 14, 2016 at 10:29 AM Jean-Baptiste Onofré <[email protected]>
> >>>> wrote:
> >>>>
> >>>>> Hi Peter,
> >>>>>
> >>>>> Awesome!
> >>>>>
> >>>>> Yes, you can create the PR using the GitHub mirror.
> >>>>>
> >>>>> Does your MultiLineIO use the "new" Bounded/Unbounded classes?
> >>>>>
> >>>>> Regards
> >>>>> JB
> >>>>>
> >>>>> On 03/14/2016 06:23 PM, Giesin, Peter wrote:
> >>>>>
> >>>>>> Hi all!
> >>>>>>
> >>>>>> I am looking to get involved in the project. I have a MultiLineIO
> >>>>>> file-based source that I think would be useful. I know the project is
> >>>>>> just spinning up, but can I simply clone the repo and create a PR for
> >>>>>> the new IO? I also looked over JIRA and there are some tickets I can
> >>>>>> help out with.
> >>>>>>
> >>>>>> Best regards,
> >>>>>> Peter Giesin
> >>>>>> [email protected]
> >>>>>>
> >>>>>>
> >>>>>> _____________
> >>>>>> The information contained in this message is proprietary and/or
> >>>>>>
> >>>>> confidential. If you are not the intended recipient, please: (i)
> >>>>> delete the
> >>>>> message and all copies; (ii) do not disclose, distribute or use the
> >>>>> message
> >>>>> in any manner; and (iii) notify the sender immediately. In addition,
> >>>>> please
> >>>>> be aware that any message addressed to our domain is subject to
> >>>>> archiving
> >>>>> and review by persons other than the intended recipient. Thank you.
> >>>>>
> >>>>>>
> >>>>>>
> >>>>> --
> >>>>> Jean-Baptiste Onofré
> >>>>> [email protected]
> >>>>>
> >>>>> http://blog.nanthrax.net
> >>>>> Talend - http://www.talend.com
> >>>>>
> >>>>>
> >>>>
> >>> --
> >>> Jean-Baptiste Onofré
> >>> [email protected]
> >>>
> >>> http://blog.nanthrax.net
> >>> Talend - http://www.talend.com
> >>>
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
