Re: PCollection<KV> to PCollection<String> Conversion

Dan Halperin Thu, 29 Dec 2016 13:47:12 -0800

On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <[email protected]>
wrote:


> I prefer JB's take. I think there should be three overloaded methods on the
> class. I like Vikas' name ToString. The methods for a simple conversion
> should be:
>
> ToString.strings() - Outputs the .toString() of the objects in the
> PCollection
> ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
> etc with the delimiter between every entry
> ToString.formatted(String format) - Outputs the formatted
> <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html>
> string
> with the object passed in. For objects made up of different parts like KVs,
> each one is passed in as separate toString() of a varargs.
>

Riffing a little, with some types:

ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
that takes in a T and outputs T.toString().

ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
equivalent to a ParDo that takes in a KV<K,V> and outputs
kv.getKey().toString() + delimiter + kv.getValue().toString()

ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
String> that is equivalent to a ParDo that takes in an Iterable<T> and
outputs the iterable[0] + delimiter + iterable[1] + delimiter + ... +
delimiter + iterable[N-1]

ToString.<T>custom(SerializableFunction<T, String> formatter) ?

The last one is just MapElements.via, except you don't need to set the
output type.
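
For concreteness, the per-element logic of the four variants above can be
sketched in plain Java, with no Beam dependency. This is only an illustration
under stated assumptions: ToStringSketch and its method bodies are
hypothetical, java.util.Map.Entry stands in for Beam's KV, and
java.util.function.Function stands in for SerializableFunction. A real
transform would wrap each function in a ParDo or MapElements.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;
import java.util.function.Function;

/** Hypothetical plain-Java sketch of the per-element functions; not Beam API. */
class ToStringSketch {

  /** of(): element -> element.toString(). */
  static <T> Function<T, String> of() {
    return element -> element.toString();
  }

  /** kv(delimiter): key + delimiter + value. Map.Entry stands in for KV. */
  static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
    return entry -> entry.getKey().toString() + delimiter + entry.getValue().toString();
  }

  /** iterable(delimiter): elements joined by the delimiter, none trailing. */
  static <T> Function<Iterable<T>, String> iterable(String delimiter) {
    return elements -> {
      StringJoiner joiner = new StringJoiner(delimiter);
      for (T element : elements) {
        joiner.add(element.toString());
      }
      return joiner.toString();
    };
  }

  /** custom(formatter): the caller supplies the whole conversion. */
  static <T> Function<T, String> custom(Function<T, String> formatter) {
    return formatter;
  }

  public static void main(String[] args) {
    System.out.println(ToStringSketch.<Integer>of().apply(42)); // 42
    System.out.println(
        ToStringSketch.<String, Long>kv(":").apply(new SimpleEntry<>("ace", 4L))); // ace:4
    System.out.println(
        ToStringSketch.<String>iterable(",")
            .apply(Arrays.asList("one", "two", "three"))); // one,two,three
    System.out.println(ToStringSketch.<String>custom(s -> "[" + s + "]").apply("x")); // [x]
  }
}
```

Note that custom() adds nothing over passing the lambda directly, which is
the sense in which it is "just MapElements.via".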

I don't see a way to make the generic .formatted() that you propose that
just works with anything "made of different parts".

I think adding too many methods beyond "of" and "custom" is opening
up a Pandora's box. The KV one might want to have left and right
delimiters, might want to take custom formatters for K and V, etc. The
iterable one might want to have a special configuration for an empty
iterable. So I'm inclined towards simplicity, with the awareness that
MapElements.via is just not that hard to use.
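
As a rough illustration of that point, the special cases listed here (left
and right delimiters around a KV, a placeholder for an empty collection) each
fit in a short lambda of the kind one would hand to MapElements.via. The
sketch is plain Java with no Beam dependency; CustomFormatters, bracketedKv
and joinOrDefault are hypothetical names, java.util.Map.Entry stands in for
KV, and a List is used instead of an arbitrary Iterable to keep the
emptiness check simple.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical one-off formatters; each is small enough to write inline. */
class CustomFormatters {

  /** KV with left and right delimiters, e.g. "(ace=4)". */
  static <K, V> Function<Map.Entry<K, V>, String> bracketedKv(
      String left, String separator, String right) {
    return entry -> left + entry.getKey() + separator + entry.getValue() + right;
  }

  /** Joins a list with a delimiter, or returns a placeholder when it is empty. */
  static <T> Function<List<T>, String> joinOrDefault(String delimiter, String whenEmpty) {
    return items ->
        items.isEmpty()
            ? whenEmpty
            : items.stream().map(Object::toString).collect(Collectors.joining(delimiter));
  }

  public static void main(String[] args) {
    System.out.println(
        CustomFormatters.<String, Long>bracketedKv("(", "=", ")")
            .apply(new SimpleEntry<>("ace", 4L))); // (ace=4)
    System.out.println(
        CustomFormatters.<String>joinOrDefault(",", "<empty>")
            .apply(Arrays.asList("one", "two"))); // one,two
    System.out.println(
        CustomFormatters.<String>joinOrDefault(",", "<empty>")
            .apply(Collections.<String>emptyList())); // <empty>
  }
}
```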

Dan


>
> I think doing these three methods would cover every simple and advanced
> "simple conversions." As JB says, we'll need other specific converters for
> other formats like XML.
>
> I'd really like to see this class in the next version of Beam. What does
> everyone think of the class name, methods name, and method operations so we
> can have Vikas finish up?
>
> Thanks,
>
> Jesse
>
> On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi Vikas,
> >
> > did you take a look at:
> >
> > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >
> > You can see KV2String and ToString could be part of this extension.
> > I'm also using JAXB for XML and Jackson for JSON
> > marshalling/unmarshalling. I'm planning to deal with Avro (IndexedRecord).
> >
> > Regards
> > JB
> >
> > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > Hi All,
> > >
> > > >   Not being aware of the discussion here, I sent out a PR
> > > > <https://github.com/apache/beam/pull/1704> but JB and others directed me
> > > > to this thread. Having converted PCollection<T> to PCollection<String>
> > > > several times, I feel something like a 'ToString' transform is common
> > > > enough to be part of the core. What do you all think?
> > > >
> > > > Also, if someone else is already working on or interested in tackling
> > > > this, then I am happy to discard the PR.
> > >
> > > Regards,
> > > Vikas
> > >
> > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <[email protected]>
> wrote:
> > >
> > > >> It seems that there were a lot of good points raised here, and I tend to
> > > >> agree that something as trivial and lean as "ToString" should be a part
> > > >> of core.
> > > >> I'm particularly fond of makeString(prefix, toString, suffix) in various
> > > >> combinations (Scala-like).
> > > >> For "fromString", I think JB has a good point leveraging JAXB and
> > > >> Jackson - though I think this should be in extensions as it is not as
> > > >> lean as toString.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <[email protected]
> >
> > >> wrote:
> > >>
> > >>> Hi Jesse,
> > >>>
> > >>> yes, I started something there (using JAXB and Jackson). Let me
> polish
> > >>> and push.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > >>>> I went through the string conversions. Do you have an example of
> > >> writing
> > >>>> out XML/JSON/etc too?
> > >>>>
> > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <
> [email protected]
> > >
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Jesse,
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >>>>>
> > >>>>> it's very simple and stupid and of course not complete at all (I have
> > >>>>> other commits but not merged as they need some polishing), but as I
> > >>>>> said, it's a base of discussion.
> > >>>>>
> > >>>>> Regards
> > >>>>> JB
> > >>>>>
> > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > >>>>>>
> > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <
> > >> [email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Good point Eugene.
> > >>>>>>>
> > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>>>>> extension). It's pretty stupid ;)
> > >>>>>>>
> > >>>>>>> But, you are right, depending on the direction of such an extension, it
> > >>> could
> > >>>>>>> cover more use cases (even if it's not my first intention ;)).
> > >>>>>>>
> > >>>>>>> Let me push the branch (pretty small) as an illustration, and in
> > the
> > >>>>>>> mean time, I'm preparing a document (more focused on the use
> > cases).
> > >>>>>>>
> > >>>>>>> WDYT ?
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>>>>> Hi JB,
> > >>>>>>>> Depending on the scope of what you want to ultimately accomplish
> > >> with
> > >>>>>>> this
> > >>>>>>>> extension, I think it may make sense to write a proposal
> document
> > >> and
> > >>>>>>>> discuss it.
> > >>>>>>>> If it's just a collection of utility DoFn's for various
> > >> well-defined
> > >>>>>>>> source/target format pairs, then that's probably not needed, but
> > if
> > >>>>> it's
> > >>>>>>>> anything more, then I think it is.
> > >>>>>>>> That will help avoid a lot of churn if people propose reasonable
> > >>>>>>>> significant changes.
> > >>>>>>>>
> > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <
> > >>> [email protected]
> > >>>>>>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch on my
> github
> > >>> and I
> > >>>>>>>>> will post on the dev mailing list when done.
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>> JB
> > >>>>>>>>>
> > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>>>>> I want to bring this thread back up since we've had time to
> > think
> > >>>>> about
> > >>>>>>>>> it
> > >>>>>>>>>> more and make a plan.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a format-specific converter will be more time
> consuming
> > >>> task
> > >>>>>>> than
> > >>>>>>>>>> we originally thought. It'd have to be a writer that takes
> > >> another
> > >>>>>>> writer
> > >>>>>>>>>> as a parameter.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we should start with a simple string converter and
> plan
> > >>> for a
> > >>>>>>>>>> format-specific writer.
> > >>>>>>>>>>
> > >>>>>>>>>> What are your thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Jesse
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <
> > >>>>> [email protected]
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I was thinking about what the outputs would look like last
> > >> night. I
> > >>>>>>>>>> realized that more complex formats like JSON and XML may or
> may
> > >> not
> > >>>>>>>>> output
> > >>>>>>>>>> the data in a valid format.
> > >>>>>>>>>>
> > >>>>>>>>>> Doing a direct conversion on unbounded collections would work
> > >> just
> > >>>>>>> fine.
> > >>>>>>>>>> They're self-contained. For writing out bounded collections,
> > >> that's
> > >>>>>>> where
> > >>>>>>>>>> we'll hit the issues. This changes the uber conversion
> transform
> > >>>>> into a
> > >>>>>>>>>> transform that needs to be a writer.
> > >>>>>>>>>>
> > >>>>>>>>>> If a transform executes a JSON conversion on a per element
> > basis,
> > >>>>> we'd
> > >>>>>>>>> get
> > >>>>>>>>>> this:
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>>
> > >>>>>>>>>> That isn't valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> The conversion transform would need to do several things
> > >> when
> > >>>>>>>>> writing
> > >>>>>>>>>> out a file. It would need to add brackets for an array. Now we
> > >>> have:
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> We still don't have valid JSON. We have to remove the last
> comma
> > >> or
> > >>>>>>> have
> > >>>>>>>>>> the uber transform start putting in the commas, except for the
> > >> last
> > >>>>>>>>> element.
> > >>>>>>>>>>
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>> "key": "value"
> > >>>>>>>>>> }
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> Only by doing this do we have valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers
> > >> require
> > >>> a
> > >>>>>>> root
> > >>>>>>>>>> element for everything. The uber transform would have to put
> the
> > >>> root
> > >>>>>>>>>> element tags at the beginning and end of the file.
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <
> > >>> [email protected]>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I would love to see a lean core and abundant Transforms at the
> > >> same
> > >>>>>>> time.
> > >>>>>>>>>>
> > >>>>>>>>>> Maybe we can look at what Confluent <
> > >>> https://github.com/confluentinc
> > >>>>>>
> > >>>>>>>>> does
> > >>>>>>>>>> for kafka-connect. They have official extensions support for
> > >> JDBC,
> > >>>>> HDFS
> > >>>>>>>>> and
> > >>>>>>>>>> ElasticSearch under https://github.com/confluentinc. They put
> > >> them
> > >>>>>>> along
> > >>>>>>>>>> with other community extensions on
> > >>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>>>>
> > >>>>>>>>>> Although not a commercial company, can we have a GitHub user
> > like
> > >>>>>>>>>> beam-community to host projects we build around beam but not
> > >>> suitable
> > >>>>>>> for
> > >>>>>>>>>> https://github.com/apache/incubator-beam. In the future, we
> may
> > >>> have
> > >>>>>>>>>> beam-algebra like http://github.com/twitter/algebird for
> > algebra
> > >>>>>>>>> operations
> > >>>>>>>>>> and beam-ml / beam-dl for machine learning / deep learning.
> > Also,
> > >>>>> there
> > >>>>>>>>>> will be beam related projects elsewhere maintained by
> other
> > >>>>>>>>>> communities. We can put all of them on the beam-website or
> like
> > >>> spark
> > >>>>>>>>>> packages as mentioned by Amit.
> > >>>>>>>>>>
> > >>>>>>>>>> My $0.02
> > >>>>>>>>>> Manu
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles
> > >>>>> <[email protected]
> > >>>>>>>>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit
> > >>> from a
> > >>>>>>>>> place
> > >>>>>>>>>>> for miscellaneous non-core helper transformations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>>>>> artifacts.
> > >>>>>>>>> I
> > >>>>>>>>>>> think that is fine, considering the nature of Join and
> > >> SortValues.
> > >>>>> But
> > >>>>>>>>> for
> > >>>>>>>>>>> simpler transforms, importing one artifact per tiny transform
> > is
> > >>> too
> > >>>>>>>>> much
> > >>>>>>>>>>> overhead. It also seems unlikely that we will have enough
> > >>>>> commonality
> > >>>>>>>>>> among
> > >>>>>>>>>>> the transforms to call the artifact anything other than [some
> > >>>>> synonym
> > >>>>>>>>> for]
> > >>>>>>>>>>> "miscellaneous".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> many
> > >>>>>>>>>> transforms*
> > >>>>>>>>>>> that are not required for the model [1], I like that the SDK
> > >>>>> artifact
> > >>>>>>>>> has
> > >>>>>>>>>>> everything a user might need in their "getting started" phase
> > of
> > >>>>> use.
> > >>>>>>>>> This
> > >>>>>>>>>>> user-friendliness (the user doesn't care that ParDo is core
> and
> > >>> Sum
> > >>>>> is
> > >>>>>>>>>> not)
> > >>>>>>>>>>> plus the difficulty of judging which transforms go where, are
> > >>>>> probably
> > >>>>>>>>> why
> > >>>>>>>>>>> we have them mostly all in one place.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's
> > >> PiggyBank
> > >>>>> and
> > >>>>>>>>>>> Apex's Malhar. These have different levels of support
> implied.
> > >>>>> Others?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Kenn
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> > >>>>> Filter,
> > >>>>>>>>>>> FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > >>> Values,
> > >>>>>>>>>> KvSwap,
> > >>>>>>>>>>> Partition, Regex, Sample, Sum, Top, Values, WithKeys,
> > >>> WithTimestamps
> > >>>>>>>>>>>
> > >>>>>>>>>>> * at least they are separate classes and not methods on
> > >>> PCollection
> > >>>>>>> :-)
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <
> > [email protected]
> > >>>
> > >>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject
> > >>> back.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for
> those
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> that are not core enough to be part of the sdk, but that we
> > all
> > >>> end
> > >>>>>>> up
> > >>>>>>>>>>>> re-writing somehow.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a needed improvement to be more developer friendly,
> > but
> > >>>>> also
> > >>>>>>> as
> > >>>>>>>>>> a
> > >>>>>>>>>>>> reference of good practices of Beam development, and for
> this
> > >>>>> reason
> > >>>>>>> I
> > >>>>>>>>>>>> agree with JB that at this moment it would be better for
> these
> > >>>>>>>>>> transforms
> > >>>>>>>>>>>> to reside in the Beam repository at least for visibility
> > >> reasons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> One additional question is if these transforms represent a
> > >>>>> different
> > >>>>>>>>> DSL
> > >>>>>>>>>>> or
> > >>>>>>>>>>>> if those could be grouped with the current extensions (e.g.
> > >> Join
> > >>>>> and
> > >>>>>>>>>>>> SortValues) into something more general that we as a
> community
> > >>>>> could
> > >>>>>>>>>>>> maintain, but well even if it is not the case, it would be
> > >> really
> > >>>>>>> nice
> > >>>>>>>>>> to
> > >>>>>>>>>>>> start working on something like this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ismaël Mejía
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <
> > >>>>>>> [email protected]
> > >>>>>>>>>>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it
> makes
> > >>> sense
> > >>>>>>>>>>>>> directly in the core.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
> > >>> technical
> > >>>>>>>>>>> vision
> > >>>>>>>>>>>>> document.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>> JB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while
> > Luke's
> > >>> and
> > >>>>>>>>>>>>>> Kenneth's
> > >>>>>>>>>>>>>> worries about committing users to specific implementations
> > is
> > >>> in
> > >>>>>>>>>>> place.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
> > >>>>> libraries
> > >>>>>>>>>>> that
> > >>>>>>>>>>>>>> for various reasons are not a part of the Apache Spark
> > >> project:
> > >>>>>>>>>>>>>> https://spark-packages.org/.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both
> > >> users
> > >>>>>>> quick
> > >>>>>>>>>>>>>> ramp-up and ease-of-use while keeping Beam more
> "enabling" ?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles
> > >>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to
> have
> > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly
> > >>> indicate
> > >>>>>>> its
> > >>>>>>>>>>>>>>> limited
> > >>>>>>>>>>>>>>> scope. Maybe other stuff could go in the Dump namespace,
> > but
> > >>>>>>>>>>>>>>> "Dump.toJson()" would be for humans to read - so it
> should
> > >> be
> > >>>>>>> pretty
> > >>>>>>>>>>>>>>> printed, not treated as a machine-to-machine wire format.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The broader question of representing data in JSON or XML,
> > >> etc,
> > >>>>> is
> > >>>>>>>>>>>> already
> > >>>>>>>>>>>>>>> the subject of many mature libraries which are already
> easy
> > >> to
> > >>>>> use
> > >>>>>>>>>>> with
> > >>>>>>>>>>>>>>> Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > >>>>> coercions
> > >>>>>>>>>>> seems
> > >>>>>>>>>>>>>>> like it is also already addressed in many ways elsewhere.
> > >>>>>>>>>>>>>>> Transform.via(TypeConverter) is basically the same as
> > >>>>>>>>>>>>>>> MapElements.via(<lambda>) and also easy to use with Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > >>> approaches,
> > >>>>>>> and
> > >>>>>>>>>>> we
> > >>>>>>>>>>>>>>> shouldn't commit our users to one of them.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik
> > >>>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The suggestions you give seem good except for the XML
> > >>> cases.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line
> similar
> > >> to
> > >>>>> the
> > >>>>>>>>>>> JSON
> > >>>>>>>>>>>>>>>> examples you have been giving.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <
> > >>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
> > >>>>>>>>>>>>>>>>> that whatever the addition, it shouldn't just handle KV. It should
> > >>>>>>>>>>>>>>>>> handle Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give
> > someone
> > >>>>>>>>>>> something
> > >>>>>>>>>>>>>>>>> general purpose enough that you would just end up
> writing
> > >>> your
> > >>>>>>> own
> > >>>>>>>>>>>> code
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> handle it anyway.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a
> > >> method
> > >>>>> and
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>> resulting string output:
> > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> one,two,three
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > >>> reusable
> > >>>>>>>>>> code
> > >>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>> writing your own for more difficult formatting?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik
> > >>>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment
> in
> > >>>>> TextIO,
> > >>>>>>>>>>>> people
> > >>>>>>>>>>>>>>>>> will then ask why JSON, XML, ... are not also supported.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact
> that
> > >>> the
> > >>>>>>>>>> input
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> format
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> is an Iterable<Item>. You had posted a question about
> > >> using
> > >>> KV
> > >>>>>>>>>> with
> > >>>>>>>>>>>>>>>>> TextIO.Write which wouldn't align with the proposed
> input
> > >>>>> format
> > >>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> still
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> would require to write a type conversion function, this
> > >> time
> > >>>>>>> from
> > >>>>>>>>>>> KV
> > >>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>> Iterable<Item> instead of KV to string.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <
> > >>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for
> > >>> TextIO.Write.
> > >>>>>>> For
> > >>>>>>>>>>> CSV
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> call would look like:
> > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix,
> > >>> delimiter,
> > >>>>>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>>>>> StringBuffer buffer = new StringBuffer(prefix);
> > >>>>>>>>>>>>>>>>>> boolean first = true;
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> for (Item item : list) {
> > >>>>>>>>>>>>>>>>>>   // Put the delimiter between items, never after the last one.
> > >>>>>>>>>>>>>>>>>>   if (!first) {
> > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>>>>>   buffer.append(item.toString());
> > >>>>>>>>>>>>>>>>>>   first = false;
> > >>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and
> other
> > >>>>>>> formats
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> without
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> complicated logic. The same sort of thing could be done
> > >> for
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> TextIO.Write.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik
> > >>>>>>>>>>>> <[email protected]
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The conversion from object to string will have uses
> > >> outside
> > >>>>> of
> > >>>>>>>>>>> just
> > >>>>>>>>>>>>>>>>>>> TextIO.Write so it seems logical that we would want
> to
> > >>> have
> > >>>>> a
> > >>>>>>>>>>> ParDo
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> do
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> conversion.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
> > >>>>> consider
> > >>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> subset
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of CSV like formats where it could have fixed width
> > >> fields,
> > >>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> escaping
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> quoting around other fields, or headers that should
> be
> > >>>>> placed
> > >>>>>>> at
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> top.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Having all these format conversions within
> TextIO.Write
> > >>>>> seems
> > >>>>>>>>>>> like
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> lot
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> of
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> logic to contain in that transform which should just
> > >> focus
> > >>>>> on
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> writing
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> to
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> files.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> [email protected]>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing
> list.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > >>>>> PCollection<KV>
> > >>>>>>> to
> > >>>>>>>>>>>>>>>>>>>> PCollection<String>.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually
> > convert
> > >>> the
> > >>>>>>> KV
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> to a
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> String:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>         p.apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
> > >>>>>>>>>>>>>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
> > >>>>>>>>>>>>>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>>>>         p.apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>                 .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>> *                .apply(ToString.stringify())*
> > >>>>>>>>>>>>>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
> > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
> > >>>>>>>>>>>>>>>>>>>>    toString() on it:
> > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > >>>>>>>>>>>>>>>>>>>>        @Override
> > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
> > >>>>>>>>>>>>>>>>>>>>    toString of any Object.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed
> > >> when
> > >>>>>>>>>> you're
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> using
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> TextIO.Write? Will a general purpose transform only
> work
> > >> in
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> certain
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> cases
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> and you'll normally have to write custom code to format
> > the
> > >>>>>>> strings
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> the
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> way
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> you want them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object
> > >> support
> > >>> to
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> TextIO.Write
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> or
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> a SimpleFunction that takes a delimiter as an
> > argument.
> > >>>>>>> Making
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>> SimpleFunction that's able to specify a delimiter
> (and
> > >>>>>>> perhaps
> > >>>>>>>>>> a
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> prefix
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> suffix) should cover the majority of formats and
> > cases.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>>>>> [email protected]
> > >>>>>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> --
> > >>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>> [email protected]
> > >>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>> Talend - http://www.talend.com
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>> --
> > >>>>>>> Jean-Baptiste Onofré
> > >>>>>>> [email protected]
> > >>>>>>> http://blog.nanthrax.net
> > >>>>>>> Talend - http://www.talend.com
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Jean-Baptiste Onofré
> > >>>>> [email protected]
> > >>>>> http://blog.nanthrax.net
> > >>>>> Talend - http://www.talend.com
> > >>>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Jean-Baptiste Onofré
> > >>> [email protected]
> > >>> http://blog.nanthrax.net
> > >>> Talend - http://www.talend.com
> > >>>
> > >>
> > >
> >
> > --
> > Jean-Baptiste Onofré
> > [email protected]
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>
