On Thu, Dec 29, 2016 at 1:36 PM, Jesse Anderson <[email protected]> wrote:
> I prefer JB's take. I think there should be three overloaded methods on the
> class. I like Vikas' name ToString. The methods for a simple conversion
> should be:
>
> ToString.strings() - Outputs the .toString() of the objects in the
> PCollection
> ToString.strings(String delimiter) - Outputs the .toString() of KVs, Lists,
> etc with the delimiter between every entry
> ToString.formatted(String format) - Outputs the formatted
> <https://docs.oracle.com/javase/8/docs/api/java/util/Formatter.html> string
> with the object passed in. For objects made up of different parts like KVs,
> each one is passed in as separate toString() of a varargs.
>

Riffing a little, with some types:

ToString.<T>of() -- PTransform<T, String> that is equivalent to a ParDo
that takes in a T and outputs T.toString().

ToString.<K,V>kv(String delimiter) -- PTransform<KV<K, V>, String> that is
equivalent to a ParDo that takes in a KV<K,V> and outputs
kv.getKey().toString() + delimiter + kv.getValue().toString()

ToString.<T>iterable(String delimiter) -- PTransform<? extends Iterable<T>,
String> that is equivalent to a ParDo that takes in an Iterable<T> and
outputs iterable[0] + delimiter + iterable[1] + delimiter + ... +
delimiter + iterable[N-1]

ToString.<T>custom(SerializableFunction<T, String> formatter) ?

The last one is just MapElements.via, except you don't need to set the
output type.

I don't see a way to make the generic .formatted() that you propose just
work with anything "made of different parts".

I think adding too many methods beyond "of" and "custom" is opening up a
Pandora's Box. The KV one might want to have left and right delimiters,
might want to take custom formatters for K and V, etc. The iterable one
might want to have a special configuration for an empty iterable. So I'm
inclined towards simplicity with the awareness that MapElements.via is just
not that hard to use.
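To make the shapes above concrete, here is a rough sketch of the per-element
functions in plain Java. It is only an illustration under assumptions, not
Beam code: java.util.function.Function stands in for the function a ParDo
would run, Map.Entry stands in for KV, and the class name ToStringSketch is
made up for this example.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins for the proposed per-element conversions.
// Function replaces the DoFn, Map.Entry replaces Beam's KV.
final class ToStringSketch {
    private ToStringSketch() {}

    // Mirrors ToString.<T>of(): element -> element.toString()
    static <T> Function<T, String> of() {
        return Object::toString;
    }

    // Mirrors ToString.<K,V>kv(delimiter): key + delimiter + value
    static <K, V> Function<Map.Entry<K, V>, String> kv(String delimiter) {
        return e -> e.getKey().toString() + delimiter + e.getValue().toString();
    }

    // Mirrors ToString.<T>iterable(delimiter): items joined by the delimiter,
    // with no trailing delimiter; an empty iterable yields "".
    static <T> Function<Iterable<T>, String> iterable(String delimiter) {
        return items -> {
            StringBuilder sb = new StringBuilder();
            Iterator<T> it = items.iterator();
            while (it.hasNext()) {
                sb.append(it.next());
                if (it.hasNext()) {
                    sb.append(delimiter);
                }
            }
            return sb.toString();
        };
    }

    public static void main(String[] args) {
        System.out.println(ToStringSketch.<Integer>of().apply(42));
        System.out.println(ToStringSketch.<String, Long>kv(":")
                .apply(new SimpleEntry<>("ace", 4L)));
        System.out.println(ToStringSketch.<String>iterable(",")
                .apply(Arrays.asList("one", "two", "three")));
        System.out.println(ToStringSketch.<String>iterable(",")
                .apply(Collections.<String>emptyList()).isEmpty());
    }
}
```

Note that "custom" needs no helper here: it is just whatever Function<T, String>
the caller passes in, which is why it collapses into MapElements.via. The
empty-iterable case returning "" is exactly the kind of choice where people
will want a knob.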
Dan

>
> I think doing these three methods would cover every simple and advanced
> "simple conversions." As JB says, we'll need other specific converters for
> other formats like XML.
>
> I'd really like to see this class in the next version of Beam. What does
> everyone think of the class name, method names, and method operations so we
> can have Vikas finish up?
>
> Thanks,
>
> Jesse
>
> On Wed, Dec 28, 2016 at 12:28 PM Jean-Baptiste Onofré <[email protected]>
> wrote:
>
> > Hi Vikas,
> >
> > did you take a look at:
> >
> > https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> >
> > You can see KV2String and ToString could be part of this extension.
> > I'm also using JAXB for XML and Jackson for JSON
> > marshalling/unmarshalling. I'm planning to deal with Avro (IndexedRecord).
> >
> > Regards
> > JB
> >
> > On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
> > > Hi All,
> > >
> > > Not being aware of the discussion here, I sent out a PR
> > > <https://github.com/apache/beam/pull/1704> but JB and others directed me
> > > to this thread. Having converted PCollection<T> to PCollection<String>
> > > several times, I feel something like a 'ToString' transform is common
> > > enough to be part of the core. What do you all think?
> > >
> > > Also, if someone else is already working on or interested in tackling
> > > this, then I am happy to discard the PR.
> > >
> > > Regards,
> > > Vikas
> > >
> > > On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <[email protected]> wrote:
> > >
> > >> It seems that there were a lot of good points raised here, and I tend to
> > >> agree that something as trivial and lean as "ToString" should be a part
> > >> of core.
> > >> I'm particularly fond of makeString(prefix, toString, suffix) in various
> > >> combinations (Scala-like).
> > >> For "fromString", I think JB has a good point leveraging JAXB and
> > >> Jackson - though I think this should be in extensions as it is not as
> > >> lean as toString.
> > >>
> > >> Thanks,
> > >> Amit
> > >>
> > >> On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <[email protected]>
> > >> wrote:
> > >>
> > >>> Hi Jesse,
> > >>>
> > >>> yes, I started something there (using JAXB and Jackson). Let me polish
> > >>> and push.
> > >>>
> > >>> Regards
> > >>> JB
> > >>>
> > >>> On 11/29/2016 10:00 PM, Jesse Anderson wrote:
> > >>>> I went through the string conversions. Do you have an example of
> > >>>> writing out XML/JSON/etc too?
> > >>>>
> > >>>> On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <[email protected]>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Jesse,
> > >>>>>
> > >>>>> https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat
> > >>>>>
> > >>>>> it's very simple and stupid and of course not complete at all (I have
> > >>>>> other commits but not merged as they need some polishing), but as I
> > >>>>> said, it's a base of discussion.
> > >>>>>
> > >>>>> Regards
> > >>>>> JB
> > >>>>>
> > >>>>> On 11/29/2016 09:23 PM, Jesse Anderson wrote:
> > >>>>>> @jb Sounds good. Just let us know once you've pushed.
> > >>>>>>
> > >>>>>> On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <[email protected]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> Good point Eugene.
> > >>>>>>>
> > >>>>>>> Right now, it's a DoFn collection to experiment a bit (a pure
> > >>>>>>> extension). It's pretty stupid ;)
> > >>>>>>>
> > >>>>>>> But, you are right, depending on the direction of such an extension,
> > >>>>>>> it could cover more use cases (even if it's not my first intention ;)).
> > >>>>>>>
> > >>>>>>> Let me push the branch (pretty small) as an illustration, and in the
> > >>>>>>> meantime, I'm preparing a document (more focused on the use cases).
> > >>>>>>>
> > >>>>>>> WDYT ?
> > >>>>>>>
> > >>>>>>> Regards
> > >>>>>>> JB
> > >>>>>>>
> > >>>>>>> On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
> > >>>>>>>> Hi JB,
> > >>>>>>>> Depending on the scope of what you want to ultimately accomplish
> > >>>>>>>> with this extension, I think it may make sense to write a proposal
> > >>>>>>>> document and discuss it.
> > >>>>>>>> If it's just a collection of utility DoFn's for various well-defined
> > >>>>>>>> source/target format pairs, then that's probably not needed, but if
> > >>>>>>>> it's anything more, then I think it is.
> > >>>>>>>> That will help avoid a lot of churn if people propose reasonable
> > >>>>>>>> significant changes.
> > >>>>>>>>
> > >>>>>>>> On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <[email protected]>
> > >>>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> By the way Jesse, I'm going to push my DATAFORMAT branch on my
> > >>>>>>>>> github and I will post on the dev mailing list when done.
> > >>>>>>>>>
> > >>>>>>>>> Regards
> > >>>>>>>>> JB
> > >>>>>>>>>
> > >>>>>>>>> On 11/29/2016 07:01 PM, Jesse Anderson wrote:
> > >>>>>>>>>> I want to bring this thread back up since we've had time to think
> > >>>>>>>>>> about it more and make a plan.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a format-specific converter will be a more time-consuming
> > >>>>>>>>>> task than we originally thought. It'd have to be a writer that
> > >>>>>>>>>> takes another writer as a parameter.
> > >>>>>>>>>>
> > >>>>>>>>>> I think a string converter can be done as a simple transform.
> > >>>>>>>>>>
> > >>>>>>>>>> I think we should start with a simple string converter and plan
> > >>>>>>>>>> for a format-specific writer.
> > >>>>>>>>>>
> > >>>>>>>>>> What are your thoughts?
> > >>>>>>>>>>
> > >>>>>>>>>> Thanks,
> > >>>>>>>>>>
> > >>>>>>>>>> Jesse
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <[email protected]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I was thinking about what the outputs would look like last night. I
> > >>>>>>>>>> realized that more complex formats like JSON and XML may or may not
> > >>>>>>>>>> output the data in a valid format.
> > >>>>>>>>>>
> > >>>>>>>>>> Doing a direct conversion on unbounded collections would work just
> > >>>>>>>>>> fine. They're self-contained. For writing out bounded collections,
> > >>>>>>>>>> that's where we'll hit the issues. This changes the uber conversion
> > >>>>>>>>>> transform into a transform that needs to be a writer.
> > >>>>>>>>>>
> > >>>>>>>>>> If a transform executes a JSON conversion on a per element basis,
> > >>>>>>>>>> we'd get this:
> > >>>>>>>>>> {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>>
> > >>>>>>>>>> That isn't valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> The conversion transform would need to do several things when
> > >>>>>>>>>> writing out a file. It would need to add brackets for an array. Now
> > >>>>>>>>>> we have:
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> },
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> We still don't have valid JSON. We have to remove the last comma or
> > >>>>>>>>>> have the uber transform start putting in the commas, except for the
> > >>>>>>>>>> last element.
> > >>>>>>>>>>
> > >>>>>>>>>> [
> > >>>>>>>>>> {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> }, {
> > >>>>>>>>>>   "key": "value"
> > >>>>>>>>>> }
> > >>>>>>>>>> ]
> > >>>>>>>>>>
> > >>>>>>>>>> Only by doing this do we have valid JSON.
> > >>>>>>>>>>
> > >>>>>>>>>> I'd argue we'd have a similar issue with XML. Some parsers require a
> > >>>>>>>>>> root element for everything. The uber transform would have to put
> > >>>>>>>>>> the root element tags at the beginning and end of the file.
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <[email protected]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>> I would love to see a lean core and abundant Transforms at the same
> > >>>>>>>>>> time.
> > >>>>>>>>>>
> > >>>>>>>>>> Maybe we can look at what Confluent <https://github.com/confluentinc>
> > >>>>>>>>>> does for kafka-connect. They have official extensions support for
> > >>>>>>>>>> JDBC, HDFS and ElasticSearch under https://github.com/confluentinc.
> > >>>>>>>>>> They put them along with other community extensions on
> > >>>>>>>>>> https://www.confluent.io/product/connectors/ for visibility.
> > >>>>>>>>>>
> > >>>>>>>>>> Although we are not a commercial company, can we have a GitHub user
> > >>>>>>>>>> like beam-community to host projects we build around beam but that
> > >>>>>>>>>> are not suitable for https://github.com/apache/incubator-beam? In
> > >>>>>>>>>> the future, we may have beam-algebra like
> > >>>>>>>>>> http://github.com/twitter/algebird for algebra operations and
> > >>>>>>>>>> beam-ml / beam-dl for machine learning / deep learning. Also, there
> > >>>>>>>>>> will be beam related projects elsewhere maintained by other
> > >>>>>>>>>> communities.
> > >>>>>>>>>> We can put all of them on the beam-website or like spark packages
> > >>>>>>>>>> as mentioned by Amit.
> > >>>>>>>>>>
> > >>>>>>>>>> My $0.02
> > >>>>>>>>>> Manu
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <[email protected]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> On this point from Amit and Ismaël, I agree: we could benefit from
> > >>>>>>>>>>> a place for miscellaneous non-core helper transformations.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We have sdks/java/extensions but it is organized as separate
> > >>>>>>>>>>> artifacts. I think that is fine, considering the nature of Join and
> > >>>>>>>>>>> SortValues. But for simpler transforms, importing one artifact per
> > >>>>>>>>>>> tiny transform is too much overhead. It also seems unlikely that we
> > >>>>>>>>>>> will have enough commonality among the transforms to call the
> > >>>>>>>>>>> artifact anything other than [some synonym for] "miscellaneous".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I wouldn't want to take this too far - even though the SDK has many
> > >>>>>>>>>>> transforms* that are not required for the model [1], I like that
> > >>>>>>>>>>> the SDK artifact has everything a user might need in their "getting
> > >>>>>>>>>>> started" phase of use. This user-friendliness (the user doesn't
> > >>>>>>>>>>> care that ParDo is core and Sum is not) plus the difficulty of
> > >>>>>>>>>>> judging which transforms go where, are probably why we have them
> > >>>>>>>>>>> mostly all in one place.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Models to look at, off the top of my head, include Pig's PiggyBank
> > >>>>>>>>>>> and Apex's Malhar.
> > >>>>>>>>>>> These have different levels of support implied. Others?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Kenn
> > >>>>>>>>>>>
> > >>>>>>>>>>> [1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
> > >>>>>>>>>>> Filter, FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
> > >>>>>>>>>>> Values, KvSwap, Partition, Regex, Sample, Sum, Top, Values,
> > >>>>>>>>>>> WithKeys, WithTimestamps
> > >>>>>>>>>>>
> > >>>>>>>>>>> * at least they are separate classes and not methods on PCollection
> > >>>>>>>>>>> :-)
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <[email protected]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Nice discussion, and thanks Jesse for bringing this subject back.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I agree 100% with Amit and the idea of having a home for those
> > >>>>>>>>>>>> transforms that are not core enough to be part of the sdk, but
> > >>>>>>>>>>>> that we all end up re-writing somehow.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> This is a needed improvement to be more developer friendly, but
> > >>>>>>>>>>>> also as a reference of good practices of Beam development, and for
> > >>>>>>>>>>>> this reason I agree with JB that at this moment it would be better
> > >>>>>>>>>>>> for these transforms to reside in the Beam repository at least for
> > >>>>>>>>>>>> visibility reasons.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> One additional question is if these transforms represent a
> > >>>>>>>>>>>> different DSL or if those could be grouped with the current
> > >>>>>>>>>>>> extensions (e.g.
> > >>>>>>>>>>>> Join and SortValues) into something more general that we as a
> > >>>>>>>>>>>> community could maintain, but well even if it is not the case, it
> > >>>>>>>>>>>> would be really nice to start working on something like this.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Ismaël Mejía
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <[email protected]>
> > >>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Related to spark-package, we also have Apache Bahir to host
> > >>>>>>>>>>>>> connectors/transforms for Spark and Flink.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> IMHO, right now, Beam should host this, not sure if it makes
> > >>>>>>>>>>>>> sense directly in the core.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> It reminds me of the "Integration" DSL we discussed in the
> > >>>>>>>>>>>>> technical vision document.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> Regards
> > >>>>>>>>>>>>> JB
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 11/09/2016 11:17 AM, Amit Sela wrote:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>> I think Jesse has a very good point on one hand, while Luke's
> > >>>>>>>>>>>>>> and Kenneth's worries about committing users to specific
> > >>>>>>>>>>>>>> implementations are in place.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> The Spark community has a 3rd party repository for useful
> > >>>>>>>>>>>>>> libraries that for various reasons are not a part of the Apache
> > >>>>>>>>>>>>>> Spark project: https://spark-packages.org/.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Maybe a "common-transformations" package would serve both users'
> > >>>>>>>>>>>>>> quick ramp-up and ease-of-use while keeping Beam more "enabling"?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <[email protected]>
> > >>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> It seems useful for small scale debugging / demoing to have
> > >>>>>>>>>>>>>>> Dump.toString(). I think it should be named to clearly indicate
> > >>>>>>>>>>>>>>> its limited scope. Maybe other stuff could go in the Dump
> > >>>>>>>>>>>>>>> namespace, but "Dump.toJson()" would be for humans to read - so
> > >>>>>>>>>>>>>>> it should be pretty printed, not treated as a
> > >>>>>>>>>>>>>>> machine-to-machine wire format.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The broader question of representing data in JSON or XML, etc,
> > >>>>>>>>>>>>>>> is already the subject of many mature libraries which are
> > >>>>>>>>>>>>>>> already easy to use with Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> The more esoteric practice of implicit or semi-implicit
> > >>>>>>>>>>>>>>> coercions seems like it is also already addressed in many ways
> > >>>>>>>>>>>>>>> elsewhere. Transform.via(TypeConverter) is basically the same
> > >>>>>>>>>>>>>>> as MapElements.via(<lambda>) and also easy to use with Beam.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> In both of the last cases, there are many reasonable
> > >>>>>>>>>>>>>>> approaches, and we shouldn't commit our users to one of them.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <[email protected]>
> > >>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> The suggestions you give seem good except for the XML cases.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Might want to have the XML be a document per line similar to
> > >>>>>>>>>>>>>>>> the JSON examples you have been giving.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <[email protected]>
> > >>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> @lukasz Agreed there would have to be KV handling. I was more
> > >>>>>>>>>>>>>>>>> thinking that whatever the addition, it shouldn't just handle
> > >>>>>>>>>>>>>>>>> KV. It should handle Iterables, Lists, Sets, and KVs.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> For JSON and XML, I wonder if we'd be able to give someone
> > >>>>>>>>>>>>>>>>> something general purpose enough that you would just end up
> > >>>>>>>>>>>>>>>>> writing your own code to handle it anyway.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Here are some ideas on what it could look like with a method
> > >>>>>>>>>>>>>>>>> and the resulting string output:
> > >>>>>>>>>>>>>>>>> *Stringify.toJSON()*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> {"key": "value"}
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> ["one", "two", "three"]
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toXML("rootelement")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> <rootelement key=value />
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> <rootelement>
> > >>>>>>>>>>>>>>>>>   <item>one</item>
> > >>>>>>>>>>>>>>>>>   <item>two</item>
> > >>>>>>>>>>>>>>>>>   <item>three</item>
> > >>>>>>>>>>>>>>>>> </rootelement>
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> *Stringify.toDelimited(",")*
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With KV:
> > >>>>>>>>>>>>>>>>> key,value
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> With Iterables:
> > >>>>>>>>>>>>>>>>> one,two,three
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Do you think that would strike a good balance between
> > >>>>>>>>>>>>>>>>> reusable code and writing your own for more difficult
> > >>>>>>>>>>>>>>>>> formatting?
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <[email protected]>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Jesse, I believe if one format gets special treatment in
> > >>>>>>>>>>>>>>>>> TextIO, people will then ask why JSON, XML, ... are not also
> > >>>>>>>>>>>>>>>>> supported.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> Also, the example that you provide is using the fact that the
> > >>>>>>>>>>>>>>>>> input format is an Iterable<Item>. You had posted a question
> > >>>>>>>>>>>>>>>>> about using KV with TextIO.Write which wouldn't align with
> > >>>>>>>>>>>>>>>>> the proposed input format and would still require writing a
> > >>>>>>>>>>>>>>>>> type conversion function, this time from KV to Iterable<Item>
> > >>>>>>>>>>>>>>>>> instead of KV to string.
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <[email protected]>
> > >>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Lukasz,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> I don't think you'd need complicated logic for TextIO.Write.
> > >>>>>>>>>>>>>>>>>> For CSV the call would look like:
> > >>>>>>>>>>>>>>>>>> Stringify.to("", ",", "\n");
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Where the arguments would be Stringify.to(prefix, delimiter,
> > >>>>>>>>>>>>>>>>>> suffix).
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> The code would be something like:
> > >>>>>>>>>>>>>>>>>> StringBuilder buffer = new StringBuilder(prefix);
> > >>>>>>>>>>>>>>>>>> Iterator<Item> items = list.iterator();
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> while (items.hasNext()) {
> > >>>>>>>>>>>>>>>>>>   buffer.append(items.next().toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>   // Delimiter goes between items, never after the last one.
> > >>>>>>>>>>>>>>>>>>   if (items.hasNext()) {
> > >>>>>>>>>>>>>>>>>>     buffer.append(delimiter);
> > >>>>>>>>>>>>>>>>>>   }
> > >>>>>>>>>>>>>>>>>> }
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> buffer.append(suffix);
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> c.output(buffer.toString());
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> That would allow you to do the basic CSV, TSV, and other
> > >>>>>>>>>>>>>>>>>> formats without complicated logic. The same sort of thing
> > >>>>>>>>>>>>>>>>>> could be done for TextIO.Write.
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <[email protected]>
> > >>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> The conversion from object to string will have uses outside
> > >>>>>>>>>>>>>>>>>>> of just TextIO.Write so it seems logical that we would want
> > >>>>>>>>>>>>>>>>>>> to have a ParDo do the conversion.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Text file formats have a lot of variance, even if you
> > >>>>>>>>>>>>>>>>>>> consider the subset of CSV like formats where it could have
> > >>>>>>>>>>>>>>>>>>> fixed width fields, or escaping and quoting around other
> > >>>>>>>>>>>>>>>>>>> fields, or headers that should be placed at the top.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> Having all these format conversions within TextIO.Write
> > >>>>>>>>>>>>>>>>>>> seems like a lot of logic to contain in that transform
> > >>>>>>>>>>>>>>>>>>> which should just focus on writing to files.
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <[email protected]>
> > >>>>>>>>>>>>>>>>>>> wrote:
> > >>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> This is a thread moved over from the user mailing list.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> I think there needs to be a way to convert a
> > >>>>>>>>>>>>>>>>>>>> PCollection<KV> to a PCollection<String>.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To do a minimal WordCount, you have to manually convert
> > >>>>>>>>>>>>>>>>>>>> the KV to a String:
> > >>>>>>>>>>>>>>>>>>>> p
> > >>>>>>>>>>>>>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>     .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>     .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>>     // Manual KV-to-String conversion:
> > >>>>>>>>>>>>>>>>>>>>     .apply(MapElements.via((KV<String, Long> count) ->
> > >>>>>>>>>>>>>>>>>>>>         count.getKey() + ":" + count.getValue()
> > >>>>>>>>>>>>>>>>>>>>     ).withOutputType(TypeDescriptors.strings()))
> > >>>>>>>>>>>>>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> This code really should be something like:
> > >>>>>>>>>>>>>>>>>>>> p
> > >>>>>>>>>>>>>>>>>>>>     .apply(TextIO.Read.from("playing_cards.tsv"))
> > >>>>>>>>>>>>>>>>>>>>     .apply(Regex.split("\\W+"))
> > >>>>>>>>>>>>>>>>>>>>     .apply(Count.perElement())
> > >>>>>>>>>>>>>>>>>>>>     // Proposed one-line conversion:
> > >>>>>>>>>>>>>>>>>>>>     .apply(ToString.stringify())
> > >>>>>>>>>>>>>>>>>>>>     .apply(TextIO.Write.to("output/stringcounts"));
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> To summarize the discussion:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any
> > >>>>>>>>>>>>>>>>>>>>    KV or list
> > >>>>>>>>>>>>>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and
> > >>>>>>>>>>>>>>>>>>>>    runs toString() on it:
> > >>>>>>>>>>>>>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
> > >>>>>>>>>>>>>>>>>>>>        // Instance method: overrides SimpleFunction.apply.
> > >>>>>>>>>>>>>>>>>>>>        @Override
> > >>>>>>>>>>>>>>>>>>>>        public String apply(InputT input) {
> > >>>>>>>>>>>>>>>>>>>>            return input.toString();
> > >>>>>>>>>>>>>>>>>>>>        }
> > >>>>>>>>>>>>>>>>>>>>    }
> > >>>>>>>>>>>>>>>>>>>>    - JB: Add a general purpose type converter like in
> > >>>>>>>>>>>>>>>>>>>>    Apache Camel.
> > >>>>>>>>>>>>>>>>>>>>    - JA: Add Object support to TextIO.Write that would
> > >>>>>>>>>>>>>>>>>>>>    write out the toString of any Object.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> My thoughts:
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Is converting to a PCollection<String> mostly needed when
> > >>>>>>>>>>>>>>>>>>>> you're using TextIO.Write? Will a general purpose
> > >>>>>>>>>>>>>>>>>>>> transform only work in certain cases and you'll normally
> > >>>>>>>>>>>>>>>>>>>> have to write custom code to format the strings the way
> > >>>>>>>>>>>>>>>>>>>> you want them?
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support
> > >>>>>>>>>>>>>>>>>>>> to TextIO.Write or a SimpleFunction that takes a delimiter
> > >>>>>>>>>>>>>>>>>>>> as an argument.
> > >>>>>>>>>>>>>>>>>>>> Making a SimpleFunction that's able to specify a delimiter
> > >>>>>>>>>>>>>>>>>>>> (and perhaps a prefix and suffix) should cover the
> > >>>>>>>>>>>>>>>>>>>> majority of formats and cases.
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Thanks,
> > >>>>>>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>>>>>> Jesse
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> --
> > >>>>>>>>>>>>> Jean-Baptiste Onofré
> > >>>>>>>>>>>>> [email protected]
> > >>>>>>>>>>>>> http://blog.nanthrax.net
> > >>>>>>>>>>>>> Talend - http://www.talend.com
