Re: PCollection to PCollection Conversion

Ismaël Mejía Wed, 09 Nov 2016 06:04:32 -0800

Nice discussion, and thanks Jesse for bringing this subject back.

I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the sdk, but that we all end up
re-writing somehow.


This is a needed improvement to be more developer friendly, but also as a
reference of good practices of Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository at least for visibility reasons.

One additional question is whether these transforms represent a different DSL
or whether they could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain. Even if that is not the case, it would be really nice to start
working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

> Related to spark-package, we also have Apache Bahir to host
> connectors/transforms for Spark and Flink.
>
> IMHO, right now, Beam should host this, not sure if it makes sense
> directly in the core.
>
> It reminds me of the "Integration" DSL we discussed in the technical vision
> document.
>
> Regards
> JB
>
>
> On 11/09/2016 11:17 AM, Amit Sela wrote:
>
>> I think Jesse has a very good point on one hand, while Luke's and Kenneth's
>> worries about committing users to specific implementations are valid.
>>
>> The Spark community has a 3rd party repository for useful libraries that
>> for various reasons are not a part of the Apache Spark project:
>> https://spark-packages.org/.
>>
>> Maybe a "common-transformations" package would serve both users' quick
>> ramp-up and ease-of-use while keeping Beam more "enabling"?
>>
>> On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <[email protected]>
>> wrote:
>>
>>> It seems useful for small scale debugging / demoing to have
>>> Dump.toString(). I think it should be named to clearly indicate its
>>> limited scope. Maybe other stuff could go in the Dump namespace, but
>>> "Dump.toJson()" would be for humans to read - so it should be pretty
>>> printed, not treated as a machine-to-machine wire format.
>>>
>>> The broader question of representing data in JSON or XML, etc, is already
>>> the subject of many mature libraries which are already easy to use with
>>> Beam.
>>>
>>> The more esoteric practice of implicit or semi-implicit coercions seems
>>> like it is also already addressed in many ways elsewhere.
>>> Transform.via(TypeConverter) is basically the same as
>>> MapElements.via(<lambda>) and also easy to use with Beam.
>>>
>>> In both of the last cases, there are many reasonable approaches, and we
>>> shouldn't commit our users to one of them.
>>>
>>> On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <[email protected]>
>>> wrote:
>>>
>>>> The suggestions you give seem good except for the XML cases.
>>>>
>>>> Might want to have the XML be a document per line similar to the JSON
>>>> examples you have been giving.
>>>>
>>>> On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <[email protected]>
>>>> wrote:
>>>>
>>>>> @lukasz Agreed there would have to be KV handling. I was more thinking
>>>>> that whatever the addition, it shouldn't just handle KV. It should handle
>>>>> Iterables, Lists, Sets, and KVs.
>>>>>
>>>>> For JSON and XML, I wonder if we'd be able to give someone something
>>>>> general purpose enough, or whether you would just end up writing your own
>>>>> code to handle it anyway.
>>>>>
>>>>> Here are some ideas on what it could look like with a method and the
>>>>> resulting string output:
>>>>> *Stringify.toJSON()*
>>>>>
>>>>> With KV:
>>>>> {"key": "value"}
>>>>>
>>>>> With Iterables:
>>>>> ["one", "two", "three"]
>>>>>
>>>>> *Stringify.toXML("rootelement")*
>>>>>
>>>>> With KV:
>>>>> <rootelement key="value" />
>>>>>
>>>>> With Iterables:
>>>>> <rootelement>
>>>>>   <item>one</item>
>>>>>   <item>two</item>
>>>>>   <item>three</item>
>>>>> </rootelement>
>>>>>
>>>>> *Stringify.toDelimited(",")*
>>>>>
>>>>> With KV:
>>>>> key,value
>>>>>
>>>>> With Iterables:
>>>>> one,two,three
>>>>>
>>>>> Do you think that would strike a good balance between reusable code and
>>>>> writing your own for more difficult formatting?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Jesse
>>>>>
>>>>> On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Jesse, I believe if one format gets special treatment in TextIO, people
>>>>> will then ask why JSON, XML, ... are not also supported.
>>>>>
>>>>> Also, the example that you provide relies on the input format being an
>>>>> Iterable<Item>. You had posted a question about using KV with
>>>>> TextIO.Write, which wouldn't align with the proposed input format and
>>>>> would still require writing a type conversion function, this time from KV
>>>>> to Iterable<Item> instead of KV to string.
>>>>>
>>>>> On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Lukasz,
>>>>>>
>>>>>> I don't think you'd need complicated logic for TextIO.Write. For CSV the
>>>>>> call would look like:
>>>>>> Stringify.to("", ",", "\n");
>>>>>>
>>>>>> Where the arguments would be Stringify.to(prefix, delimiter, suffix).
>>>>>>
>>>>>> The code would be something like:
>>>>>> StringBuilder buffer = new StringBuilder(prefix);
>>>>>>
>>>>>> boolean first = true;
>>>>>> for (Item item : list) {
>>>>>>   // Put the delimiter between items, not after the last one.
>>>>>>   if (!first) {
>>>>>>     buffer.append(delimiter);
>>>>>>   }
>>>>>>   buffer.append(item.toString());
>>>>>>   first = false;
>>>>>> }
>>>>>>
>>>>>> buffer.append(suffix);
>>>>>>
>>>>>> c.output(buffer.toString());
>>>>>>
>>>>>> That would allow you to do the basic CSV, TSV, and other formats without
>>>>>> complicated logic. The same sort of thing could be done for TextIO.Write.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Jesse
>>>>>>
>>>>>> On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> The conversion from object to string will have uses outside of just
>>>>>>> TextIO.Write so it seems logical that we would want to have a ParDo do
>>>>>>> the conversion.
>>>>>>>
>>>>>>> Text file formats have a lot of variance, even if you consider the
>>>>>>> subset of CSV-like formats where it could have fixed width fields, or
>>>>>>> escaping and quoting around other fields, or headers that should be
>>>>>>> placed at the top.
>>>>>>>
>>>>>>> Having all these format conversions within TextIO.Write seems like a
>>>>>>> lot of logic to contain in that transform which should just focus on
>>>>>>> writing to files.
>>>>>>>
>>>>>>> On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>> This is a thread moved over from the user mailing list.
>>>>>>>>
>>>>>>>> I think there needs to be a way to convert a PCollection<KV> to a
>>>>>>>> PCollection<String>.
>>>>>>>>
>>>>>>>> To do a minimal WordCount, you have to manually convert the KV to a
>>>>>>>> String:
>>>>>>>
>>>>>>>>         p
>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
>>>>>>>>                 .apply(Regex.split("\\W+"))
>>>>>>>>                 .apply(Count.perElement())
>>>>>>>> *                .apply(MapElements.via((KV<String, Long> count) ->*
>>>>>>>> *                            count.getKey() + ":" + count.getValue()*
>>>>>>>> *                        ).withOutputType(TypeDescriptors.strings()))*
>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
>>>>>>>>
>>>>>>>> This code really should be something like:
>>>>>>>>         p
>>>>>>>>                 .apply(TextIO.Read.from("playing_cards.tsv"))
>>>>>>>>                 .apply(Regex.split("\\W+"))
>>>>>>>>                 .apply(Count.perElement())
>>>>>>>> *                .apply(ToString.stringify())*
>>>>>>>>                 .apply(TextIO.Write.to("output/stringcounts"));
>>>>>>>>
>>>>>>>> To summarize the discussion:
>>>>>>>>
>>>>>>>>    - JA: Add a method to StringDelegateCoder to output any KV or list
>>>>>>>>    - JA and DH: Add a SimpleFunction that takes a type and runs
>>>>>>>>    toString() on it:
>>>>>>>>    class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
>>>>>>>>        @Override
>>>>>>>>        public String apply(InputT input) {
>>>>>>>>            return input.toString();
>>>>>>>>        }
>>>>>>>>    }
>>>>>>>>    - JB: Add a general purpose type converter like in Apache Camel.
>>>>>>>>    - JA: Add Object support to TextIO.Write that would write out the
>>>>>>>>    toString of any Object.
>>>>>>>>
>>>>>>>> My thoughts:
>>>>>>>>
>>>>>>>> Is converting to a PCollection<String> mostly needed when you're using
>>>>>>>> TextIO.Write? Will a general purpose transform only work in certain
>>>>>>>> cases and you'll normally have to write custom code to format the
>>>>>>>> strings the way you want them?
>>>>>>>>
>>>>>>>> IMHO, it's yes to both. I'd prefer to add Object support to
>>>>>>>> TextIO.Write or a SimpleFunction that takes a delimiter as an argument.
>>>>>>>> Making a SimpleFunction that's able to specify a delimiter (and perhaps
>>>>>>>> a prefix and suffix) should cover the majority of formats and cases.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Jesse
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> [email protected]
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
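
For reference, the Stringify.to(prefix, delimiter, suffix) idea from the
thread can be sketched in plain Java roughly as below. This is only an
illustration under stated assumptions: the Stringify class and both to()
overloads are hypothetical names taken from the discussion, not an existing
Beam API, and java.util.Map.Entry stands in for Beam's KV so the snippet
runs without the SDK. It does no escaping or quoting, which is exactly the
limitation Lukasz raises for CSV-like formats.

```java
import java.util.AbstractMap;
import java.util.Arrays;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper sketched from the thread; not part of the Beam SDK. */
final class Stringify {
  private Stringify() {}

  /** Joins each item's string form with the delimiter, wrapped in prefix and suffix. */
  static String to(Iterable<?> items, String prefix, String delimiter, String suffix) {
    StringJoiner joiner = new StringJoiner(delimiter, prefix, suffix);
    for (Object item : items) {
      joiner.add(String.valueOf(item));
    }
    return joiner.toString();
  }

  /** Key/value variant; Map.Entry is an assumed stand-in for Beam's KV. */
  static String to(Map.Entry<?, ?> kv, String prefix, String delimiter, String suffix) {
    return prefix + kv.getKey() + delimiter + kv.getValue() + suffix;
  }

  public static void main(String[] args) {
    // CSV-style call from the thread: empty prefix, comma delimiter, newline suffix.
    System.out.print(to(Arrays.asList("one", "two", "three"), "", ",", "\n"));
    System.out.print(to(new AbstractMap.SimpleEntry<>("key", 3L), "", ",", "\n"));
  }
}
```

Inside a pipeline, logic like this would sit in the function passed to
MapElements (or in a ParDo), leaving TextIO.Write to deal only with writing
files, as suggested earlier in the thread.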
