Hi Vikas,

Did you take a look at:

https://github.com/jbonofre/beam/tree/DATAFORMAT/sdks/java/extensions/dataformat

You can see KV2String and ToString could be part of this extension.
I'm also using JAXB for XML and Jackson for JSON marshalling/unmarshalling. I'm planning to deal with Avro (IndexedRecord).
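For the JSON side, the idea is roughly a DoFn wrapping Jackson's ObjectMapper, something like the sketch below (illustrative only, not necessarily what the branch contains; DoFn, @Setup and @ProcessElement are from org.apache.beam.sdk.transforms, ObjectMapper from com.fasterxml.jackson.databind):

public class ToJson<T> extends DoFn<T, String> {
  // One ObjectMapper per DoFn instance, created on the worker in @Setup.
  private transient ObjectMapper mapper;

  @Setup
  public void setup() {
    mapper = new ObjectMapper();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    // Works for any Jackson-friendly element type (beans, Maps, ...).
    c.output(mapper.writeValueAsString(c.element()));
  }
}

Applied as input.apply(ParDo.of(new ToJson<MyRecord>())), where MyRecord is whatever element type you have.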

Regards
JB

On 12/28/2016 08:37 PM, Vikas Kedigehalli wrote:
Hi All,

  Not being aware of the discussion here, I sent out a PR
<https://github.com/apache/beam/pull/1704>, but JB and others directed me to
this thread. Having converted PCollection<T> to PCollection<String> several
times, I feel that something like a 'ToString' transform is common enough to
be part of the core. What do you all think?

Also, if someone else is already working on or interested in tackling this,
then I am happy to discard the PR.

Regards,
Vikas

On Tue, Dec 13, 2016 at 1:56 AM, Amit Sela <[email protected]> wrote:

It seems that a lot of good points were raised here, and I tend to agree
that something as trivial and lean as "ToString" should be a part of core.
I'm particularly fond of makeString(prefix, toString, suffix) in various
combinations (Scala-like).
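With the current Java SDK that could be expressed as a MapElements over grouped values, e.g. (a rough sketch only; grouped is a hypothetical PCollection<Iterable<String>>, such as the values side of a GroupByKey, and MapElements/SimpleFunction come from org.apache.beam.sdk.transforms):

PCollection<String> joined = grouped.apply(
    MapElements.via(new SimpleFunction<Iterable<String>, String>() {
      @Override
      public String apply(Iterable<String> elements) {
        // Scala-like makeString("[", ", ", "]"): prefix + elements joined by a delimiter + suffix.
        return "[" + String.join(", ", elements) + "]";
      }
    }));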
For "fromString", I think JB has a good point leveraging JAXB and Jackson -
though I think this should be in extensions as it is not as lean as
toString.

Thanks,
Amit

On Wed, Nov 30, 2016 at 5:13 AM Jean-Baptiste Onofré <[email protected]>
wrote:

Hi Jesse,

yes, I started something there (using JAXB and Jackson). Let me polish
and push.

Regards
JB

On 11/29/2016 10:00 PM, Jesse Anderson wrote:
I went through the string conversions. Do you have an example of writing
out XML/JSON/etc. too?

On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré <[email protected]>
wrote:

Hi Jesse,



https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat

It's very simple and stupid and of course not complete at all (I have
other commits, not yet merged, as they need some polishing), but as I
said, it's a basis for discussion.

Regards
JB

On 11/29/2016 09:23 PM, Jesse Anderson wrote:
@jb Sounds good. Just let us know once you've pushed.

On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré <[email protected]>
wrote:

Good point Eugene.

Right now, it's a collection of DoFns to experiment a bit (a pure
extension). It's pretty stupid ;)

But you are right: depending on the direction of such an extension, it
could cover more use cases (even if that's not my first intention ;)).

Let me push the branch (pretty small) as an illustration, and in the
meantime I'm preparing a document (more focused on the use cases).

WDYT?

Regards
JB

On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:
Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonable
significant changes.

On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré <[email protected]>
wrote:

By the way Jesse, I'm going to push my DATAFORMAT branch on my GitHub, and
I will post on the dev mailing list when done.

Regards
JB

On 11/29/2016 07:01 PM, Jesse Anderson wrote:
I want to bring this thread back up since we've had time to think about it
more and make a plan.

I think a format-specific converter will be a more time-consuming task than
we originally thought. It'd have to be a writer that takes another writer
as a parameter.

I think a string converter can be done as a simple transform.

I think we should start with a simple string converter and plan for a
format-specific writer.

What are your thoughts?

Thanks,

Jesse

On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <[email protected]>
wrote:

I was thinking about what the outputs would look like last night. I
realized that more complex formats like JSON and XML may or may not output
the data in a valid format.

Doing a direct conversion on unbounded collections would work just fine.
They're self-contained. For writing out bounded collections, that's where
we'll hit the issues. This changes the uber conversion transform into a
transform that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get
this:
{
"key": "value"
}, {
"key": "value"
},

That isn't valid JSON.

The conversion transform would need to do several things when writing out a
file. It would need to add brackets for an array. Now we have:
[
{
"key": "value"
}, {
"key": "value"
},
]

We still don't have valid JSON. We have to remove the last comma, or have
the uber transform start putting in the commas, except for the last
element.

[
{
"key": "value"
}, {
"key": "value"
}
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root
element for everything. The uber transform would have to put the root
element tags at the beginning and end of the file.
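(To make the comma handling concrete: the trick is to insert the separator
between elements rather than after each one, e.g. with a hypothetical helper
like

static String toJsonArray(Iterable<String> jsonObjects) {
  // String.join places the comma only between elements, so there is no trailing comma.
  return "[" + String.join(",", jsonObjects) + "]";
}

but, as noted above, that only works once the writer can see the whole
bounded collection, or at least a whole shard.)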

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang <[email protected]>
wrote:

I would love to see a lean core and abundant Transforms at the same time.

Maybe we can look at what Confluent <https://github.com/confluentinc> does
for kafka-connect. They have official extension support for JDBC, HDFS,
and Elasticsearch under https://github.com/confluentinc. They put them
along with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.

Although we are not a commercial company, can we have a GitHub user like
beam-community to host projects we build around Beam but that are not
suitable for https://github.com/apache/incubator-beam? In the future, we
may have beam-algebra like http://github.com/twitter/algebird for algebra
operations and beam-ml / beam-dl for machine learning / deep learning.
Also, there will be Beam-related projects elsewhere, maintained by other
communities. We can put all of them on the Beam website or do something
like the Spark packages site, as mentioned by Amit.

My $0.02
Manu



On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles <[email protected]>
wrote:

On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, importing one artifact per tiny transform is too much
overhead. It also seems unlikely that we will have enough commonality among
the transforms to call the artifact anything other than [some synonym for]
"miscellaneous".

I wouldn't want to take this too far - even though the SDK has many
transforms* that are not required for the model [1], I like that the SDK
artifact has everything a user might need in their "getting started" phase
of use. This user-friendliness (the user doesn't care that ParDo is core
and Sum is not), plus the difficulty of judging which transforms go where,
are probably why we have them mostly all in one place.

Models to look at, off the top of my head, include Pig's PiggyBank and
Apex's Malhar. These have different levels of support implied. Others?

Kenn

[1] ApproximateQuantiles, ApproximateUnique, Count, Distinct,
Filter,
FlatMapElements, Keys, Latest, MapElements, Max, Mean, Min,
Values,
KvSwap,
Partition, Regex, Sample, Sum, Top, Values, WithKeys,
WithTimestamps

* at least they are separate classes and not methods on PCollection :-)


On Wed, Nov 9, 2016 at 6:03 AM, Ismaël Mejía <[email protected]>
wrote:

​Nice discussion, and thanks Jesse for bringing this subject
back.

I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the SDK, but that we all end up
re-writing somehow.

This is a needed improvement to be more developer friendly, but also a
reference of good practices for Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository, at least for visibility reasons.

One additional question is whether these transforms represent a different
DSL, or whether they could be grouped with the current extensions (e.g.
Join and SortValues) into something more general that we as a community
could maintain. But even if that is not the case, it would be really nice
to start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré <[email protected]>
wrote:

Related to spark-packages, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this; I'm not sure it makes sense
directly in the core.

It reminds me of the "Integration" DSL we discussed in the technical vision
document.

Regards
JB


On 11/09/2016 11:17 AM, Amit Sela wrote:

I think Jesse has a very good point on one hand, while on the other hand
Luke's and Kenneth's worries about committing users to specific
implementations are well placed.

The Spark community has a 3rd-party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users' quick
ramp-up and ease of use, while keeping Beam more "enabling"?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles <[email protected]>
wrote:

It seems useful for small-scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but "Dump.toJson()"
would be for humans to read - so it should be pretty-printed, not treated
as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc., is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via(<lambda>) and also easy to use with Beam.
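For instance, the per-element toString-style conversion is already a
one-liner (the same shape as the WordCount snippet quoted at the bottom of
this thread; counts is assumed to be a PCollection<KV<String, Long>>):

PCollection<String> formatted = counts.apply(
    MapElements.via((KV<String, Long> kv) -> kv.getKey() + ": " + kv.getValue())
        .withOutputType(TypeDescriptors.strings()));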

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik <[email protected]>
wrote:

The suggestions you give seem good except for the XML cases.

Might want to have the XML be a document per line, similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <[email protected]>
wrote:

@lukasz Agreed there would have to be KV handling. I was more thinking that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general-purpose enough, or if you would just end up writing your own code
to handle it anyway.

Here are some ideas on what it could look like, with a method and the
resulting string output:

*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:
<rootelement key=value />

With Iterables:
<rootelement>
  <item>one</item>
  <item>two</item>
  <item>three</item>
</rootelement>

*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?
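If it helps to picture it, toDelimited could probably be little more than a
type check plus a join (a sketch only, with hypothetical names; KV is
org.apache.beam.sdk.values.KV):

static String toDelimited(Object element, String delimiter) {
  // KVs become "key<delim>value"; Iterables are joined element by element.
  if (element instanceof KV) {
    KV<?, ?> kv = (KV<?, ?>) element;
    return kv.getKey() + delimiter + kv.getValue();
  }
  if (element instanceof Iterable) {
    StringBuilder sb = new StringBuilder();
    for (Object item : (Iterable<?>) element) {
      if (sb.length() > 0) {
        sb.append(delimiter);
      }
      sb.append(item);
    }
    return sb.toString();
  }
  return String.valueOf(element);
}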

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik <[email protected]>
wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... aren't also supported.

Also, the example that you provide is using the fact that the input format
is an Iterable<Item>. You had posted a question about using KV with
TextIO.Write, which wouldn't align with the proposed input format and would
still require writing a type conversion function, this time from KV to
Iterable<Item> instead of KV to String.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <[email protected]>
wrote:

Lukasz,

I don't think you'd need complicated logic for TextIO.Write. For CSV the
call would look like:
Stringify.to("", ",", "\n");

Where the arguments would be Stringify.to(prefix, delimiter, suffix).

The code would be something like:
StringBuilder buffer = new StringBuilder(prefix);

// Append the delimiter between elements only, so there is no trailing delimiter.
Iterator<Item> items = list.iterator();
while (items.hasNext()) {
  buffer.append(items.next().toString());

  if (items.hasNext()) {
    buffer.append(delimiter);
  }
}

buffer.append(suffix);

c.output(buffer.toString());

That would allow you to do the basic CSV, TSV, and other formats without
complicated logic. The same sort of thing could be done for TextIO.Write.


Thanks,

Jesse

On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik <[email protected]>
wrote:

The conversion from object to string will have uses outside of just
TextIO.Write, so it seems logical that we would want to have a ParDo do the
conversion.

Text file formats have a lot of variance, even if you consider only the
subset of CSV-like formats, which can have fixed-width fields, escaping and
quoting around other fields, or headers that should be placed at the top.

Having all these format conversions within TextIO.Write seems like a lot of
logic to contain in that transform, which should just focus on writing to
files.

On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson <[email protected]>
wrote:

This is a thread moved over from the user mailing list.

I think there needs to be a way to convert a PCollection<KV> to a
PCollection<String>.

To do a minimal WordCount, you have to manually convert the KV to a String:

        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
                .apply(MapElements.via((KV<String, Long> count) ->
                            count.getKey() + ":" + count.getValue()
                        ).withOutputType(TypeDescriptors.strings()))
                .apply(TextIO.Write.to("output/stringcounts"));

This code really should be something like:

        p
                .apply(TextIO.Read.from("playing_cards.tsv"))
                .apply(Regex.split("\\W+"))
                .apply(Count.perElement())
                .apply(ToString.stringify())
                .apply(TextIO.Write.to("output/stringcounts"));

To summarize the discussion:

   - JA: Add a method to StringDelegateCoder to output any KV or list
   - JA and DH: Add a SimpleFunction that takes a type and runs toString()
   on it:
   class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
       public String apply(InputT input) {
           return input.toString();
       }
   }
   - JB: Add a general-purpose type converter like in Apache Camel.
   - JA: Add Object support to TextIO.Write that would write out the
   toString of any Object.

My thoughts:

Is converting to a PCollection<String> mostly needed when you're using
TextIO.Write? Will a general-purpose transform only work in certain cases,
so that you'll normally have to write custom code to format the strings the
way you want them?

IMHO, it's yes to both. I'd prefer to add Object support to TextIO.Write or
a SimpleFunction that takes a delimiter as an argument. Making a
SimpleFunction that can specify a delimiter (and perhaps a prefix and
suffix) should cover the majority of formats and cases.

Thanks,

Jesse








--
Jean-Baptiste Onofré
[email protected]
http://blog.nanthrax.net
Talend - http://www.talend.com





