Re: Towards a spec for robust streaming SQL, Part 2

2017-08-01 Thread Tyler Akidau
Thank you all for the comments/input; I appreciate the time you've put into
this. I've responded to a handful of the major ones. There are some more
I'd like to respond to, but I'm out of time for tonight, so more tomorrow.

-Tyler

On Tue, Aug 1, 2017 at 12:24 PM Julian Hyde  wrote:

> I have problems with a couple of the axioms: that a SQL object is
> either a table or a stream, but not both; and that a query is bounded
> if and only if it contains no unbounded streams.
>
> I don't have problems with other axioms, such as that a query is either
> bounded or unbounded. And I haven't looked in detail at triggering
> semantics; I don't think there will be major issues, but let's clear
> up the 2 problems above first.
>
> I have added a section "Julian’s thoughts on the fundamentals" to the
> end of the document.
>
> Julian
>
>
> On Tue, Aug 1, 2017 at 6:40 AM, Fabian Hueske  wrote:
> > As promised, I went over the document and made some comments.
> > I also added a bit of information about the current SQL support in Flink
> > and its internals.
> >
> > Thanks, Fabian
> >
> > 2017-07-30 13:22 GMT+02:00 Shaoxuan Wang :
> >
> >> Hi Tyler,
> >> Thanks for putting all the efforts into a doc. It is really well written
> >> and organized.
> >> I like most of it. The major concern I have is about the "explicit
> >> trigger". I left a few comments towards this and would like to know what
> >> the others think about it.
> >>
> >> Regards,
> >> Shaoxuan
> >>
> >> On Sun, Jul 30, 2017 at 4:43 PM, Fabian Hueske 
> wrote:
> >>
> >> > Thanks for the great write up!
> >> >
> >> > I think this is a very good starting point for a detailed discussion
> about
> >> > features, syntax and semantics of streaming SQL.
> >> > I'll comment on the document in the next days and describe Flink's
> >> current
> >> > status, our approaches (or planned approaches) and ask a couple of
> >> > questions.
> >> >
> >> > Thanks, Fabian
> >> >
> >> > 2017-07-28 3:05 GMT+02:00 Julian Hyde :
> >> >
> >> > > Tyler,
> >> > >
> >> > > Thanks for this. I am reading the document thoroughly and will give
> my
> >> > > feedback in a day or two.
> >> > >
> >> > > Julian
> >> > >
> >> > > > On Jul 25, 2017, at 12:54 PM, Pramod Immaneni <
> >> pra...@datatorrent.com>
> >> > > wrote:
> >> > > >
> >> > > > Thanks for the invitation Tyler. I am sure folks who worked on the
> >> > > calcite
> >> > > > integration and others would be interested.
> >> > > >
> >> > > > On Tue, Jul 25, 2017 at 12:12 PM, Tyler Akidau
> >> > > 
> >> > > > wrote:
> >> > > >
> >> > > >> +d...@apex.apache.org, since I'm told Apex has a Calcite
> integration
> >> > as
> >> > > >> well. If anyone on the Apex side wants to join in on the fun,
> your
> >> > input
> >> > > >> would be welcomed!
> >> > > >>
> >> > > >> -Tyler
> >> > > >>
> >> > > >>
> >> > > >> On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau  >
> >> > > wrote:
> >> > > >>
> >> > > >>> Hello Flink, Calcite, and Beam dev lists!
> >> > > >>>
> >> > > >>> Linked below is the second document I promised way back in April
> >> > > >> regarding
> >> > > >>> a collaborative spec for streaming SQL in Beam/Calcite/Flink (&
> >> > > apologies
> >> > > >>> for the delay; I thought I was nearly done a while back and then
> >> > > temporal
> >> > > >>> joins expanded to something much larger than expected).
> >> > > >>>
> >> > > >>> To repeat what it says in the doc, my hope is that it can serve
> >> > various
> >> > > >>>   purposes over its lifetime:
> >> > > >>>
> >> > > >>>   -
> >> > > >>>   - A discussion ground for ironing out any remaining features
> >> > > necessary
> >> > > >>>   for supporting robust streaming semantics in Calcite SQL.
> >> > > >>>
> >> > > >>>   - A rough, high-level source of truth for tracking efforts
> >> underway
> >> > > in
> >> > > >>>   support of this, currently spanning the Calcite, Flink, and
> Beam
> >> > > >> projects.
> >> > > >>>
> >> > > >>>   - A written specification of the changes that were made, for
> the
> >> > sake
> >> > > >>>   of understanding the delta after the fact.
> >> > > >>>
> >> > > >>> The first and third points are, IMO, the most important. AFAIK,
> >> there
> >> > > are
> >> > > >>> a few features missing still that need to be defined (e.g.,
> >> triggers
> >> > > >>> equivalents via EMIT, robust temporal join support). I'm also
> >> > > proposing a
> >> > > >>> clear distinction of streams and tables, which I think is
> >> important,
> >> > > but
> >> > > >>> which I believe is not the approach most folks have been taking
> in
> >> > this
> >> > > >>> area. Sorting out these open issues and then having a concise
> >> record
> >> > of
> >> > > >> the
> >> > > >>> solutions adopted will be important for providing a solid
> streaming
> >> > > >>> experience and teaching folks how to use it.
> >> > > >>>
> >> > > >>> At any rate, I would much appreciate it if anyone with an
> interest
> >> in
> >> > > >> this
> >> > > >>> stuff could please take a look and add comments/suggestions/
> >> > 

Jenkins build is unstable: beam_SeedJob #372

2017-08-01 Thread Apache Jenkins Server
See 




Re: [DISCUSS] Beam MapReduce Runner One-Pager

2017-08-01 Thread Jean-Baptiste Onofré

Hi Pei,

thanks for the update. I'm starting to play with your branch,
comparing it with and overwriting my own one.


I will keep you posted.

Regards
JB

On 08/01/2017 02:01 PM, Pei HE wrote:

I prototyped a simple MapReduce runner that can execute
Read.Bounded+ParDo+Combine:
https://github.com/peihe/incubator-beam/tree/mr-runner

I am still working on View support, and will give another update once I can
get WordCount running.

On Sat, Jul 15, 2017 at 4:45 AM, Vikas RK  wrote:


Thanks Pei, left a few comments, but this looks exciting!

-Vikas

On 12 July 2017 at 21:52, Jean-Baptiste Onofré  wrote:


Hi,

I will push my branch with the current state of the mapreduce runner.

Regards
JB


On 07/13/2017 04:47 AM, Pei HE wrote:


Thanks guys!

I replied to Kenn's comments, and am looking forward to more feedback and
suggestions.

Also, could we add a mapreduce-runner branch?

Thanks
--
Pei


On Sat, Jul 8, 2017 at 12:42 AM, Kenneth Knowles 


wrote:

Very cool to see this. Commenting a little on the doc.


On Fri, Jul 7, 2017 at 8:41 AM, Jean-Baptiste Onofré 
wrote:

Hi Pei,


I also borrowed some ideas and part of the code from Crunch for the
MapReduce runner.

I will push my changes on my github branch and share with you.

Let me take a look on your doc.

Regards
JB


On 07/07/2017 03:11 PM, Pei HE wrote:

Hi all,

While JB is working on MapReduce Runner BEAM-165, I have spent time reading
Apache Crunch code and drafted Beam MapReduce Runner One-Pager (mostly
around ParDo/Flatten fusion support, and with many missing details).

I would like to start the discussion, and get people's attention on
supporting MapReduce in Beam.

Feel free to make comments and suggestions on that doc.

Thanks
--
Pei


--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Build failed in Jenkins: beam_SeedJob #371

2017-08-01 Thread Apache Jenkins Server
See 

--
GitHub pull request #3668 of commit d61ca8f8845c4290e3e88dd9da2bd94605ab141b, 
no merge conflicts.
Setting status of d61ca8f8845c4290e3e88dd9da2bd94605ab141b to PENDING with url 
https://builds.apache.org/job/beam_SeedJob/371/ and message: 'Build started 
sha1 is merged.'
Using context: Jenkins: Seed Job
[EnvInject] - Loading node environment variables.
Building remotely on beam6 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/beam.git # timeout=10
Fetching upstream changes from https://github.com/apache/beam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/beam.git 
 > +refs/heads/*:refs/remotes/origin/* 
 > +refs/pull/3668/*:refs/remotes/origin/pr/3668/*
 > git rev-parse refs/remotes/origin/pr/3668/merge^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/pr/3668/merge^{commit} # timeout=10
Checking out Revision c883ba2a88e2a7784c66489fc50e757cac871afb 
(refs/remotes/origin/pr/3668/merge)
Commit message: "Merge d61ca8f8845c4290e3e88dd9da2bd94605ab141b into 
53c2c8fdecd82c77db7a7a87ac10d812e93d924b"
 > git config core.sparsecheckout # timeout=10
 > git checkout -f c883ba2a88e2a7784c66489fc50e757cac871afb
First time build. Skipping changelog.
Cleaning workspace
 > git rev-parse --verify HEAD # timeout=10
Resetting working tree
 > git reset --hard # timeout=10
 > git clean -fdx # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
Processing DSL script job_beam_ListView_Create.groovy
Warning: (job_beam_ListView_Create.groovy, line 38) plugin 'extra-columns' 
needs to be installed
ERROR: Type of view "Beam" does not match existing type, view type can not be 
changed


Re: Requiring PTransform to set a coder on its resulting collections

2017-08-01 Thread Reuven Lax
One interesting wrinkle: I'm about to propose a set of semantics for
snapshotting/in-place updating pipelines. Part of this proposal is the
ability to write pipelines to "upgrade" snapshots to make them compatible
with new graphs. This relies on the ability to have two separate coders for
the same type - the old coder and the new coder - in order to handle the
case where the user has changed coders in the new pipeline.

On Tue, Aug 1, 2017 at 2:12 PM, Robert Bradshaw  wrote:

> There are two concerns in this thread:
>
> (1) Getting rid of PCollection.setCoder(). Everyone seems in favor of this
> (right?)
>
> (2) Deprecating specifying Coders in favor of specifying TypeDescriptors.
> I'm generally in favor, but it's unclear how far we can push this through.
>
> Let's at least do (1), and separately state a preference for (2), seeing
> how far we can push it.
>
> On Thu, Jul 27, 2017 at 9:13 PM, Kenneth Knowles 
> wrote:
>
> > Another thought on this: setting a custom coder to support a special data
> > distribution is likely often a property of the input to the pipeline. So
> > setting a coder during pipeline construction - more generally, when
> writing
> > a composite transform for reuse - you might not actually have the needed
> > information. But setting up a special indicator type descriptor lets your
> > users map that type descriptor to a coder that works well for their data.
> >
> > But Robert's example of RawUnionValue seems like a deal breaker for all
> > approaches. It really requires .getCoder() during expand() and explicitly
> > building coders encoding information that is cumbersome to get into a
> > TypeDescriptor. While making up new type languages is a comfortable
> > activity for me :-) I don't think we should head down that path, for our
> > users' sake. So I'll stop hoping we can eliminate this pain point for
> now.
> >
> > Kenn
> >
> > On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles  wrote:
> >
> > > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh  >
> > > wrote:
> > >
> > >> introduce a
> > >> new, specialized type to represent the restricted
> > >> (alternatively-distributed?) data. The TypeDescriptor for this type
> can
> > >> map
> > >> to the specialized coder, without having to perform a significant
> degree
> > >> of
> > >> potentially wasted encoding work, plus it includes the assumptions
> that
> > >> are
> > >> being made about the distribution of data.
> > >>
> > >
> > > This is a very cool idea, in theory :-)
> > >
> > > For complex types with a few allocations involved and/or nontrivial
> > > deserialization, or when a pipeline does a lot of real work, I think
> the
> > > wrapper cost won't be perceptible.
> > >
> > > But  for more primitive types in pipelines that don't really do much
> > > computation but just move data around, I think it could matter.
> Certainly
> > > there are languages with constructs to allow type wrappers at zero cost
> > > (Haskell's `newtype`).
> > >
> > > This is all just speculation until we measure, like most of this
> thread.
> > >
> > > Kenn
> > >
> > >
> > >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh
> >  > >> >
> > >> > wrote:
> > >> >
> > >> > > +1 on getting rid of setCoder; just from a Java SDK perspective,
> my
> > >> ideal
> > >> > > world contains PCollections which don't have a user-visible way to
> > >> mutate
> > >> > > them.
> > >> > >
> > >> > > My preference would be to use TypeDescriptors everywhere within
> > >> Pipeline
> > >> > > construction (where possible), and utilize the CoderRegistry
> > >> everywhere
> > >> > to
> > >> > > actually extract the appropriate type. The unfortunate difficulty
> of
> > >> > having
> > >> > > to encode a union type and the lack of variable-length generics
> does
> > >> > > complicate that. We could consider some way of constructing coders
> > in
> > >> the
> > >> > > registry from a collection of type descriptors (which should be
> > >> > accessible
> > >> > > from the point the union-type is being constructed), e.g.
> something
> > >> like
> > >> > > `getCoder(TypeDescriptor output, TypeDescriptor... components)` -
> > that
> > >> > does
> > >> > > only permit a single flat level (but since this is being invoked
> by
> > >> the
> > >> > SDK
> > >> > > during construction it could also pass Coder...).
> > >> > >
> > >> > >
> > >> > >
> > >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <
> > >> > > rober...@google.com.invalid> wrote:
> > >> > >
> > >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
> > >> > > >  wrote:
> > >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> > >> >  > >> > > >
> > >> > > > > wrote:
> > >> > > > >>
> > >> > > > >> Ken/Robert, I believe users will want the ability to set the
> > >> output
> > >> > > > coder
> > >> > > > >> because coders may have intrinsic properties where the type
> > isn't
> > >> > > enough
> > >> > > > >> information to fully specify what I want as a user. Some
> cases
> > I
> > >> can
> > >> > > see
> > >> > > > 

Re: Requiring PTransform to set a coder on its resulting collections

2017-08-01 Thread Robert Bradshaw
There are two concerns in this thread:

(1) Getting rid of PCollection.setCoder(). Everyone seems in favor of this
(right?)

(2) Deprecating specifying Coders in favor of specifying TypeDescriptors.
I'm generally in favor, but it's unclear how far we can push this through.

Let's at least do (1), and separately state a preference for (2), seeing
how far we can push it.
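
To make (2) a bit more concrete, here is a minimal sketch of leaning on the
CoderRegistry plus an indicator type instead of setCoder(), along the lines of
the common-prefix case Lukasz mentions further down the thread. PrefixedString,
PrefixedStringCoder, and the "user:" prefix are made-up names for illustration
only; the registry call is the registerCoderForType method on CoderRegistry.

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.CustomCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.TypeDescriptor;

/** Hypothetical indicator type: strings known to share the prefix "user:". */
class PrefixedString {
  final String value;
  PrefixedString(String value) { this.value = value; }
}

/** Hypothetical coder that skips the shared prefix, so fewer bytes are encoded. */
class PrefixedStringCoder extends CustomCoder<PrefixedString> {
  private static final String PREFIX = "user:";

  @Override
  public void encode(PrefixedString s, OutputStream out) throws IOException {
    // Only the suffix is written; the prefix is implied by the type.
    StringUtf8Coder.of().encode(s.value.substring(PREFIX.length()), out);
  }

  @Override
  public PrefixedString decode(InputStream in) throws IOException {
    return new PrefixedString(PREFIX + StringUtf8Coder.of().decode(in));
  }

  @Override
  public void verifyDeterministic() {
    // Delegates to UTF-8 string encoding, which is deterministic.
  }
}

public class CoderRegistryExample {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
    Pipeline p = Pipeline.create(options);
    // Registered once at construction time; transforms that produce
    // PCollection<PrefixedString> then pick up this coder from the registry,
    // with no setCoder() calls on individual PCollections.
    p.getCoderRegistry().registerCoderForType(
        TypeDescriptor.of(PrefixedString.class), new PrefixedStringCoder());
  }
}

Whether this style covers cases like RawUnionValue, where the coder needs
per-instance component coders, is exactly the open question discussed below.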

On Thu, Jul 27, 2017 at 9:13 PM, Kenneth Knowles 
wrote:

> Another thought on this: setting a custom coder to support a special data
> distribution is likely often a property of the input to the pipeline. So
> setting a coder during pipeline construction - more generally, when writing
> a composite transform for reuse - you might not actually have the needed
> information. But setting up a special indicator type descriptor lets your
> users map that type descriptor to a coder that works well for their data.
>
> But Robert's example of RawUnionValue seems like a deal breaker for all
> approaches. It really requires .getCoder() during expand() and explicitly
> building coders encoding information that is cumbersome to get into a
> TypeDescriptor. While making up new type languages is a comfortable
> activity for me :-) I don't think we should head down that path, for our
> users' sake. So I'll stop hoping we can eliminate this pain point for now.
>
> Kenn
>
> On Thu, Jul 27, 2017 at 8:48 PM, Kenneth Knowles  wrote:
>
> > On Thu, Jul 27, 2017 at 11:18 AM, Thomas Groh 
> > wrote:
> >
> >> introduce a
> >> new, specialized type to represent the restricted
> >> (alternatively-distributed?) data. The TypeDescriptor for this type can
> >> map
> >> to the specialized coder, without having to perform a significant degree
> >> of
> >> potentially wasted encoding work, plus it includes the assumptions that
> >> are
> >> being made about the distribution of data.
> >>
> >
> > This is a very cool idea, in theory :-)
> >
> > For complex types with a few allocations involved and/or nontrivial
> > deserialization, or when a pipeline does a lot of real work, I think the
> > wrapper cost won't be perceptible.
> >
> > But  for more primitive types in pipelines that don't really do much
> > computation but just move data around, I think it could matter. Certainly
> > there are languages with constructs to allow type wrappers at zero cost
> > (Haskell's `newtype`).
> >
> > This is all just speculation until we measure, like most of this thread.
> >
> > Kenn
> >
> >
> >> > On Thu, Jul 27, 2017 at 11:00 AM, Thomas Groh
>  >> >
> >> > wrote:
> >> >
> >> > > +1 on getting rid of setCoder; just from a Java SDK perspective, my
> >> ideal
> >> > > world contains PCollections which don't have a user-visible way to
> >> mutate
> >> > > them.
> >> > >
> >> > > My preference would be to use TypeDescriptors everywhere within
> >> Pipeline
> >> > > construction (where possible), and utilize the CoderRegistry
> >> everywhere
> >> > to
> >> > > actually extract the appropriate type. The unfortunate difficulty of
> >> > having
> >> > > to encode a union type and the lack of variable-length generics does
> >> > > complicate that. We could consider some way of constructing coders
> in
> >> the
> >> > > registry from a collection of type descriptors (which should be
> >> > accessible
> >> > > from the point the union-type is being constructed), e.g. something
> >> like
> >> > > `getCoder(TypeDescriptor output, TypeDescriptor... components)` -
> that
> >> > does
> >> > > only permit a single flat level (but since this is being invoked by
> >> the
> >> > SDK
> >> > > during construction it could also pass Coder...).
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Jul 27, 2017 at 10:22 AM, Robert Bradshaw <
> >> > > rober...@google.com.invalid> wrote:
> >> > >
> >> > > > On Thu, Jul 27, 2017 at 10:04 AM, Kenneth Knowles
> >> > > >  wrote:
> >> > > > > On Thu, Jul 27, 2017 at 2:22 AM, Lukasz Cwik
> >> >  >> > > >
> >> > > > > wrote:
> >> > > > >>
> >> > > > >> Ken/Robert, I believe users will want the ability to set the
> >> output
> >> > > > coder
> >> > > > >> because coders may have intrinsic properties where the type
> isn't
> >> > > enough
> >> > > > >> information to fully specify what I want as a user. Some cases
> I
> >> can
> >> > > see
> >> > > > >> are:
> >> > > > >> 1) I have a cheap and fast non-deterministic coder but a
> >> different
> >> > > > slower
> >> > > > >> coder when I want to use it as the key to a GBK. For example,
> >> with a
> >> > > set
> >> > > > >> coder, it would need to consistently order the values of the
> set
> >> > when
> >> > > > used
> >> > > > >> as the key.
> >> > > > >> 2) I know a property of the data which allows me to have a
> >> cheaper
> >> > > > >> encoding. Imagine I know that all the strings have a common
> >> prefix
> >> > or
> >> > > > >> integers that are in a certain range, or that a matrix is
> >> > > sparse/dense.
> >> > > > Not
> >> > > > >> all PCollections of strings / integers / matrices in the
> pipeline
> >> > will
> >> > > > 

Re: Proposed API for a Whole File IO

2017-08-01 Thread Robert Bradshaw
On Tue, Aug 1, 2017 at 1:42 PM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:

> Hi,
> As mentioned on the PR - I support the creation of such an IO (both read
> and write) with the caveats that Reuven mentioned; we can refine the naming
> during code review.
> Note that you won't be able to create a PCollection because
> elements of a PCollection must have a coder and it's not possible to
> provide a coder for InputStream.


Well, it's possible, but the fact that InputStream is mutable may cause
issues (e.g. if there's fusion, or when estimating its size).

I would probably let the API consume/produce a PCollection>. Alternatively, a FileWrapper object of some kind could provide
accessors to InputStream (or otherwise facilitate lazy reading).

Note for the sink one must take care there's no race in case multiple
workers are attempting to process the same bundle (and ideally cleanup in
the face of failure). Other than that, these could be entirely done in the
context of a DoFn.

Also note that most filesystems, especially distributed ones, do better
reading and writing fewer, larger files than many, many small ones.
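
For illustration, here is a minimal DoFn-only sketch of the read side (not the
proposed WholeFileIO API): it expands a glob element and emits (filename,
contents) pairs, which also keeps the filename around. It assumes UTF-8 text
files that are small enough to hold in memory.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.MatchResult;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Sketch: expands a glob element and emits one (filename, contents) pair per file. */
class ReadWholeFilesFn extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    MatchResult match = FileSystems.match(Collections.singletonList(c.element())).get(0);
    for (MatchResult.Metadata metadata : match.metadata()) {
      try (InputStream in =
          Channels.newInputStream(FileSystems.open(metadata.resourceId()))) {
        // Slurp the whole file into memory; fine for small files only.
        ByteArrayOutputStream contents = new ByteArrayOutputStream();
        byte[] buffer = new byte[64 * 1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
          contents.write(buffer, 0, n);
        }
        c.output(KV.of(metadata.resourceId().toString(),
            new String(contents.toByteArray(), StandardCharsets.UTF_8)));
      }
    }
  }
}

Applied as p.apply(Create.of("/path/to/input/dir/*")).apply(ParDo.of(new
ReadWholeFilesFn())); note that a single glob element gives no parallelism
across files, so a real IO would want to split the match and the read into
separate steps.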


> On Tue, Aug 1, 2017 at 1:33 PM Reuven Lax 
> wrote:
>
> > One thing to keep in mind is that many runners might have issues with
> huge
> > elements. If you have a 5gb file, encoding it as a single element might
> > give you pain or might simply not work on some runners.
> >
> > Reuven
> >
> > On Tue, Aug 1, 2017 at 1:22 PM, Chris Hebert <
> > chris.hebert-...@digitalreasoning.com> wrote:
> >
> > > Hi,
> > >
> > > I'd like to:
> > >
> > >1. Read whole files as one input each. (If my input files are
> hi.txt,
> > >what.txt, and yes.txt, then the whole contents of hi.txt are an
> > element
> > > of
> > >the returned PCollection, the whole contents of what.txt are the
> next
> > >element, etc.)
> > >2. Write elements as individual files. (Rather than smashing
> thousands
> > >of outputs into a handful of files as TextIO does:
> > > output-0-of-5,
> > >output-1-of-5,..., I want to output each thing individually.
> > So
> > > if
> > >I'm given hi.txt, what.txt, yes.txt then I'd like to read those in
> as
> > > whole
> > >files individually, then write out my processed results as
> > > hi.txt-modified,
> > >what.txt-modified, yes.txt-modified.).
> > >
> > > Before reading on, if you have easier ways to do these things, then I'd
> > > love to hear them!
> > >
> > > # Part 1
> > >
> > > I attempted Part 1 with this PR:
> > > https://github.com/apache/beam/pull/3543
> > >
> > > But I overgeneralized.
> > >
> > > Specifically, I want:
> > >
> > > Pipeline p = Pipeline.create(options);
> > > PCollection wholeFilesAsStrings = p.apply("Read Whole Files
> from
> > > Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
> > >
> > > or
> > >
> > > Pipeline p = Pipeline.create(options);
> > > PCollection wholeFileStreams = p.apply("Read Whole Files
> > from
> > > Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
> > >
> > > Bonus points would include:
> > >
> > >- Keeping the filename somehow
> > >
> > >
> > > # Part 2
> > >
> > > Currently, if you output hundreds of thousands of elements (batch-mode)
> > > with TextIO in Beam on Flink with, say, 45 TaskManagers, then you get
> > > output-0-of-00045, output-1-of-00045, etc., and each one of
> those
> > > files contain tens of thousands of outputs back to back. I want them to
> > be
> > > output as individual files.
> > >
> > > If appended after code snippets from Part 1, it would look like:
> > >
> > > ...
> > > p.apply("Write Whole File Outputs",
> > > WholeFileIO.write().to("/path/to/output/dir/"));
> > >
> > > Bonus points would include:
> > >
> > >- Writing each element of the given PCollection to the filename
> they'd
> > >like to go to.
> > >- Parallelizable. (This might already be done, I just noticed that
> my
> > >Beam+Flink+YARN pipeline with TextIO.write() only had one
> TaskManager
> > >writing the DataSink output even though all other components of my
> > > pipeline
> > >had many TaskManagers working on them simultaneously. I haven't
> found
> > > the
> > >way to fix that yet. The current arrangement added 15 minutes to the
> > > end of
> > >my pipeline as the lonely TaskManager did all the output.)
> > >
> > >
> > > I'm available to put the dev work into this. (Actually, I'm putting dev
> > > time into some kind of solution whether this is agreed upon or not :).
> > >
> > > Feedback, please,
> > > Chris
> > >
> >
>


Re: Proposed API for a Whole File IO

2017-08-01 Thread Eugene Kirpichov
Hi,
As mentioned on the PR - I support the creation of such an IO (both read
and write) with the caveats that Reuven mentioned; we can refine the naming
during code review.
Note that you won't be able to create a PCollection because
elements of a PCollection must have a coder and it's not possible to
provide a coder for InputStream.

On Tue, Aug 1, 2017 at 1:33 PM Reuven Lax  wrote:

> One thing to keep in mind is that many runners might have issues with huge
> elements. If you have a 5gb file, encoding it as a single element might
> give you pain or might simply not work on some runners.
>
> Reuven
>
> On Tue, Aug 1, 2017 at 1:22 PM, Chris Hebert <
> chris.hebert-...@digitalreasoning.com> wrote:
>
> > Hi,
> >
> > I'd like to:
> >
> >1. Read whole files as one input each. (If my input files are hi.txt,
> >what.txt, and yes.txt, then the whole contents of hi.txt are an
> element
> > of
> >the returned PCollection, the whole contents of what.txt are the next
> >element, etc.)
> >2. Write elements as individual files. (Rather than smashing thousands
> >of outputs into a handful of files as TextIO does:
> > output-0-of-5,
> >output-1-of-5,..., I want to output each thing individually.
> So
> > if
> >I'm given hi.txt, what.txt, yes.txt then I'd like to read those in as
> > whole
> >files individually, then write out my processed results as
> > hi.txt-modified,
> >what.txt-modified, yes.txt-modified.).
> >
> > Before reading on, if you have easier ways to do these things, then I'd
> > love to hear them!
> >
> > # Part 1
> >
> > I attempted Part 1 with this PR:
> > https://github.com/apache/beam/pull/3543
> >
> > But I overgeneralized.
> >
> > Specifically, I want:
> >
> > Pipeline p = Pipeline.create(options);
> > PCollection wholeFilesAsStrings = p.apply("Read Whole Files from
> > Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
> >
> > or
> >
> > Pipeline p = Pipeline.create(options);
> > PCollection wholeFileStreams = p.apply("Read Whole Files
> from
> > Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
> >
> > Bonus points would include:
> >
> >- Keeping the filename somehow
> >
> >
> > # Part 2
> >
> > Currently, if you output hundreds of thousands of elements (batch-mode)
> > with TextIO in Beam on Flink with, say, 45 TaskManagers, then you get
> > output-0-of-00045, output-1-of-00045, etc., and each one of those
> > files contain tens of thousands of outputs back to back. I want them to
> be
> > output as individual files.
> >
> > If appended after code snippets from Part 1, it would look like:
> >
> > ...
> > p.apply("Write Whole File Outputs",
> > WholeFileIO.write().to("/path/to/output/dir/"));
> >
> > Bonus points would include:
> >
> >- Writing each element of the given PCollection to the filename they'd
> >like to go to.
> >- Parallelizable. (This might already be done, I just noticed that my
> >Beam+Flink+YARN pipeline with TextIO.write() only had one TaskManager
> >writing the DataSink output even though all other components of my
> > pipeline
> >had many TaskManagers working on them simultaneously. I haven't found
> > the
> >way to fix that yet. The current arrangement added 15 minutes to the
> > end of
> >my pipeline as the lonely TaskManager did all the output.)
> >
> >
> > I'm available to put the dev work into this. (Actually, I'm putting dev
> > time into some kind of solution whether this is agreed upon or not :).
> >
> > Feedback, please,
> > Chris
> >
>


Re: Proposed API for a Whole File IO

2017-08-01 Thread Reuven Lax
One thing to keep in mind is that many runners might have issues with huge
elements. If you have a 5gb file, encoding it as a single element might
give you pain or might simply not work on some runners.

Reuven

On Tue, Aug 1, 2017 at 1:22 PM, Chris Hebert <
chris.hebert-...@digitalreasoning.com> wrote:

> Hi,
>
> I'd like to:
>
>1. Read whole files as one input each. (If my input files are hi.txt,
>what.txt, and yes.txt, then the whole contents of hi.txt are an element
> of
>the returned PCollection, the whole contents of what.txt are the next
>element, etc.)
>2. Write elements as individual files. (Rather than smashing thousands
>of outputs into a handful of files as TextIO does:
> output-0-of-5,
>output-1-of-5,..., I want to output each thing individually. So
> if
>I'm given hi.txt, what.txt, yes.txt then I'd like to read those in as
> whole
>files individually, then write out my processed results as
> hi.txt-modified,
>what.txt-modified, yes.txt-modified.).
>
> Before reading on, if you have easier ways to do these things, then I'd
> love to hear them!
>
> # Part 1
>
> I attempted Part 1 with this PR:
> https://github.com/apache/beam/pull/3543
>
> But I overgeneralized.
>
> Specifically, I want:
>
> Pipeline p = Pipeline.create(options);
> PCollection wholeFilesAsStrings = p.apply("Read Whole Files from
> Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
>
> or
>
> Pipeline p = Pipeline.create(options);
> PCollection wholeFileStreams = p.apply("Read Whole Files from
> Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));
>
> Bonus points would include:
>
>- Keeping the filename somehow
>
>
> # Part 2
>
> Currently, if you output hundreds of thousands of elements (batch-mode)
> with TextIO in Beam on Flink with, say, 45 TaskManagers, then you get
> output-0-of-00045, output-1-of-00045, etc., and each one of those
> files contain tens of thousands of outputs back to back. I want them to be
> output as individual files.
>
> If appended after code snippets from Part 1, it would look like:
>
> ...
> p.apply("Write Whole File Outputs",
> WholeFileIO.write().to("/path/to/output/dir/"));
>
> Bonus points would include:
>
>- Writing each element of the given PCollection to the filename they'd
>like to go to.
>- Parallelizable. (This might already be done, I just noticed that my
>Beam+Flink+YARN pipeline with TextIO.write() only had one TaskManager
>writing the DataSink output even though all other components of my
> pipeline
>had many TaskManagers working on them simultaneously. I haven't found
> the
>way to fix that yet. The current arrangement added 15 minutes to the
> end of
>my pipeline as the lonely TaskManager did all the output.)
>
>
> I'm available to put the dev work into this. (Actually, I'm putting dev
> time into some kind of solution whether this is agreed upon or not :).
>
> Feedback, please,
> Chris
>


Proposed API for a Whole File IO

2017-08-01 Thread Chris Hebert
Hi,

I'd like to:

   1. Read whole files as one input each. (If my input files are hi.txt,
   what.txt, and yes.txt, then the whole contents of hi.txt are an element of
   the returned PCollection, the whole contents of what.txt are the next
   element, etc.)
   2. Write elements as individual files. (Rather than smashing thousands
   of outputs into a handful of files as TextIO does: output-0-of-5,
   output-1-of-5,..., I want to output each thing individually. So if
   I'm given hi.txt, what.txt, yes.txt then I'd like to read those in as whole
   files individually, then write out my processed results as hi.txt-modified,
   what.txt-modified, yes.txt-modified.).

Before reading on, if you have easier ways to do these things, then I'd
love to hear them!

# Part 1

I attempted Part 1 with this PR:
https://github.com/apache/beam/pull/3543

But I overgeneralized.

Specifically, I want:

Pipeline p = Pipeline.create(options);
PCollection<String> wholeFilesAsStrings = p.apply("Read Whole Files from
Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));

or

Pipeline p = Pipeline.create(options);
PCollection<InputStream> wholeFileStreams = p.apply("Read Whole Files from
Input Directory", WholeFileIO.read().from("/path/to/input/dir/*"));

Bonus points would include:

   - Keeping the filename somehow


# Part 2

Currently, if you output hundreds of thousands of elements (batch-mode)
with TextIO in Beam on Flink with, say, 45 TaskManagers, then you get
output-0-of-00045, output-1-of-00045, etc., and each one of those
files contains tens of thousands of outputs back to back. I want them to be
output as individual files.

If appended after code snippets from Part 1, it would look like:

...
p.apply("Write Whole File Outputs",
WholeFileIO.write().to("/path/to/output/dir/"));

Bonus points would include:

   - Writing each element of the given PCollection to the filename they'd
   like to go to.
   - Parallelizable. (This might already be done; I just noticed that my
   Beam+Flink+YARN pipeline with TextIO.write() only had one TaskManager
   writing the DataSink output even though all other components of my pipeline
   had many TaskManagers working on them simultaneously. I haven't found the
   way to fix that yet. The current arrangement added 15 minutes to the end of
   my pipeline as the lonely TaskManager did all the output.)
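
As a rough sketch of what the per-element write could look like without a new
IO, assuming the elements carry their target filename (e.g. KV pairs of a bare
name like hi.txt plus the file contents), and ignoring the temp-file-and-rename
dance a robust sink needs for retried or speculatively executed bundles:

import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.io.FileSystems;
import org.apache.beam.sdk.io.fs.ResolveOptions.StandardResolveOptions;
import org.apache.beam.sdk.io.fs.ResourceId;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Sketch: writes each (filename, contents) element to its own output file. */
class WriteWholeFilesFn extends DoFn<KV<String, String>, Void> {
  private final String outputDir;

  WriteWholeFilesFn(String outputDir) {
    this.outputDir = outputDir;
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    ResourceId dir = FileSystems.matchNewResource(outputDir, true /* isDirectory */);
    ResourceId file = dir.resolve(
        c.element().getKey() + "-modified", StandardResolveOptions.RESOLVE_FILE);
    ByteBuffer bytes =
        ByteBuffer.wrap(c.element().getValue().getBytes(StandardCharsets.UTF_8));
    // A retried bundle may write the same file twice; a robust version would
    // write to a unique temporary location and rename on success.
    try (WritableByteChannel out = FileSystems.create(file, "text/plain")) {
      while (bytes.hasRemaining()) {
        out.write(bytes);
      }
    }
  }
}

Since each element is written independently, this parallelizes across whatever
workers the ParDo runs on, which is also the behavior wished for in the last
bullet above.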


I'm available to put the dev work into this. (Actually, I'm putting dev
time into some kind of solution whether this is agreed upon or not :).

Feedback, please,
Chris


Re: Towards a spec for robust streaming SQL, Part 2

2017-08-01 Thread Julian Hyde
I have problems with a couple of the axioms: that a SQL object is
either a table or a stream, but not both; and that a query is bounded
if and only if it contains no unbounded streams.

I don't have problems with other axioms, such as that a query is either
bounded or unbounded. And I haven't looked in detail at triggering
semantics; I don't think there will be major issues, but let's clear
up the 2 problems above first.

I have added a section "Julian’s thoughts on the fundamentals" to the
end of the document.

Julian


On Tue, Aug 1, 2017 at 6:40 AM, Fabian Hueske  wrote:
> As promised, I went over the document and made some comments.
> I also added a bit of information about the current SQL support in Flink
> and its internals.
>
> Thanks, Fabian
>
> 2017-07-30 13:22 GMT+02:00 Shaoxuan Wang :
>
>> Hi Tyler,
>> Thanks for putting all the efforts into a doc. It is really well written
>> and organized.
>> I like most of it. The major concern I have is about the "explicit
>> trigger". I left a few comments towards this and would like to know what
>> the others think about it.
>>
>> Regards,
>> Shaoxuan
>>
>> On Sun, Jul 30, 2017 at 4:43 PM, Fabian Hueske  wrote:
>>
>> > Thanks for the great write up!
>> >
>> > I think this is a very good starting point for a detailed discussion about
>> > features, syntax and semantics of streaming SQL.
>> > I'll comment on the document in the next days and describe Flink's
>> current
>> > status, our approaches (or planned approaches) and ask a couple of
>> > questions.
>> >
>> > Thanks, Fabian
>> >
>> > 2017-07-28 3:05 GMT+02:00 Julian Hyde :
>> >
>> > > Tyler,
>> > >
>> > > Thanks for this. I am reading the document thoroughly and will give my
>> > > feedback in a day or two.
>> > >
>> > > Julian
>> > >
>> > > > On Jul 25, 2017, at 12:54 PM, Pramod Immaneni <
>> pra...@datatorrent.com>
>> > > wrote:
>> > > >
>> > > > Thanks for the invitation Tyler. I am sure folks who worked on the
>> > > calcite
>> > > > integration and others would be interested.
>> > > >
>> > > > On Tue, Jul 25, 2017 at 12:12 PM, Tyler Akidau
>> > > 
>> > > > wrote:
>> > > >
>> > > >> +d...@apex.apache.org, since I'm told Apex has a Calcite integration
>> > as
>> > > >> well. If anyone on the Apex side wants to join in on the fun, your
>> > input
>> > > >> would be welcomed!
>> > > >>
>> > > >> -Tyler
>> > > >>
>> > > >>
>> > > >> On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau 
>> > > wrote:
>> > > >>
>> > > >>> Hello Flink, Calcite, and Beam dev lists!
>> > > >>>
>> > > >>> Linked below is the second document I promised way back in April
>> > > >> regarding
>> > > >>> a collaborative spec for streaming SQL in Beam/Calcite/Flink (&
>> > > apologies
>> > > >>> for the delay; I thought I was nearly done a while back and then
>> > > temporal
>> > > >>> joins expanded to something much larger than expected).
>> > > >>>
>> > > >>> To repeat what it says in the doc, my hope is that it can serve
>> > various
>> > > >>> purposes over its lifetime:
>> > > >>>
>> > > >>>   -
>> > > >>>   - A discussion ground for ironing out any remaining features
>> > > necessary
>> > > >>>   for supporting robust streaming semantics in Calcite SQL.
>> > > >>>
>> > > >>>   - A rough, high-level source of truth for tracking efforts
>> underway
>> > > in
>> > > >>>   support of this, currently spanning the Calcite, Flink, and Beam
>> > > >> projects.
>> > > >>>
>> > > >>>   - A written specification of the changes that were made, for the
>> > sake
>> > > >>>   of understanding the delta after the fact.
>> > > >>>
>> > > >>> The first and third points are, IMO, the most important. AFAIK,
>> there
>> > > are
>> > > >>> a few features missing still that need to be defined (e.g.,
>> triggers
>> > > >>> equivalents via EMIT, robust temporal join support). I'm also
>> > > proposing a
>> > > >>> clear distinction of streams and tables, which I think is
>> important,
>> > > but
>> > > >>> which I believe is not the approach most folks have been taking in
>> > this
>> > > >>> area. Sorting out these open issues and then having a concise
>> record
>> > of
>> > > >> the
>> > > >>> solutions adopted will be important for providing a solid streaming
>> > > >>> experience and teaching folks how to use it.
>> > > >>>
>> > > >>> At any rate, I would much appreciate it if anyone with an interest
>> in
>> > > >> this
>> > > >>> stuff could please take a look and add comments/suggestions/
>> > references
>> > > >> to
>> > > >>> related work in flight/etc as appropriate. For now please use
>> > > >>> comments/suggestions, but if you really want to dive in with edit
>> > > access,
>> > > >>> let me know.
>> > > >>>
>> > > >>> The doc: http://s.apache.org/streaming-sql-spec
>> > > >>>
>> > > >>> -Tyler
>> > > >>>
>> > > >>>
>> > > >>>
>> > > >>
>> > >
>> > >
>> >
>>


Re: Towards a spec for robust streaming SQL, Part 2

2017-08-01 Thread Fabian Hueske
As promised, I went over the document and made some comments.
I also added a bit of information about the current SQL support in Flink
and its internals.

Thanks, Fabian

2017-07-30 13:22 GMT+02:00 Shaoxuan Wang :

> Hi Tyler,
> Thanks for putting all the efforts into a doc. It is really well written
> and organized.
> I like most of it. The major concern I have is about the "explicit
> trigger". I left a few comments towards this and would like to know what
> the others think about it.
>
> Regards,
> Shaoxuan
>
> On Sun, Jul 30, 2017 at 4:43 PM, Fabian Hueske  wrote:
>
> > Thanks for the great write up!
> >
> > I think this is a very good starting point for a detailed discussion about
> > features, syntax and semantics of streaming SQL.
> > I'll comment on the document in the next days and describe Flink's
> current
> > status, our approaches (or planned approaches) and ask a couple of
> > questions.
> >
> > Thanks, Fabian
> >
> > 2017-07-28 3:05 GMT+02:00 Julian Hyde :
> >
> > > Tyler,
> > >
> > > Thanks for this. I am reading the document thoroughly and will give my
> > > feedback in a day or two.
> > >
> > > Julian
> > >
> > > > On Jul 25, 2017, at 12:54 PM, Pramod Immaneni <
> pra...@datatorrent.com>
> > > wrote:
> > > >
> > > > Thanks for the invitation Tyler. I am sure folks who worked on the
> > > calcite
> > > > integration and others would be interested.
> > > >
> > > > On Tue, Jul 25, 2017 at 12:12 PM, Tyler Akidau
> > > 
> > > > wrote:
> > > >
> > > >> +d...@apex.apache.org, since I'm told Apex has a Calcite integration
> > as
> > > >> well. If anyone on the Apex side wants to join in on the fun, your
> > input
> > > >> would be welcomed!
> > > >>
> > > >> -Tyler
> > > >>
> > > >>
> > > >> On Mon, Jul 24, 2017 at 4:34 PM Tyler Akidau 
> > > wrote:
> > > >>
> > > >>> Hello Flink, Calcite, and Beam dev lists!
> > > >>>
> > > >>> Linked below is the second document I promised way back in April
> > > >> regarding
> > > >>> a collaborative spec for streaming SQL in Beam/Calcite/Flink (&
> > > apologies
> > > >>> for the delay; I thought I was nearly done a while back and then
> > > temporal
> > > >>> joins expanded to something much larger than expected).
> > > >>>
> > > >>> To repeat what it says in the doc, my hope is that it can serve
> > various
> > > >>>   purposes over its lifetime:
> > > >>>
> > > >>>   -
> > > >>>   - A discussion ground for ironing out any remaining features
> > > necessary
> > > >>>   for supporting robust streaming semantics in Calcite SQL.
> > > >>>
> > > >>>   - A rough, high-level source of truth for tracking efforts
> underway
> > > in
> > > >>>   support of this, currently spanning the Calcite, Flink, and Beam
> > > >> projects.
> > > >>>
> > > >>>   - A written specification of the changes that were made, for the
> > sake
> > > >>>   of understanding the delta after the fact.
> > > >>>
> > > >>> The first and third points are, IMO, the most important. AFAIK,
> there
> > > are
> > > >>> a few features missing still that need to be defined (e.g.,
> triggers
> > > >>> equivalents via EMIT, robust temporal join support). I'm also
> > > proposing a
> > > >>> clear distinction of streams and tables, which I think is
> important,
> > > but
> > > >>> which I believe is not the approach most folks have been taking in
> > this
> > > >>> area. Sorting out these open issues and then having a concise
> record
> > of
> > > >> the
> > > >>> solutions adopted will be important for providing a solid streaming
> > > >>> experience and teaching folks how to use it.
> > > >>>
> > > >>> At any rate, I would much appreciate it if anyone with an interest
> in
> > > >> this
> > > >>> stuff could please take a look and add comments/suggestions/
> > references
> > > >> to
> > > >>> related work in flight/etc as appropriate. For now please use
> > > >>> comments/suggestions, but if you really want to dive in with edit
> > > access,
> > > >>> let me know.
> > > >>>
> > > >>> The doc: http://s.apache.org/streaming-sql-spec
> > > >>>
> > > >>> -Tyler
> > > >>>
> > > >>>
> > > >>>
> > > >>
> > >
> > >
> >
>


Re: [DISCUSS] Beam MapReduce Runner One-Pager

2017-08-01 Thread Pei HE
I prototyped a simple MapReduce runner that can execute
Read.Bounded+ParDo+Combine:
https://github.com/peihe/incubator-beam/tree/mr-runner

I am still working on View support, and will give another update once I can
get WordCount running.

On Sat, Jul 15, 2017 at 4:45 AM, Vikas RK  wrote:

> Thanks Pei, left a few comments, but this looks exciting!
>
> -Vikas
>
> On 12 July 2017 at 21:52, Jean-Baptiste Onofré  wrote:
>
> > Hi,
> >
> > I will push my branch with the current state of the mapreduce runner.
> >
> > Regards
> > JB
> >
> >
> > On 07/13/2017 04:47 AM, Pei HE wrote:
> >
> >> Thanks guys!
> >>
> >> I replied to Kenn's comments, and am looking forward to more feedback and
> >> suggestions.
> >>
> >> Also, could we add a mapreduce-runner branch?
> >>
> >> Thanks
> >> --
> >> Pei
> >>
> >>
> >> On Sat, Jul 8, 2017 at 12:42 AM, Kenneth Knowles  >
> >> wrote:
> >>
> >> Very cool to see this. Commenting a little on the doc.
> >>>
> >>> On Fri, Jul 7, 2017 at 8:41 AM, Jean-Baptiste Onofré 
> >>> wrote:
> >>>
> >>> Hi Pei,
> 
>  I also borrowed some ideas and part of the code from Crunch for the
> MapReduce
>  runner.
> 
>  I will push my changes on my github branch and share with you.
> 
>  Let me take a look on your doc.
> 
>  Regards
>  JB
> 
> 
>  On 07/07/2017 03:11 PM, Pei HE wrote:
> 
>  Hi all,
> > While JB is working on MapReduce Runner BEAM-165, I have spent time reading
> > Apache Crunch code and drafted Beam MapReduce Runner One-Pager
>  > ZG1MU-F47sWg8N6xkBM0/edit#heading=h.bewnehqnt4zd>
> > (mostly around ParDo/Flatten fusion support, and with many missing details).
> >
> > I would like to start the discussion, and get people's attention on
> > supporting MapReduce in Beam.
> >
> > Feel free to make comments and suggestions on that doc.
> >
> > Thanks
> > --
> > Pei
> >
> >
> > --
>  Jean-Baptiste Onofré
>  jbono...@apache.org
>  http://blog.nanthrax.net
>  Talend - http://www.talend.com
> 
> 
> >>>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Build failed in Jenkins: beam_Release_NightlySnapshot #494

2017-08-01 Thread Apache Jenkins Server
See 


Changes:

[tgroh] Programmatically Create Beam Jenkins View.

[jbonofre] [BEAM-2530] Upgrade maven plugins to the latest versions

--
[...truncated 317.54 KB...]